Count the unique appearance of a substring in a wordlist without knowing the substring?

Question

Count the unique appearance of a substring in a wordlist without knowing the substring?

* I am trying to count the unique occurrences of a substring within a wordlist * So check the wordlist and determine if any words have substrings based on minimum characters that occur multiple times and count them. I don't know any substrings.

This is a working solution where you know the substring, but what if you don't? Theres the minimum number of characters the words are based on.

Find all words where "Book" is a substring of the word. With bottom php function.

Desired result:

book count (5)
stor count (2)

0

substring php count preg-match preg-match-all

Rubytastic June 21. '12 at 9:18

source to share

2 answers

This is my first approximation: unfinished, untested, has at least 1 bug and is written in eiffel. I'm not going to do all the work for you.

deferred class
    SUBSTRING_COUNT
feature
    threshold : INTEGER_32 =5

    biggest_starting_substring_length(a,b:STRING):INTEGER_32
        deferred
    end

    biggest_starting_substring(a,b:STRING):STRING
    do
        Result := a.substring(0,biggest_starting_substring_length(a,b))
    end

    make_list_of_substrings(a,b:STRING)
    local
        index:INTEGER_32
        this_one: STRING
    do
        from
            a_index := b_index + 1
        invariant
            a_index >=0 and a_index <= a.count
        until
            a_index >= a.count
        loop
            this_one := biggest_starting_substring(a.substring (a_index, a.count-1),b)
            if this_one.count > threshold then
                list.extend (this_one)
            end
        variant
            a.count - a_index
        end
    end -- biggest_substring

    list : ARRAYED_LIST[STRING]

end

+1

ctrl-alt-delor June 21. 12 at 10:08

source to share

kontur · Accepted Answer · 2012-06-21T10:21:38+0000

For string 100

book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789...                                                     ... 100

your algorithm could be:

Explore substrings from different starting points and lengths of the substring. You take all substrings starting at 0 with lengths from 1 to 100, so: 0-1, 0-2, 0-3, ... and see if any of those substrings occur more than once in the total string ... Move through the string, starting at increasing positions, looking for all substrings starting at 1, i.e. 1-2, 1-3, 1-4, ... and so on until you reach 99-100.

Save a table of all substrings and their number of events and you can sort them.

You can optimize by specifying a minimum and maximum length, which will greatly reduce your search and accuracy. Also, once you find a substring, store them in an array of found substrings. If you come across a substring again, skip it. (i.e. the beats for book

that you have already counted, you shouldn't count again when you hit the next substring book

). Plus, you never have to search for lines longer than half of the total line.

In the example line, you can run an additional test for the uniquness of the line. You will have

o              x ..
oo             x  7
bo             x  7
ok             x  6 
book           x  5
booking        x  2
bookingservice x  1

excluding bites less than 3 (and more than half of the total text string), you will get

book           x  5
booking        x  2
bookingservice x  1

which is already quite plausible.

[edit] This will obviously scan the entire line, not just natural words.

[edit] I usually don't like writing code for the OP, but in this case I got a little curious:

$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";

function search($string, $min = 4, $max = 16, $threshhold = 2) {
    echo "<pre><br/>";
    echo "searching <em>'$string'</em> for string occurances ";
    echo "of length $min - $max: <br/>";

    $hits = array();
    $foundStrings = array();

    // no string longer than half of the total string will be found twice
    if ($max > strlen($string) / 2) {
        $max = strlen($string);
    }

    // examin substrings:
    // start from 0, 1, 2...
    for ($start = 0; $start < $max; $start++) {

        // and string length 1, 2, 3, ... $max
        for ($length = $min; $length < strlen($string); $length++) {

            // get the substring in question, 
            // but search for natural words (trim)
            $substring = trim(substr($string, $start, $length));

            // if substring was not counted yet, 
            // add the found count to the hits
            if (!in_array($substring, $foundStrings)) {
                preg_match_all("/$substring/i", $string, $matches);
                $hits[$substring] = count($matches[0]);
            }
        }
    }

    // sort the hits array desc by number of hits
    arsort($hits);

    // remove substring hits with hits less that threshhold
    foreach ($hits as $substring => $count) {
        if ($count < $threshhold) {
            unset($hits[$substring]);
        }
    }

    print_r($hits);
}

search($string);

?>

Comments and variable names should make the code explain itself. $ string will appear in the file to be read. This exmaple will output:

searching 'book bookshelf booking foobar bar booking selfservice 
bookingservice cooking' for string occurances of length 4 - 16: 
Array
(
    [ook] => 6
    [book] => 5
    [boo] => 5
    [bookin] => 3
    [booking] => 3
    [booki] => 3
    [elf] => 2
)

Let me know how you implement it :)

Count the unique appearance of a substring in a wordlist without knowing the substring?

More articles: