Count the unique appearance of a substring in a wordlist without knowing the substring?
* I am trying to count the unique occurrences of a substring within a wordlist * So check the wordlist and determine if any words have substrings based on minimum characters that occur multiple times and count them. I don't know any substrings.
This is a working solution where you know the substring, but what if you don't? Theres the minimum number of characters the words are based on.
Find all words where "Book" is a substring of the word. With bottom php function.
Desired result:
book count (5)
stor count (2)
source to share
For string 100
book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789... ... 100
your algorithm could be:
Explore substrings from different starting points and lengths of the substring. You take all substrings starting at 0 with lengths from 1 to 100, so: 0-1, 0-2, 0-3, ... and see if any of those substrings occur more than once in the total string ... Move through the string, starting at increasing positions, looking for all substrings starting at 1, i.e. 1-2, 1-3, 1-4, ... and so on until you reach 99-100.
Save a table of all substrings and their number of events and you can sort them.
You can optimize by specifying a minimum and maximum length, which will greatly reduce your search and accuracy. Also, once you find a substring, store them in an array of found substrings. If you come across a substring again, skip it. (i.e. the beats for book
that you have already counted, you shouldn't count again when you hit the next substring book
). Plus, you never have to search for lines longer than half of the total line.
In the example line, you can run an additional test for the uniquness of the line. You will have
o x ..
oo x 7
bo x 7
ok x 6
book x 5
booking x 2
bookingservice x 1
excluding bites less than 3 (and more than half of the total text string), you will get
book x 5
booking x 2
bookingservice x 1
which is already quite plausible.
[edit] This will obviously scan the entire line, not just natural words.
[edit] I usually don't like writing code for the OP, but in this case I got a little curious:
$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";
function search($string, $min = 4, $max = 16, $threshhold = 2) {
echo "<pre><br/>";
echo "searching <em>'$string'</em> for string occurances ";
echo "of length $min - $max: <br/>";
$hits = array();
$foundStrings = array();
// no string longer than half of the total string will be found twice
if ($max > strlen($string) / 2) {
$max = strlen($string);
}
// examin substrings:
// start from 0, 1, 2...
for ($start = 0; $start < $max; $start++) {
// and string length 1, 2, 3, ... $max
for ($length = $min; $length < strlen($string); $length++) {
// get the substring in question,
// but search for natural words (trim)
$substring = trim(substr($string, $start, $length));
// if substring was not counted yet,
// add the found count to the hits
if (!in_array($substring, $foundStrings)) {
preg_match_all("/$substring/i", $string, $matches);
$hits[$substring] = count($matches[0]);
}
}
}
// sort the hits array desc by number of hits
arsort($hits);
// remove substring hits with hits less that threshhold
foreach ($hits as $substring => $count) {
if ($count < $threshhold) {
unset($hits[$substring]);
}
}
print_r($hits);
}
search($string);
?>
Comments and variable names should make the code explain itself. $ string will appear in the file to be read. This exmaple will output:
searching 'book bookshelf booking foobar bar booking selfservice
bookingservice cooking' for string occurances of length 4 - 16:
Array
(
[ook] => 6
[book] => 5
[boo] => 5
[bookin] => 3
[booking] => 3
[booki] => 3
[elf] => 2
)
Let me know how you implement it :)
source to share
This is my first approximation: unfinished, untested, has at least 1 bug and is written in eiffel. I'm not going to do all the work for you.
deferred class
SUBSTRING_COUNT
feature
threshold : INTEGER_32 =5
biggest_starting_substring_length(a,b:STRING):INTEGER_32
deferred
end
biggest_starting_substring(a,b:STRING):STRING
do
Result := a.substring(0,biggest_starting_substring_length(a,b))
end
make_list_of_substrings(a,b:STRING)
local
index:INTEGER_32
this_one: STRING
do
from
a_index := b_index + 1
invariant
a_index >=0 and a_index <= a.count
until
a_index >= a.count
loop
this_one := biggest_starting_substring(a.substring (a_index, a.count-1),b)
if this_one.count > threshold then
list.extend (this_one)
end
variant
a.count - a_index
end
end -- biggest_substring
list : ARRAYED_LIST[STRING]
end
source to share