Finding a substring position in a larger string
I have a large string and a large number of smaller substrings, and I am trying to check if each substring exists in the larger string and get the position of each of those substrings.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
if each_sub_string in string:
print each_sub_string, string.index(each_sub_string)
The problem is that I have a large number of substrings (about a million), it takes about an hour of processing time. Is there a way to reduce this time, perhaps using regular expressions or some other way?
source to share
The best way to solve this is with a tree implementation. As Rishav mentioned, you are repeating a lot of work here. Ideally this should be implemented as a tree-based FSM. Consider the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
'c':
['a', {
'a': ['t']
}], {
's':
['at']
}
}
Sorry for the rough formatting, but I think it's helpful to go back to the python data structure right away. You can build a tree where the initials are the initials and they are matched against a list of possible final substrings that can be performed. If you click on something that is a list item and you have nothing more nested, you are in the sheet, and you know you are in the first instance of that substring.
Keeping this tree in memory is a little cumbersome, but if you only have a millionth row, this should be the most efficient implementation. You should also make sure you trim the tree when you find the first copy of the words.
For those of you with CS chops, or if you want to know more about this approach, this is a simplified version of the Aho-Corasick string matching algorithm .
If you are interested in learning more about these approaches, there are three main algorithms used in practice:
- Aho-Corasick (Base fgrep) [Worst case: O (m + n)]
- Commentz-Walter (Vanilla GNU grep base) [Worst case: O (mn)]
- Rabin-Karp (Used to detect plagiarism) [Worst case: O (mn)]
There are domains in which all of these algorithms will outperform others, but based on the fact that you have a very large number of substrings that you are looking for, and there is probably a lot of overlap between them. that Aho-Corasick will give you significantly better performance than the other two methods as it avoids the O(mn)
worst case scenario
There is also an excellent python library that implements an algorithm Aho-Corasick
found here that should keep you from writing generic implementation details.
source to share
Depending on the distribution of the lengths of your substrings, you can save a lot of time using preprocessing.
Say the set of lengths of your substrings from the set {23, 33, 45} (this means that you may have millions of substrings, but each of them is one of these three lengths).
Then, for each of those lengths, find the Rabin Window above your big string, and put the results into a dictionary for that length. That is, take 23. Step down the big line and find the 23-window hashes. Let's say the hash for position 0 is 13. So you insert into the dictionary rabin23
that 13 maps to [0]
. Then you see that for position 1 the hash is 13. Then in rabin23
update that 13 maps to [0, 1]
. Then at position 2 the hash is 4. So at rabin23
4 is mapped to [2].
Now, given the substring, you can compute your Rabin hand and immediately check the corresponding dictionary for the indices of its occurrence (which you need to compare).
By the way, in many cases the lengths of your substrings will exhibit Pareto behavior, where it is said that 90% of the strings are 10% of the length. If so, you can only do it for these lengths.
source to share
This approach is suboptimal compared to the other answers, but can be good enough and simple to implement. The idea is to turn the algorithm around so that instead of testing each substring in turn against the larger string, iterate over the larger string and test for possible matching substrings at each position, using a dictionary to narrow down the number of substrings you need. check.
The output will differ from the original in that it will be sorted in ascending order of the index, not by substring, but you can post-process the output to sort by substring if you like.
Create a dictionary containing a list of substrings starting with all possible 1-3 characters. Then we iterate over the string and read each character after it 1-3 characters and check the match in this position for every substring in the dictionary that starts with these 1-3 characters:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based the first 1-3 characters
dict = {}
for s in sub_strings:
if s[0:3] in dict:
dict[s[0:3]].append(s)
else:
dict[s[0:3]] = [s];
# iterate over the chars in string, testing words that match on first 1-3 chars
for i in range(0, len(string)):
for j in range(1,4):
char = string[i:i+j]
if char in dict:
for word in dict[char]:
if string[i:i+len(word)] == word:
print word, i
If you don't need to match any 1 or 2 character long substrings, you can get rid of the loop for j
and just assign the char withchar = string[i:3]
Using this second approach, I timed the algorithm by reading War and Peace in Tolstoy and dividing it into unique words, for example:
with open ("warandpeace.txt", "r") as textfile:
string=textfile.read().replace('\n', '')
sub_strings=list(set(string.split()))
It took 124 seconds to do a full search for each unique word in the text and display each instance of each word.
source to share