Find strings matching a list of substrings

This is an interview question.

I have a text file with web addresses, for example:

www.yahoo.com
www.google.com
www.apple.com
www.microsoft.com

      

and I have a list of substrings like oo, goog, app. How can I find all the strings that match one of the substrings? In this example, I would:

www.yahoo.com
www.google.com
www.apple.com

      

The interviewer didn’t like going in line and checking if any substring occurred in the string. Then I said that we could use a trie, but that would only be useful if the first character of the substring matched the first character in the string, which is similar to how the suggestion function works in Google.

thank

+3


source to share


1 answer


You can use regular expressions. For example, an expression oo|goog|app

would do this.

If you have a very large number of substrings and are going to search for a lot of text, you should use something like the Aho-Corasick string matching algorithm .



It is interesting to note that the brute force approach (using the standard string matching algorithm) and the Aho-Corasick algorithm outputs two matches for "www.google.com" ("oo" and "goog"), but the regex solution will only output one.

As for your comment on the appropriateness of the question, it may not be intended to get the "correct" answer, but rather to find out how you think about the problems. For example, using the standard string search algorithm will take time proportional to MxN, where M is the number of strings to search for and N is the number of substrings to find. The regex solution will be faster because you only need to run the regex once for each line you are looking for. The Aho-Corasick algorithm is even faster because its state machine finds all matches in one pass. The approach you use depends on many factors, including the number of strings and substrings you have, how often you should run this,and how long do you need to implement the solution. This is a good question to ask how you approach a complex problem and how you identify and evaluate potential solutions.

+2


source







All Articles