Given a string s and an array of smaller strings T, how do I write a method that searches s for each small string in T?

Given a string s and an array of smaller strings T, design a method that searches s for each small string in T.

Thanks.

+2




7 replies


Assuming you have a significant number of small strings, Rabin-Karp is the standard way to search for multiple small strings in one very large string. If you only have a few small strings, simply running Boyer-Moore once for each of them may be the better alternative.
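To make the Rabin-Karp suggestion concrete, here is a minimal sketch, not a tuned implementation. It assumes the patterns are grouped by length so that one rolling hash window serves each group; the class name `RabinKarpMulti`, the method `findAll`, and the constants `BASE` and `MOD` are illustrative choices, not part of the answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class RabinKarpMulti {
    // Hypothetical hash parameters chosen for illustration only.
    private static final long BASE = 257;
    private static final long MOD = 1_000_000_007L;

    /** Returns, for each pattern that occurs in s, the list of its start indexes. */
    static Map<String, List<Integer>> findAll(String s, List<String> patterns) {
        // length -> (hash -> patterns with that length and hash)
        Map<Integer, Map<Long, Set<String>>> byLength = new HashMap<>();
        for (String p : patterns) {
            if (p.isEmpty() || p.length() > s.length()) continue;
            byLength.computeIfAbsent(p.length(), k -> new HashMap<>())
                    .computeIfAbsent(hash(p, p.length()), k -> new HashSet<>())
                    .add(p);
        }
        Map<String, List<Integer>> hits = new TreeMap<>();
        for (Map.Entry<Integer, Map<Long, Set<String>>> group : byLength.entrySet()) {
            int m = group.getKey();
            long high = 1; // BASE^(m-1) mod MOD, the weight of the outgoing character
            for (int i = 1; i < m; i++) high = high * BASE % MOD;
            long h = hash(s, m); // hash of the first window s[0..m)
            for (int i = 0; ; i++) {
                Set<String> candidates = group.getValue().get(h);
                if (candidates != null) {
                    for (String p : candidates) {
                        // Hashes can collide, so confirm with a real comparison.
                        if (s.regionMatches(i, p, 0, m)) {
                            hits.computeIfAbsent(p, k -> new ArrayList<>()).add(i);
                        }
                    }
                }
                if (i + m >= s.length()) break; // no further window
                // Slide the window: drop s[i], append s[i + m].
                h = (h - s.charAt(i) * high % MOD + MOD) % MOD;
                h = (h * BASE + s.charAt(i + m)) % MOD;
            }
        }
        return hits;
    }

    private static long hash(String text, int length) {
        long h = 0;
        for (int i = 0; i < length; i++) h = (h * BASE + text.charAt(i)) % MOD;
        return h;
    }
}
```

For example, `RabinKarpMulti.findAll("abracadabra", List.of("abra", "cad", "xyz"))` reports "abra" at indexes 0 and 7 and "cad" at index 4, with no entry for "xyz". The cost is one pass over s per distinct pattern length, so this pays off mainly when many patterns share a few lengths.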



+8




The fastest way I know of to solve this problem is the Aho-Corasick algorithm. For a large string and a large number of patterns to search for, it is faster than running a linear-time single-pattern search (e.g. KMP, Rabin-Karp, Boyer-Moore) once per pattern.
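As a rough illustration of why Aho-Corasick needs only one pass over the text, here is a compact sketch of the textbook construction: a trie of the patterns plus failure links computed breadth-first. The class name `AhoCorasick` and its `search` method are hypothetical names for this example, and the per-node `HashMap` favors brevity over speed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

final class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                      // longest proper suffix that is also a trie path
        final List<String> output = new ArrayList<>();  // patterns that end at this node
    }

    private final Node root = new Node();

    AhoCorasick(Iterable<String> patterns) {
        // 1. Build a trie of all non-empty patterns.
        for (String p : patterns) {
            if (p.isEmpty()) continue;
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            if (!node.output.contains(p)) node.output.add(p);
        }
        // 2. Set failure links breadth-first, so shallower nodes are finished first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(e.getKey())) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                // Inherit the shorter patterns that end at the failure target.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    /** Returns pattern -> start indexes of every occurrence in text. */
    Map<String, List<Integer>> search(String text) {
        Map<String, List<Integer>> hits = new TreeMap<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some node accepts c (or we are back at the root).
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            node = node.next.getOrDefault(c, root);
            for (String p : node.output) {
                hits.computeIfAbsent(p, k -> new ArrayList<>()).add(i - p.length() + 1);
            }
        }
        return hits;
    }
}
```

With the patterns "he", "she", "his" and "hers", searching "ushers" reports "she" at index 1 and both "he" and "hers" at index 2. The automaton is built once and can then be reused for any number of texts.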



But are you sure you need something like this, and that your strings are too long for a simple string matching method?

+1




It sounds like a simple loop:

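for (String t : T)
{
    // contains() tests whether t occurs anywhere inside s;
    // equals() would only match when t is the whole of s.
    if (s.contains(t)) {
        /* do stuff with t */
    }
}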

      

From: How to use the for-each loop

0




You cannot choose the "best" algorithm without knowing the details of the dataset.

  • Are these statistically random strings?
  • Are there many or few repetitions among the small strings?
  • Do you want to optimize for execution speed or for low memory consumption?
  • Will you perform this search multiple times with the same substrings (T) or with the same base string?

Without this information, the "best" solution is the simplest.

static IEnumerable<string> FindIn(this IEnumerable<string> T, string s) {
    return T.Where(t => s.Contains(t));
}

      

0




Could you please clarify the situation?

**The algorithm depends HUGELY on what you mean by "search".**

  • Do you want to find out whether every string in T is a substring of s? Or whether any of them is?

  • Do you need a yes/no answer or the indexes?

  • Do you care whether the matches overlap? (E.g. "ABCDE" contains both "ABC" and "CDE", but ONLY if you don't care about overlap.)

Simplest method (assuming the search strings mostly start with different characters):

  • Build a "first character" map: first_character => map_of_first_2_characters__to__list_of_strings.

  • Move through each position in s and look up the character at that position as a key in the map above.

    • The value will be another map, mapping 2-character strings to a list of search strings starting with those two characters.

    • Look up the character and its right neighbor in the submap; the value will be a list of search strings starting with those two characters.

    • Assuming a fairly even distribution of starting characters in T, and that T is not too large (if it is, just build the data structure one level deeper by matching 3 characters), we now have a very short list of plausible matches starting at the current position. String-compare them all against s at the current position and keep those (if any) that match. If the goal is not to find ALL matches for ALL strings, remove the strings you have already matched from the data structure.
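The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: every search string is assumed to have at least two characters (shorter ones are skipped), all matches are reported, and the names `PrefixIndexSearch` and `findAll` are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixIndexSearch {
    /** Returns pattern -> start indexes in s. Assumes each pattern has at least 2 characters. */
    static Map<String, List<Integer>> findAll(String s, List<String> patterns) {
        // first character -> (first two characters -> patterns with that prefix)
        Map<Character, Map<String, List<String>>> index = new HashMap<>();
        for (String p : new LinkedHashSet<>(patterns)) { // drop duplicate patterns
            if (p.length() < 2) continue; // outside this sketch's assumption
            index.computeIfAbsent(p.charAt(0), k -> new HashMap<>())
                 .computeIfAbsent(p.substring(0, 2), k -> new ArrayList<>())
                 .add(p);
        }
        Map<String, List<Integer>> hits = new TreeMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            // Level 1: the character at the current position.
            Map<String, List<String>> submap = index.get(s.charAt(i));
            if (submap == null) continue;
            // Level 2: the character plus its right neighbor.
            List<String> candidates = submap.get(s.substring(i, i + 2));
            if (candidates == null) continue;
            // Only this short list needs a full comparison.
            for (String p : candidates) {
                if (s.startsWith(p, i)) {
                    hits.computeIfAbsent(p, k -> new ArrayList<>()).add(i);
                }
            }
        }
        return hits;
    }
}
```

For s = "ABCDE" and the search strings "ABC", "CDE" and "BD", this reports "ABC" at index 0 and "CDE" at index 2, and nothing for "BD". Overlapping matches are included; filtering them out would be an extra step that depends on the answers to the questions above.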

You might want to read this for advanced stuff

0




Let's turn it into a Java solution

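// Returns true only if every string in t occurs somewhere inside s.
boolean isSubset(String[] t, String s) {
    for (String sample : t)
        if (!s.contains(sample))  // contains(), not equals(): we want substrings
            return false;
    return true;
}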

      

You can make it faster by following Falaina's suggestions, but do you really need to?

0




If you have room for a pointer table (pointer size * NumCharsInSource), you can sort all the suffixes of the source (one string starting at each character) using something like QSort. Then you can BSearch each of the smaller strings in the pointer table. Assuming N characters and M substrings, the sort takes O(N lg N) comparisons and the searches take O(M lg N) comparisons. Overall performance should be O((N + M) lg N).
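A minimal sketch of this pointer-table idea (usually called a suffix array) might look like the following. It uses a plain comparison sort, so the stated bounds count comparisons rather than character operations; `SuffixArraySearch` and `indexOf` are illustrative names, and patterns are assumed to be non-empty.

```java
import java.util.Arrays;

final class SuffixArraySearch {
    private final String source;
    private final Integer[] suffixes; // start offsets, ordered by the suffix that begins there

    SuffixArraySearch(String source) {
        this.source = source;
        suffixes = new Integer[source.length()];
        for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
        // O(N lg N) comparisons; each comparison may itself scan many characters.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    /** Returns the start index of some occurrence of pattern, or -1 if there is none. */
    int indexOf(String pattern) {
        int lo = 0, hi = suffixes.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(suffixes[mid], pattern);
            if (cmp == 0) return suffixes[mid];
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    // Lexicographic comparison of the suffixes starting at a and b, without copying.
    private int compareSuffixes(int a, int b) {
        int n = source.length();
        while (a < n && b < n) {
            int diff = source.charAt(a++) - source.charAt(b++);
            if (diff != 0) return diff;
        }
        return Integer.compare(n - a, n - b); // the exhausted (shorter) suffix sorts first
    }

    // Compares the suffix at start, truncated to the pattern's length, with the pattern.
    private int comparePrefix(int start, String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            if (start + i >= source.length()) return -1; // suffix ended before the pattern did
            int diff = source.charAt(start + i) - pattern.charAt(i);
            if (diff != 0) return diff;
        }
        return 0;
    }
}
```

For the source "banana", `indexOf("ana")` returns the start of one of the two occurrences (index 1 or 3, depending on where the binary search lands) and `indexOf("nab")` returns -1. Finding every occurrence would mean locating the whole range of matching suffixes instead of stopping at the first hit.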

However, there can be degenerate cases where the source is highly repetitive (e.g. a run of 100,000 "a" characters). This makes the comparisons in the sorting step very slow :-( To get around this, you can special-case long repeated runs, but it gets a lot more complicated.

The choice of algorithm really depends on your input data and the amount of free memory you have to work with.

0

