Efficient way to compare two strings (order of characters doesn't matter)

I am trying to find an algorithm to compare two strings. It should register a match for any pair of words containing the same letters. For example, rent and tern would be equivalent because they both contain the letters r, e, n, t.

EDIT I'm sorry for being so vague. The comparison will be made between two sets of several thousand words, hundreds of times. This is just a small part of the overall code, so I don't want it to bog everything down.

For those who asked about repeated letters: they do matter; for example, rent would still have to match a word containing extra copies of its letters.

EDIT 2 For a match like rent == ternicate: ternicate would not match rent the other way around. The idea is that word two must contain all the letters of word one, so extra letters in word two are fine as long as every letter of the first word is present.

+2




12 replies


Okay, this is a really bad idea, but it's just crazy enough that it might work!

  • Create a list of the first 26 primes.

    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, ...]
    
          

  • For each letter in the word, find the corresponding prime number. A β†’ 2, B β†’ 3, C β†’ 5, etc.

  • Multiply these prime numbers together. You end up with a (very large) number.

Words that have the same letters will have the same number. Words with different letters are guaranteed to have different numbers. Why is this?

Because the factorization of an integer into primes is unique, multiplying primes always yields a distinct product for each distinct multiset of letters. The product can be decomposed back into its prime factors, and the factors tell us exactly which letters were in the original word and how many times each occurred. The order of the letters is not preserved, but the letters themselves and their counts are.

For example, take the words "face" and "cafe".



FACE = 13 * 2 * 5 * 11 = 1430  
CAFE = 5 * 2 * 13 * 11 = 1430

      

Ha! What could be more efficient than a simple integer comparison?

...

Okay, no, maybe not. It's too funny to actually use it. It's neat though.
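Still, a quick sketch of the idea in Python (illustrative only; assumes lowercase a-z input):

```python
# First 26 primes, one per letter a-z.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def signature(word):
    # Multiply the prime for each letter; by unique factorization,
    # two words get the same product iff they contain the same
    # letters with the same multiplicities.
    n = 1
    for c in word:
        n *= PRIMES[ord(c) - ord('a')]
    return n
```

With this, signature('face') and signature('cafe') are both 1430, and anagrams always collide on purpose.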

+11




Just sort the characters of each string first, then compare them.



rent == tern
enrt == enrt

      
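In Python, for example, this sort-and-compare check is a one-liner (a sketch for the symmetric "same letters, same counts" interpretation):

```python
def same_letters(s1, s2):
    # Two strings are anagrams iff their sorted character
    # sequences are identical.
    return sorted(s1) == sorted(s2)
```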

+6




The key here, given the ambiguity in the question, is that there is no need to count how many times a letter appears, only whether it appears at all.

Therefore, assuming all letters are in the range a-z, and also assuming that you can index the original word lists as arrays using integer indices:

1. Create two arrays (one for each list).

2. For each word in both lists, compute a bitmap as follows:

bitmap = 0
foreach (character in word) {
    bitmap |= (1 << (character - 'a'))
}
arrayX[index] = bitmap;

This bitmap is the set of all letters that appear in that word.

3. Then, for each word in set A, iterate over set B and declare a match when

arrayA[indexA] | arrayB[indexB] == arrayB[indexB]

      

This test will only be true if the character set in this word A is a subset of the characters in word B. The "or" operation for bits is the equivalent of the union (βˆͺ) operator for real sets.

See the Wikipedia entry on subsets for the underlying math: A ⊆ B if and only if A ∪ B = B.

BTW, Step 3 is O(n^2), but it should still be very fast because it is just a bitwise comparison. Several thousand words in each list (~4M tests) should take well under a second.
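The three steps above might look like this in Python (a sketch, assuming lowercase a-z as in the answer):

```python
def bitmap(word):
    # One bit per letter: bit i is set iff chr(ord('a') + i) occurs.
    b = 0
    for c in word:
        b |= 1 << (ord(c) - ord('a'))
    return b

def letters_subset(a, b):
    # A's letters are a subset of B's iff union(A, B) == B,
    # i.e. OR-ing in B's bitmap changes nothing.
    return bitmap(a) | bitmap(b) == bitmap(b)
```

In the real code you would precompute both arrays of bitmaps once, so only the OR-and-compare runs in the inner loop.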

+4




One option is to count the occurrences of each character in each string and then compare the counts. A simple implementation takes O(max(N, A)) time, where N is the length of the longer string and A is the size of the array used to store the counts. For example, in Java:

public boolean equalIgnoringOrder(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    // Assuming characters in the range ASCII 0 to 127 
    int[] c1 = new int[128];
    int[] c2 = new int[128];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i)]++;
        c2[s2.charAt(i)]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}

      

There are several possible improvements. For example, you can deal with an arbitrary character set by performing range reduction: make an initial pass through s1 and s2 looking for the smallest and largest characters in each, and use those to determine the size of c1 and c2 and a base offset. This uses less space on average and reduces the time needed to initialize the count arrays. It also offers a short-circuit for the comparison, for example when the smallest and largest characters of s1 and s2 do not match.

By comparison, comparing strings sorted with heapsort or quicksort takes O(N log N) time on average with O(N) space, where N is the length of the longer string.

However, as @pst points out, the constants of proportionality can make an O(N log N) or even O(N*N) algorithm beat an O(N) algorithm if N is not large. In that case, the average length of the strings being compared is probably the most important factor.

The code above effectively performs a radix sort with several short-circuits (three, if you include the range-reduction short-circuit). So it ultimately comes down to whether quicksort/heapsort or radix sort is better, and that depends on the length of the input strings and their character ranges.


A different approach, following @John's answer, is to compute the product of prime numbers. If we do the computation with arbitrary-precision arithmetic, the resulting values are unique for each "equal ignoring order" class of strings. Unfortunately, the computation becomes O(N*N): each intermediate product has O(N) digits, and multiplying an N-digit number by a constant is O(N); do that for N characters and you get O(N*N).

But if we do the computation modulo 2^64 (say), the result is actually a very good hash that is insensitive to character order; e.g.

long hash = 1;
for (int i = 0; i < s.length(); i++) {
    hash = hash * primes[s.charAt(i)];
}

      

So, I would say that the algorithm that gives the best performance and space usage on average for comparing randomly generated strings is likely to be:

if (s1.length() != s2.length()) {
    return false;
}
if (hash(s1) != hash(s2)) { // computed as above
    return false;
}
// Compare using sorting or character counting as above.

      


One last point. Assuming the string references are not identical and the strings are of equal length, any correct algorithm for this equals predicate must be O(N) or worse: it has to examine every character of both strings to make the determination, and that takes O(N) operations.

Any algorithm that examines fewer than 2*N characters, or performs fewer than 2*N operations on the examined values, can be assumed to be incorrect in this scenario.
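A Python sketch of that layered strategy (length check, then the order-insensitive hash, then an exact count comparison; the prime table and the mod-2^64 reduction are assumptions mirroring the Java fragments above, restricted to lowercase a-z):

```python
from collections import Counter

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def order_insensitive_hash(s):
    # Product of per-letter primes, reduced mod 2**64
    # (the same effect as Java's overflowing long arithmetic).
    h = 1
    for c in s:
        h = (h * PRIMES[ord(c) - ord('a')]) % (1 << 64)
    return h

def equal_ignoring_order(s1, s2):
    if len(s1) != len(s2):
        return False                   # cheapest short-circuit
    if order_insensitive_hash(s1) != order_insensitive_hash(s2):
        return False                   # hash differs: definitely not equal
    return Counter(s1) == Counter(s2)  # exact character-count comparison
```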

+3




I have to agree with Stephen C: this isn't well enough defined for an answer.

I'm not going to downvote, but could you please explain, for example, whether rent is equivalent to a word with repeated letters such as rrent? You have answerers who believe it is (people who think the number of occurrences doesn't matter) and answerers who assume the opposite. One of these groups is wasting its time.

Also, since your concern is performance, we need to know more about your call pattern. Will you be examining the same pair of sets more than once, or do the sets change?

And just to be a terminology pedant: you may already know this, but as currently worded, your relation is not symmetric.

You say that rent will match ternicate, but obviously ternicate will not match rent. So you're not really looking for equivalence. You are looking for something like "is contained in" or "can be made from".

This means you need to take care with ordering: you will get different results depending on the order in which you visit your sets.

Don't get me wrong: this is an interesting problem ... I just don't know what the problem is.

+2




Maybe not the fastest, but probably the shortest solution, using Java + google-collections + Guava (converting char[] → List<Character>):

import com.google.common.collect.ImmutableMultiset;
import com.google.common.primitives.Chars;

public class EqualsOrderignore {
    private static boolean compareIgnoreOrder(final String s1, final String s2) {
        return ImmutableMultiset.copyOf(Chars.asList(s1.toCharArray()))
                .equals(ImmutableMultiset.copyOf(Chars.asList(s2.toCharArray())));
    }
}

      

The running time of this algorithm is O(s1.length + s2.length).

I am quite convinced that this solution will be on par with a hand-coded O(N1 + N2) solution on the server VM.

As a plus, this solution works for any characters, not just a-z.

+1




I have written a lot of code that works with dictionaries and anagrams. The usual approach is to convert the word to a sorted key, so that, as mentioned above, "rent" matches "tern" because both map to "enrt". However, once you start down this route it becomes really useful to have a dictionary of characters and occurrence counts. Here is some Python code that converts an unsorted string to a dictionary with (key = character, value = count):

import collections

# Create a defaultdict(int) from a string
def create_collections_dict(key):
    dk = collections.defaultdict(int)
    for k in key:
        dk[k] += 1
    return dk

      

Now you can score words against others by instantly seeing how many letters they have in common:

# Score the similarity of a defaultdict(int) against a string
# (which is temporarily converted to a defaultdict(int))
def score(dk, cand):
    dc = create_collections_dict(cand)
    return sum(min(dk[k], dc[k]) for k in dk.keys() if k in dc)

if __name__ == '__main__':
    base = create_collections_dict('rent')
    for word in ['tern', 'ternicate', 'foobar']:
        print(word, score(base, word))

      

Results:

tern 4
ternicate 4
foobar 1

      

+1




Assuming that:

  • your words consist only of ASCII characters,
  • the number of occurrences doesn't matter,
  • abc matches abcde, but abcde does not match abc,

you can go through the match string (s2) counting characters, then go through the candidate (s1) and check that all of its characters are present in the other; something like this (pseudocode, untested):

boolean matches(String s1, String s2) {
   int[]  counts = new int[256];
   char[] c1;
   char[] c2;

   c1 = s1.toCharArray();
   c2 = s2.toCharArray();

   // count char occurrences in the match string
   for (int n = 0; n < c2.length; n++) {
       counts[(int)c2[n]]++;
   }

   // check all chars in the candidate are found in the match string
   for (int n = 0; n < c1.length; n++) {
       if (0 == counts[(int)c1[n]]) {
          return false;
       }
   }

   return true;
}

      

This will be O(n) in the sum of the lengths of the arguments.

Edit: The question was changed to an asymmetric function between s1 and s2.

0




This is pretty vague, but I would use an associative array to solve it:

Use each letter of each word as a key into an associative array of integers. Letters of one word increment the values and letters of the other decrement them. Then at the end you can loop over all the keys and check that every value is zero; if so, the words match. This gives you the basic rent == tren case.

Caveats for the ambiguous cases:

1. If repeated letters don't matter, e.g. rent == rreenntt, then when adding a letter to the array, check whether the key already exists, and if it does, don't count it again.
2. If additional letters are okay, e.g. rent == renter but fernt != renter, then when checking the array values at the end, check that 1 and -1 are not both present. In other words, 1 and 0 only is okay, or -1 and 0 only is okay, but 1 and -1 cannot both be in the array at the same time.

I don't know how fast this would be relative to other approaches, but it would be easy to implement.
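For the basic symmetric case, the increment/decrement idea might be sketched like this in Python (illustrative; the caveats above would need extra checks):

```python
from collections import defaultdict

def matches(s1, s2):
    counts = defaultdict(int)
    for c in s1:
        counts[c] += 1   # letters of the first word increment
    for c in s2:
        counts[c] -= 1   # letters of the second word decrement
    # all counts zero => same letters with the same multiplicities
    return all(v == 0 for v in counts.values())
```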

0




I think you should build a tree. I wrote some Python code to illustrate the idea, but it probably still has bugs:

class knot():
    def __init__(self, char, is_word, string = "", way = 0):
        self.children = []
        self.shortest_way = way
        self.char = char
        self.is_word = is_word
        self.string = string

def comparing(strings):
    sorted_strings = []
    for string in strings:
        array_of_string = []
        for char in string:
            array_of_string.append(char)
        array_of_string.sort()
        sorted_strings.append(array_of_string)
    sorted_strings.sort()

    start = []
    matches = []

    for array_of_string in sorted_strings:
        matches += insert_string(array_of_string, start)

def insert_string(array, start):
    matches = []
    for match_knot in search_children(array, start):
        matches.append((array, match_knot.string))
    add_string(array, start, 0)
    return matches

def add_string(array, knots, n):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if knots[num].char > array[n]:
            minimum = num
        elif knots[num].char < array[n]:
            maximum = num
        elif knots[num].char == array[n]:
            return add_string(array, knots[num].children, n + 1)
    knots.append(knot(array[n], n == len(array) - 1, "".join(array), n))
    knots.sort(key=lambda k: k.char)

    """ more insertion routine needed """

def search_children(array, knots):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if knots[num].char > array[0]:
            minimum = num
        elif knots[num].char < array[0]:
            maximum = num
        elif knots[num].char == array[0]:
            return test_string(array, knots[num])
    return []

def test_string(array, target_knot):
    if len(array) > target_knot.shortest_way + 1:
        return []
    match_knots = []
    if len(array) == 1 and target_knot.is_word == True:
        match_knots.append(target_knot)
    for i in range(1, len(array)):
        match_knots += search_children(array[i:], target_knot.children)
    return match_knots

      

0




Assuming you are just looking for subsets and are limited to the common English letters, then an efficient histogram will do. I would look at using an unsigned 64-bit integer, with 2 bits per letter to count up to 2 occurrences, and the 12 spare bits used for an overflow flag and to count up to three occurrences of the frequent letters "etaoinsrhld". The counts are stored in unary rather than binary (so three e's is 111), because with unary counts you don't need anything more complex than masking to test containment. To test the subset relationship, check the overflow flag of the histogram being tested as a subset; if it is not set, you can simply AND the two histograms together and compare the result with the candidate subset. Fall back to an O(length) comparison of the sorted string contents if the histogram overflowed.
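A simplified sketch of the unary-histogram idea in Python (illustrative; it uses a uniform 2 bits per letter plus an overflow flag, rather than the extra-wide fields for frequent letters described above):

```python
from collections import Counter

def histogram(word):
    # 2 bits per letter a-z, counted in unary (00 -> 01 -> 11),
    # plus an overflow flag once any letter occurs a third time.
    h, overflow = 0, False
    for c in word:
        shift = 2 * (ord(c) - ord('a'))
        field = (h >> shift) & 0b11
        if field == 0b11:
            overflow = True
        else:
            h |= ((field << 1) | 1) << shift  # unary increment
    return h, overflow

def letters_contained(a, b):
    ha, ova = histogram(a)
    hb, ovb = histogram(b)
    if ova or ovb:
        # fall back to an O(length) exact count comparison on overflow
        ca, cb = Counter(a), Counter(b)
        return all(cb[k] >= v for k, v in ca.items())
    # with unary fields, AND gives the per-letter minimum,
    # so a's letters are contained in b's iff (ha & hb) == ha
    return (ha & hb) == ha
```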

0




Any algorithm you choose can be optimized for strings of equal length. XOR together the character codes of both strings: if the result is nonzero, the strings definitely do not contain the same letters. A zero result does not prove a match, so this doesn't decide the substring case by itself, but it can short-circuit a more expensive comparison.
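A minimal Python illustration of that XOR short-circuit (a sketch; note the collision case that makes it a filter rather than a test):

```python
from functools import reduce

def xor_fold(s):
    # XOR of all character codes; insensitive to character order.
    return reduce(lambda acc, c: acc ^ ord(c), s, 0)

# A nonzero difference between two folds proves the strings differ,
# but equal folds do not prove a match: "aa" and "bb" both fold to 0.
```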

0








