String similarity with different lengths, but the same

I know there are many string similarity algorithms out there, but I don't know which would be better for my problem.

My strings vary in length, but usually have some extra fluff added to one or the other. I want the algorithm to give high points of similarity when strings contain the same words without typos. For example, Stuff and Things corp.

matches Stuff and Things corporation

or 101, Stuff and Things corporat

or Stuff and Things

.

But the lines color

and colour

, Loremipsum

and Olremipsum

in my case are completely different. My strings would never have characters to be tricked or replaced, or strings from 1 to 50 characters long.

EDIT: The order of the same words New York city

will be very importat different or have a low level of similarity withYork New city

Thanks for any help

+3


source to share


1 answer


Ok, even if the rules are still not so clear, I'll give it a try.

A summary of your requirements:

  • Find the longest common sequence of words in another sentence
  • At least two words need to be shared, so New York

    and New Delhi

    are not equal
  • the order is important, therefore, New York city

    and York New city

    not equal

The method FindCommonWords

returns a sequence of words that are common in both sentences, or an empty sequence ( Enumerable.Empty<string>

) if no common sequence of words was found.

It first splits both lines into a predefined list of two word separators string[]

. Then it checks all "subsequences" regardless of whether it is contained in another array in the same order (using an extension method IndexOfSequence

).

private static readonly char[] wordSeparators = { '\n', '\t', ',', '.', '!', '?', ';', ':', ' ', '-', '/', '\\', '[', ']', '(', ')', '<', '>', '@', '"', '\'' };

public static IEnumerable<string> FindCommonWords(string str1, string str2, StringComparer comparer = null)
{
    if (str1 == null)
        throw new ArgumentNullException("str1", "Both input strings must not be null!");
    if (str2 == null)
        throw new ArgumentNullException("str2", "Both input strings must not be null!");

    if (comparer == null) comparer = StringComparer.CurrentCulture;
    str1 = str1.Trim();
    str2 = str2.Trim();

    string[] words1 = str1.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
    string[] words2 = str2.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
    if(Math.Min(words1.Length, words2.Length) < 2)
        return Enumerable.Empty<string>(); // one word is not supposed to be a commnon word sequence

    // use for-loop to find the longest common words
    for (int wordCount = words1.Length - 1; wordCount >= 2; wordCount--)
    {
        // scan word-count from left to right
        for (int skipCount = 0; wordCount + skipCount <= words1.Length; skipCount++)
        {
            // take wordCount-words from left side and walk from left to right
            IEnumerable<string> wordSeq = words1.Skip(skipCount).Take(wordCount);
            // search sequence in other words
            int indexInWords2 = words2.IndexOfSequence(wordSeq, comparer);
            if (indexInWords2 >= 0)
            {
                // found match in other words, must be longest common sequence
                return wordSeq;
            }
        }
    }
    return Enumerable.Empty<string>();
}

      



Here's an extension that might also be useful for other requirements:

public static int IndexOfSequence<TSource>(this IEnumerable<TSource> input, IEnumerable<TSource> sequence, IEqualityComparer<TSource> comparer)
{
    if (input == null) throw new ArgumentNullException("input");
    if (sequence == null) throw new ArgumentNullException("sequence");
    if (!sequence.Any()) throw new ArgumentException("Sequence must not be empty", "sequence");
    if (comparer == null)
    {
        comparer = EqualityComparer<TSource>.Default;
    }
    int index = -1, firstIndex = -1, lastFoundIndex = -1;
    bool found = false;

    using (IEnumerator<TSource> enumerator = input.GetEnumerator())
    {
        using (IEnumerator<TSource> enumerator2 = sequence.GetEnumerator())
        {
            enumerator2.MoveNext();
            while (enumerator.MoveNext())
            {
                index++;
                found = comparer.Equals(enumerator.Current, enumerator2.Current);
                if (found && firstIndex == -1)
                    firstIndex = index;
                else if (found && index != lastFoundIndex + 1)
                    found = false; // sequence must be consecutive
                if (found && !enumerator2.MoveNext())
                    return firstIndex;
                if(found)
                    lastFoundIndex = index;
            }
        }
    }
    return -1;
}

      

Here are your three samples:

var commonWords = FindCommonWords(
     "Stuff and Things corporation", 
     "101, Stuff and Things corporat", 
     StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords));   // Stuff and Things

commonWords = FindCommonWords(
     "101, Stuff and Things corporat",
     "or Stuff and Things.",
     StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine( string.Join(" ", commonWords) ); // Stuff and Things

commonWords = FindCommonWords(
     "New York city",
     "York New city",
     StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords));  // empty sequence, no match

      

Please note that it is written from scratch and is not fully tested.

+2


source







All Articles