String similarity with different lengths, but the same
I know there are many string similarity algorithms out there, but I don't know which would be better for my problem.
My strings vary in length, but usually have some extra fluff added to one or the other. I want the algorithm to give high points of similarity when strings contain the same words without typos. For example, Stuff and Things corp.
matches Stuff and Things corporation
or 101, Stuff and Things corporat
or Stuff and Things
.
But the lines color
and colour
, Loremipsum
and Olremipsum
in my case are completely different. My strings would never have characters to be tricked or replaced, or strings from 1 to 50 characters long.
EDIT: The order of the same words New York city
will be very importat different or have a low level of similarity withYork New city
Thanks for any help
source to share
Ok, even if the rules are still not so clear, I'll give it a try.
A summary of your requirements:
- Find the longest common sequence of words in another sentence
- At least two words need to be shared, so
New York
andNew Delhi
are not equal - the order is important, therefore,
New York city
andYork New city
not equal
The method FindCommonWords
returns a sequence of words that are common in both sentences, or an empty sequence ( Enumerable.Empty<string>
) if no common sequence of words was found.
It first splits both lines into a predefined list of two word separators string[]
. Then it checks all "subsequences" regardless of whether it is contained in another array in the same order (using an extension method IndexOfSequence
).
private static readonly char[] wordSeparators = { '\n', '\t', ',', '.', '!', '?', ';', ':', ' ', '-', '/', '\\', '[', ']', '(', ')', '<', '>', '@', '"', '\'' };
public static IEnumerable<string> FindCommonWords(string str1, string str2, StringComparer comparer = null)
{
if (str1 == null)
throw new ArgumentNullException("str1", "Both input strings must not be null!");
if (str2 == null)
throw new ArgumentNullException("str2", "Both input strings must not be null!");
if (comparer == null) comparer = StringComparer.CurrentCulture;
str1 = str1.Trim();
str2 = str2.Trim();
string[] words1 = str1.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
string[] words2 = str2.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
if(Math.Min(words1.Length, words2.Length) < 2)
return Enumerable.Empty<string>(); // one word is not supposed to be a commnon word sequence
// use for-loop to find the longest common words
for (int wordCount = words1.Length - 1; wordCount >= 2; wordCount--)
{
// scan word-count from left to right
for (int skipCount = 0; wordCount + skipCount <= words1.Length; skipCount++)
{
// take wordCount-words from left side and walk from left to right
IEnumerable<string> wordSeq = words1.Skip(skipCount).Take(wordCount);
// search sequence in other words
int indexInWords2 = words2.IndexOfSequence(wordSeq, comparer);
if (indexInWords2 >= 0)
{
// found match in other words, must be longest common sequence
return wordSeq;
}
}
}
return Enumerable.Empty<string>();
}
Here's an extension that might also be useful for other requirements:
public static int IndexOfSequence<TSource>(this IEnumerable<TSource> input, IEnumerable<TSource> sequence, IEqualityComparer<TSource> comparer)
{
if (input == null) throw new ArgumentNullException("input");
if (sequence == null) throw new ArgumentNullException("sequence");
if (!sequence.Any()) throw new ArgumentException("Sequence must not be empty", "sequence");
if (comparer == null)
{
comparer = EqualityComparer<TSource>.Default;
}
int index = -1, firstIndex = -1, lastFoundIndex = -1;
bool found = false;
using (IEnumerator<TSource> enumerator = input.GetEnumerator())
{
using (IEnumerator<TSource> enumerator2 = sequence.GetEnumerator())
{
enumerator2.MoveNext();
while (enumerator.MoveNext())
{
index++;
found = comparer.Equals(enumerator.Current, enumerator2.Current);
if (found && firstIndex == -1)
firstIndex = index;
else if (found && index != lastFoundIndex + 1)
found = false; // sequence must be consecutive
if (found && !enumerator2.MoveNext())
return firstIndex;
if(found)
lastFoundIndex = index;
}
}
}
return -1;
}
Here are your three samples:
var commonWords = FindCommonWords(
"Stuff and Things corporation",
"101, Stuff and Things corporat",
StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords)); // Stuff and Things
commonWords = FindCommonWords(
"101, Stuff and Things corporat",
"or Stuff and Things.",
StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine( string.Join(" ", commonWords) ); // Stuff and Things
commonWords = FindCommonWords(
"New York city",
"York New city",
StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords)); // empty sequence, no match
Please note that it is written from scratch and is not fully tested.
source to share