C # search with similarity / affinity

Suppose we have a, IEnumerable Collection

with objects of 20,000 people . Then suppose we created another Person object.

We want to list all Persons who own this Person. This means, for example, if the last name has an affinity of more than 90% , add this Person to the list.

eg. ("Andrew" vs. "Andrew")

What's the most efficient / fastest way to do this?

Iterating through collection and comparison of char by char with affinity definition? OR? Any ideas?

Thank!

+2


source to share


4 answers


You may be interested in:

Levenshtein distance algorithm



Peter Norvig - How to Write a Spelling Corrector (you will be interested in the part where he compares a word to a collection of existing words)

+6


source


Depending on how often you need this search, the brute force iteration and comparison method can be quite fast. Twenty thousand records is really not that many, and if the number of requests is high, your performance may be acceptable.

However, you will need to implement the comparison logic yourself, and if you need a greater degree of flexibility (or if you need to find what you need to work for performance), you may need to look at something like Lucene.Net . Most of the text search engines I've seen and worked with were more file-based, but I think you can index objects in memory as well (I'm not sure about that, though).



Good luck!

+1


source


I'm not sure if you are asking for help finding a search given your existing proximity function, or if you are asking for help writing a proximity function. So for now, I am assuming that you are completely lost.

With this assumption in mind, you will notice that I split the problem in two and that you need to do too. You need to write a function that takes two string inputs and returns a boolean indicating whether the inputs are similar enough. Then you need a separate lookup for a delegate that will match any function with such a signature.

The basic signature for your proximity function might look like this:

bool IsAffinityMatch(string p1, string p2)

      

And then your search will look like this:

MyPersonCollection.Where(p => IsAffinityMatch(p.Surname, OtherPerson.Surname));

      

+1


source


I provide the source code for this Affinity method:

        /// <summary>
        /// Compute Levenshtein distance according to the Levenshtein Distance Algorithm
        /// </summary>
        /// <param name="s">String 1</param>
        /// <param name="t">String 2</param>
        /// <returns>Distance between the two strings.
        /// The larger the number, the bigger the difference.
        /// </returns>
        private static int Compare(string s, string t)
        {
            /* if both string are not set, its uncomparable. But others fields can still match! */
            if (string.IsNullOrEmpty(s) && string.IsNullOrEmpty(t)) return 0;

            /* if one string has value and the other one hasn't, it definitely not match */
            if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(t)) return -1;

            s = s.ToUpper().Trim();
            t = t.ToUpper().Trim();

            int n = s.Length;
            int m = t.Length;
            int[,] d = new int[n + 1, m + 1];

            int cost;

            if (n == 0) return m;
            if (m == 0) return n;

            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;

            for (int i = 1; i <= n; i++)
            {
                for (int j = 1; j <= m; j++)
                {
                    cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }

            return d[n, m];
        }

      

that is, if 0 is returned, the 2 strings are identical.

+1


source







All Articles