How does String.Contains () Fuzzy path in C #?

I have a list of faces that I want to find while filtering. Every time the user enters a search string, filtering is applied.

There are two issues to consider:

  • User can enter part of names
  • User may be wrong

The first one is simply resolved by searching for substrings, eg. String.Contains (). The second can be resolved with a fuzzy implementation (e.g. https://fuzzystring.codeplex.com )

But I don't know how to deal with the calls at the same time.

For example: I want to find the person "Dr. Martin Fowler" when entering one of:

  • "Martin"
  • "Fawler"
  • "Marten Fauler"

I think I need to write "FuzzyContains ()" logic that satisfies my needs and also has acceptable performance. Any tips on how to get started?

+3


source to share


7 replies


I modified Oliver's answer that suggested Distance Levenshtein algorithms, this is not the best choice here as the calculated distance is great when only parts of the names were entered. So I ended up using the Longest Common Subsequence algorithm implemented with the awesome FuzzyString Lib .



const int TOLERANCE = 1;
string userInput = textInput.Text.ToLower();
var matchingPeople = people.Where(p =>
{
     //Check Contains
     bool contains = p.Name.ToLower().Contains(userInput);
     if(contains) return true;

     //Check LongestCommonSubsequence
     bool subsequenceTolerated = p.Name.LongestCommonSubsequence(userInput).Length >= userInput.Length - TOLERANCE;

     return subsequenceTolerated;
}).ToList();

      

+3


source


This seems to be a job for the Levenshtein distance algorithm ( one of a dozen C # implementations ).

You give this algorithm two lines (one entered by the user and one from your list). It then calculates how many characters need to be replaced, added, or removed to move from the first line to the second. Then you can take all items from your list where the distance is less than or equal to three (for example) to find simple typos.



If you have this method, you can use it this way:

var userInput = textInput.Text.ToLower();
var matchingEmployees = EmployeeList.Where(x => x.Name.ToLower().Contains(userInput)
                                                || LevenshteinDistance.Compute(x.Name.ToLower(), userInput) <= 3)
                                    .ToList();

      

+4


source


I have done this myself before and started off with some of the methods listed on wikipedia, roughly matching string . When I finished, I set up my algorithm in a way that was not as general, but gave me the best matches on my domain.

If your entire dictionary is in memory and not too large, you can simply apply the appropriate algorithm to each member in the dictionary. If your vocabulary is large, this is likely to overuse your resources and you will need a better algorithm. You might also want to use the full text search feature in your database.

In my case, I repeated, although each line in my dictionary was comparing "matching runs", i.e. 2 points for a 2-character match, 3 for 3-character matches up to an 8-character match. I ran despite all possible pairs, threes, etc. - scoring every dictionary entry and choosing the highest scoring result. Tolerance of typos, word order, etc, but expensive computing - my dictionary was only a few thousand phrases, so it worked very well for me. This is a modified version of Bone Ratio.

+1


source


Have you tried brute force? The easiest way is to match the search string to substrings of the target strings starting from the beginning, and then match the closest match of all matches.

But this may not be acceptable from the performance view.

0


source


Perhaps you could use this implementation of soundex: CodeProject What soundex does is it compares two strings and calculates the "pronunciation" as a percentage. Some time ago I was building a search using this function (built-in PHP hasit)

0


source


I had a project for a school a while ago where we had a text box where students could search for every employee, student, who has something to do with the school. We talked about several hundred people. The simple Linq query we used was incredibly fast on a Core i3 processor. The request is called every time the user types something into the text box. In the TextChanged event, we called a request that looked like this:

var resultData = EmployeeList.Where(x=>x.Name.ToLower().Contains(textInput.Text.ToLower())).ToList();

      

Of course this logic only applies if you have Dr. Martin Fowler in one property or member.

0


source


In other languages ​​like python we have cool stuff for word processing including distance calculations. Some algorithms exist, such as Levenshtein , which calculates the fuzzy distance between two lines. I've seen some implementations in C # ( here ) and also another module was difflib which is available here . the outputs of these algorithms are numeric. the closer to 0, the better.

0


source







All Articles