How do I prepare Unicode strings for indexing?

This question is about normalizing international characters when storing localized names in a search index. I would like to discuss the problem in general and also learn about possible existing solutions (classes / libraries) in Java.


In a global application, users enter their names, and the application writes each name to a search index so that other users can search for and find them. This is trivial in English, but not in many other languages and scripts, where specific letters must be transliterated and / or can be written in multiple forms. For example, the German name Häußler can be written as

  • Häußler (Germany)
  • Haeussler (Germany, international transliteration)
  • Häussler (Switzerland)
  • Hausler (English transliteration)

Java has

    Normalizer.normalize(entry, Normalizer.Form.NFD) // or Normalizer.Form.NFC


but that doesn't work in many cases, and / or I don't know how to use it correctly. http://en.wikipedia.org/wiki/Unicode_equivalence is also a good read, but I couldn't find enough information on this topic there.
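For reference, here is a minimal sketch of what the JDK Normalizer alone achieves; the class name, the sample input, and the \p{M} regex for stripping combining marks are illustrative assumptions, not part of the original question:

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            String name = "Häußler";
            // NFD splits precomposed characters, e.g. ä into 'a' + combining diaeresis (U+0308)
            String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
            // Strip the combining marks that NFD exposed
            String stripped = decomposed.replaceAll("\\p{M}+", "");
            // Prints "Haußler": ß has no decomposition, so plain NFD
            // never produces "Haussler" or "Haeussler"
            System.out.println(stripped);
        }
    }

This shows the limitation: decomposition plus mark stripping handles ä, but it cannot cover letters like ß or locale conventions such as ä → ae.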

Does anyone know of an existing open source project where someone has already worked on this issue? Any libraries that can be used? Web sites?

What about speakers of Japanese, Chinese, Arabic, etc.: do you transliterate names from your languages into English? How do major social networks like Facebook transliterate their usernames to ensure they can be found internationally?



1 answer


You are on the right track; one more search term you can add is "canonical" (as in canonical equivalence and canonical decomposition).



I believe the ICU project is the most reliable open source software that handles this. Pay special attention to the normalization components, especially the NFKC_Casefold implementation, which handles the German ß example among many others.
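As a sketch (assuming the ICU4J library, com.ibm.icu, is on the classpath), NFKC_Casefold is available through the Normalizer2 API:

    import com.ibm.icu.text.Normalizer2;

    public class IcuFoldDemo {
        public static void main(String[] args) {
            // NFKC_Casefold = NFKC normalization + case folding
            // + removal of default-ignorable code points
            Normalizer2 folder = Normalizer2.getNFKCCasefoldInstance();
            String key = folder.normalize("Häußler");
            // Prints "häussler": ß is folded to "ss", but ä stays ä
            // (it does not become "ae" or "a")
            System.out.println(key);
        }
    }

If a pure-ASCII key is also needed, something like ICU's Transliterator (for example Transliterator.getInstance("Any-Latin; Latin-ASCII")) can be layered on top, but whether ä should become "a" or the German convention "ae" is a locale-specific choice rather than something normalization alone decides.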
