Remove dakuten / handakuten in Java (aka ten-ten, ぱ → は)

Many Japanese hiragana and katakana have a dakuten version and a handakuten version.
Example: は becomes ば or ぱ (note the ゛ and ゜ marks)

Question: How can I remove them from a String in Java?

For example, I want はばぱハバパ1aあを亜 to become はははハハハ1aあを亜.

Efficiency is important.

Context: I need to match the content against a legacy system.



1 answer


Characters with a (han)dakuten can be decomposed into a base kana and a combining mark. Java has the class Normalizer in java.text for this.

String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
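To see what the decomposition does, here is a minimal sketch that prints the code points of a normalized string. The class name DecomposeDemo and the helpers codePoints and decompose are made up for this illustration; only Normalizer.normalize comes from the JDK. The source is kept ASCII-only, so ば and ぱ are written as \u3070 and \u3071.

```java
import java.text.Normalizer;

class DecomposeDemo {
    // Hypothetical helper: lists the code points of s as "U+XXXX" separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Canonical decomposition (NFD): a voiced kana becomes base kana + combining mark.
    static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // U+3070 (hiragana ba) -> U+306F (ha) + U+3099 (combining dakuten)
        System.out.println(codePoints(decompose("\u3070"))); // U+306F U+3099
        // U+3071 (hiragana pa) -> U+306F (ha) + U+309A (combining handakuten)
        System.out.println(codePoints(decompose("\u3071"))); // U+306F U+309A
    }
}
```

The base kana は (U+306F) has no decomposition, so it passes through unchanged.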

      

Then the combining (han)dakuten marks can be removed with replace or replaceAll, for example:



String noVoicingMarks = decomposed.replace("\u3099", "").replace("\u309A", "");

      

Or (slightly faster in my tests):

String noVoicingMarks = decomposed.replaceAll("\u3099|\u309A", "");
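Putting the two steps together, a self-contained sketch might look like the following. The class name VoicingMarkRemover and the method removeVoicingMarks are invented for this example, and precompiling the Pattern is my own assumption about what helps when the method is called often; it has not been measured here. The source is ASCII-only, so the kana are written as Unicode escapes.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class VoicingMarkRemover {
    // Combining dakuten (U+3099) and combining handakuten (U+309A).
    // Compiled once so repeated calls do not recompile the regex.
    private static final Pattern VOICING_MARKS = Pattern.compile("[\\u3099\\u309A]");

    // Hypothetical helper: decomposes the input, then drops the combining voicing marks.
    static String removeVoicingMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return VOICING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Input: hiragana ha, ba, pa, katakana ha, ba, pa, then "1a".
        String input = "\u306F\u3070\u3071\u30CF\u30D0\u30D1" + "1a";
        // Expected: hiragana ha x3, katakana ha x3, then "1a".
        String expected = "\u306F\u306F\u306F\u30CF\u30CF\u30CF" + "1a";
        System.out.println(removeVoicingMarks(input).equals(expected)); // true
    }
}
```

Note that NFD also decomposes other characters that have a canonical decomposition (accented Latin letters, for example), and this sketch leaves them decomposed. If the legacy system expects composed text, normalizing the result with Normalizer.Form.NFC afterwards may be needed.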

      
