Remove dakuten / handakuten in Java (aka ten-ten, ぱ → は)
Many Japanese hiragana and katakana have dakuten and handakuten versions.
Example: は becomes ば or ぱ (note the ゛ and ゜ parts)
Question: How can I remove them from a String in Java?
For example, I want はばぱハバパ1aデア亜 to become はははハハハ1aテア亜.
Efficiency is important.
Context: the content has to match what a legacy system stores.
Characters with a (han)dakuten can be decomposed into a base kana and a combining mark. Java has the Normalizer class in java.text for this:
String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
The combining (han)dakuten marks (U+3099 and U+309A) can then be removed with replace or replaceAll, for example:
String noVoicingMarks = decomposed.replace("\u3099", "").replace("\u309A", "");
Or (slightly faster in my tests)
String noVoicingMarks = decomposed.replaceAll("\u3099|\u309A", "");
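Putting the two steps together, here is a minimal self-contained sketch. The class name VoicingMarks and the method name stripVoicingMarks are made up for illustration. The precompiled Pattern and the final NFC pass are my own assumptions, not something the question asked for: the pattern avoids recompiling the regex on every call, and NFC recomposes any other characters that NFD split apart. Drop the NFC step if the legacy system expects decomposed text.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class VoicingMarks {
    // Combining dakuten (U+3099) and combining handakuten (U+309A).
    // Compiled once so repeated calls do not recompile the regex.
    private static final Pattern COMBINING_MARKS = Pattern.compile("[\u3099\u309A]");

    // Hypothetical helper: returns the input with (han)dakuten removed from kana.
    static String stripVoicingMarks(String input) {
        // NFD splits e.g. U+3070 into U+306F followed by U+3099.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = COMBINING_MARKS.matcher(decomposed).replaceAll("");
        // Assumption: recompose whatever else NFD decomposed (accented Latin letters, etc.).
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+306F U+3070 U+3071 are the hiragana ha, ba, pa.
        System.out.println(stripVoicingMarks("\u306F\u3070\u3071\u30671a"));
    }
}
```

Note that this only covers marks that NFD separates from their base kana. The standalone spacing marks ゛ (U+309B) and ゜ (U+309C) and the halfwidth marks (U+FF9E, U+FF9F) are not touched, so they would need their own handling if they can occur in the input.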