Remove dakuten / handakuten in Java (aka ten-ten, ぱ → は)

Question

Many Japanese hiragana and katakana have a dakuten version and a handakuten version.
Example: は becomes ば or ぱ (note the ゛ and ゜ parts)

Question: How can I remove them from a String in Java?

For example, I want はばぱハバパ1aあア亜 to become はははハハハ1aあア亜.

Efficiency is important.

Context: matching content against a legacy system.


Nicolas Raoul · 08 June 17 at 14:17

1 answer

harold · Answer 1 · 2017-06-08T14:38:08+0000

Characters with a (han)dakuten can be decomposed into a base kana and a combining mark. Java has the class Normalizer in java.text for this.

String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
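To see what the decomposition produces, here is a small sketch that prints the code points of the NFD form. The class and method names (DecomposeDemo, nfdCodePoints) are made up for illustration; only Normalizer comes from the answer.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class DecomposeDemo {
    // Returns the code points of the NFD form, formatted as "U+XXXX" and joined by spaces.
    static String nfdCodePoints(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nfdCodePoints("ば")); // U+306F U+3099 (は + combining dakuten)
        System.out.println(nfdCodePoints("ぱ")); // U+306F U+309A (は + combining handakuten)
        System.out.println(nfdCodePoints("パ")); // U+30CF U+309A (ハ + combining handakuten)
        System.out.println(nfdCodePoints("は")); // U+306F (nothing to decompose)
    }
}
```

The marks that appear after decomposition are the combining forms U+3099 and U+309A, not the standalone ゛ (U+309B) and ゜ (U+309C).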

Then the combining (han)dakuten marks can be removed with replace or replaceAll, for example:

String noVoicingMarks = decomposed.replace("\u3099", "").replace("\u309A", "");

Or (slightly faster in my tests)

String noVoicingMarks = decomposed.replaceAll("\u3099|\u309A", "");
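Putting the two steps together, a self-contained sketch could look like the following. The class and method names (VoicingMarkRemover, removeVoicingMarks) are hypothetical. The single-pass loop and the final NFC recomposition are my own additions rather than part of the answer above: the loop avoids regex overhead (I have not benchmarked it against the variants above), and the NFC step is meant to put unrelated decomposed characters such as accented Latin letters back into composed form. If the input was not in NFC to begin with, that last step may change it in other ways, so drop it if that matters for the legacy system.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class VoicingMarkRemover {
    // Strips combining dakuten (U+3099) and handakuten (U+309A) after NFD decomposition.
    static String removeVoicingMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            // Both marks are in the BMP, so a plain char comparison is enough;
            // surrogate pairs are copied through unchanged.
            if (c != '\u3099' && c != '\u309A') {
                sb.append(c);
            }
        }
        // Assumption: recomposing to NFC is acceptable for the caller.
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(removeVoicingMarks("はばぱハバパ1aあア亜")); // はははハハハ1aあア亜
    }
}
```

Note that this only handles characters whose canonical decomposition contains a combining mark. Standalone ゛ and ゜ characters and halfwidth voicing marks are left as they are.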