Segment a Korean word into separate syllables - C++ / Python

Question

I am trying to segment a Korean string into separate syllables. So the input will be a string like "서울 특별시" and the result should be "서", "울", "특", "별", "시". I tried to segment the string in both C++ and Python, but the result is a series of ? characters or white spaces, respectively (the string itself, however, can be printed correctly on the screen). In C++, I first initialized the input string as string korean="서울특별시" and then used string::iterator to traverse the string and print each individual component. In Python, I just used a simple for loop.

I am wondering if there is a solution to this problem. Thanks.

+3

c++ python string encoding tokenize

user1718064 Jan 31 '13 at 11:13

1 answer

Steve Jessop · Accepted Answer · 2013-01-31T11:25:37+0000

I do not know Korean at all and cannot comment on the syllable division, but in Python 2 the following works:

# -*- coding: utf-8 -*- 
print(repr(u"서울특별시"))
print(repr(u"서울특별시"[0]))

Output:

u'\uc11c\uc6b8\ud2b9\ubcc4\uc2dc'
u'\uc11c'

In Python 3, you don't need the u prefix for Unicode strings.
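As a minimal Python 3 sketch of the same idea, assuming the input is already in precomposed form (one code point per syllable, as in the output above): iterating over a str yields code points, so dropping the whitespace leaves one item per syllable. The variable names are only for illustration.

```python
# Python 3: str is a sequence of Unicode code points, so no u prefix is needed.
word = "서울 특별시"

# Keep every non-whitespace code point as its own one-character string.
# Assumption: each syllable is a single precomposed code point.
syllables = [ch for ch in word if not ch.isspace()]

print(syllables)
# ascii() plays the role that repr plays for unicode strings in Python 2:
# it prints escapes, so the result is readable even if the font lacks Hangul.
print([ascii(ch) for ch in syllables])
```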

The outputs are the Unicode code points of the characters in the string, which means the string was split correctly in this case. The reason I printed them with repr is that the font in the terminal I was using cannot display them, so without repr I just see square boxes. But this is purely a rendering issue; repr shows that the data is correct.
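The same kind of inspection can help separate a data problem from a rendering problem. One plausible cause of the ? output in the question is a loop that walks over UTF-8 bytes instead of characters; this is an assumption, since the question does not show the code. A rough Python 3 sketch of the difference:

```python
text = "서울특별시"
data = text.encode("utf-8")  # the bytes a byte-oriented string would hold

print(len(text))  # number of code points
print(len(data))  # number of bytes

# Show each character next to the bytes that encode it. A lone byte from
# one of these groups is not a valid character on its own, which is why
# printing byte by byte tends to produce ? or blanks.
for ch in text:
    print(ascii(ch), ch.encode("utf-8"))

# Decoding the whole byte string restores the original text.
restored = data.decode("utf-8")
```

If the byte count is a multiple of the character count, as it is here, a loop that prints one element at a time over the byte form will never print a whole syllable.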

So, if you know logically how to identify syllables, you can use repr to see what your code actually did. Unicode NFC sounds like a good candidate for actually identifying them (thanks to R. Martino Fernandez), and unicodedata.normalize() is a way to get it.
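A hedged sketch of the NFC suggestion, in Python 3: normalizing to NFC composes conjoining jamo into precomposed syllables, after which each code point can be treated as one syllable. The helper name split_syllables is hypothetical, and the approach assumes modern Hangul, where every syllable has a precomposed form.

```python
import unicodedata

def split_syllables(text):
    """Hypothetical helper: return one string per Hangul syllable."""
    # NFC composes sequences of conjoining jamo into single code points.
    composed = unicodedata.normalize("NFC", text)
    # Assumption: after NFC, every non-whitespace code point is one syllable.
    return [ch for ch in composed if not ch.isspace()]

# Build a decomposed input on purpose (NFD splits syllables into jamo)
# to show that the helper does not depend on how the input was typed.
decomposed = unicodedata.normalize("NFD", "\uc11c\uc6b8")

print(len(decomposed))  # more than 2: the jamo are separate code points
print([ascii(s) for s in split_syllables(decomposed)])
```

Without the normalization step, the decomposed input would be split into individual jamo rather than syllables.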
