Options for Japanese MeCab tokenizer on iOS?

I am using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having trouble getting it to tokenize all possible words. Specifically, I cannot get it to split "吉本興業" into its two parts, "吉本" and "興業". Are there any options I could use to fix this? The iPhone library doesn't expose any options, but it uses C++ under the Objective-C wrapper. I assume there must be some setting I could change to get finer control, but I have no idea where to start.

By the way, if anyone wants to tag this "mecab" that would probably be appropriate. I am not yet allowed to create new tags.

UPDATE: The iOS library calls mecab_sparse_tonode2(), defined in libmecab.cpp. If anyone can point me to any English documentation on this function or file, that might be enough.
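
From what I can tell, mecab_sparse_tonode2() is part of the standard MeCab C API. Below is a minimal sketch of how that API is typically driven; it is not the wrapper's actual code, and the dictionary paths are placeholders. The options string passed to mecab_new2() accepts the same flags as the mecab command line, which looks like the place where finer control (such as a user dictionary) would be plugged in:

    // Minimal sketch of the MeCab C API (not the wrapper's code).
    // The -d / -u paths are placeholders.
    #include <mecab.h>
    #include <stdio.h>
    #include <string.h>

    int main() {
        // Options are given exactly as on the mecab command line.
        mecab_t *m = mecab_new2("-d /path/to/ipadic -u /path/to/mydic.dic");
        if (!m) {
            fprintf(stderr, "mecab_new2 failed: %s\n", mecab_strerror(NULL));
            return 1;
        }

        const char *text = "吉本興業";
        const mecab_node_t *node = mecab_sparse_tonode2(m, text, strlen(text));
        for (; node; node = node->next) {
            if (node->stat == MECAB_BOS_NODE || node->stat == MECAB_EOS_NODE)
                continue;  // skip the begin/end-of-sentence pseudo-nodes
            // node->surface is not null-terminated; node->length gives its size.
            printf("%.*s\t%s\n", (int)node->length, node->surface, node->feature);
        }

        mecab_destroy(m);
        return 0;
    }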

1 answer


There is nothing iOS-specific about this. The dictionary you are using with mecab (possibly ipadic) has an entry for the company name 吉本興業. Although both parts of the name are also listed as separate nouns, mecab has a strong preference for labeling the compound name as a single token.

Mecab does not have a feature that lets the user choose whether or not to split compounds. Note that such a feature is generally difficult to implement, because not everyone agrees on which compounds can be split and which cannot. For example: is 容疑者 a compound of 容疑 and 者? From a purely morphological point of view, perhaps yes, but for most practical applications, probably not.

If you have a list of compounds that you want to split, a quick fix is to create a user dictionary for the parts they are composed of, and have mecab use it in addition to the main dictionary.

There is Japanese documentation on how to do this in the MeCab manual. For your specific example, the steps are as follows.



  • Make a user dictionary with two entries, one for 吉本 and one for 興業:

    ๅ‰ๆœฌ,,,100,ๅ่ฉž,ๅ›บๆœ‰ๅ่ฉž,ไบบๅ,ๅ,*,*,ใ‚ˆใ—ใ‚‚ใจ,ใƒจใ‚ทใƒขใƒˆ,ใƒจใ‚ทใƒขใƒˆ
    ่ˆˆๆฅญ,,,100,ๅ่ฉž,ไธ€่ˆฌ,*,*,*,*,ใ“ใ†ใŽใ‚‡ใ†,ใ‚ณใ‚ฆใ‚ฎใƒงใ‚ฆ,ใ‚ณใ‚ฆใ‚ฎใƒงใ‚ฆ
    
          

    I suspect both entries already exist in the default dictionary, but by adding them to a user dictionary and giving them a relatively low cost (I used 100 for both; the lower the cost, the more likely the entry is to be used), you can get mecab to prefer the parts over the compound.

  • Compile the user dictionary:

    $> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic

    You may need to customize the command. The above assumes:

    • Mecab was installed from source into $MECAB. If you are using a mecab installed by your package manager, you may have a hard time locating the mecab-dict-index tool; it is easiest to install from source.

    • The default dictionary is in /usr/lib64/mecab/dic/ipadic. It is not part of the mecab package itself; it comes as a separate package (e.g. ipadic), which may also be hard to track down.

    • mydic is the name of the user dictionary file you created in step 1; mydic.dic is the name of the compiled dictionary you will get as output (the names are your choice).

    • Both the system dictionary (the -t option) and the user dictionary (the -f option) are encoded in UTF-8. This may not be true for your setup, in which case you will get errors later when you use mecab.

  • Change the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or something similar; in your case it may be somewhere else. Add the following line to the end of the config file:

    userdic = /home/myhome/mydic.dic

    Make sure the absolute path to the dictionary you compiled above is correct. If you cannot edit a system-wide config file, for example inside an iOS app bundle, the user dictionary can instead be passed as a runtime option when the tagger is created; see the sketch below this list.
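
If editing a system-wide configuration file is not practical, as inside an iOS app bundle, the same effect can be achieved by passing the compiled user dictionary as a runtime option when the tagger is created, since the tagger accepts the same flags as the mecab command line. Here is a minimal sketch using the MeCab C++ API; the paths are placeholders and would need to point at the dictionary files shipped with the app:

    // Sketch: load a user dictionary at runtime via tagger options
    // instead of editing dicrc. Paths are placeholders.
    #include <mecab.h>
    #include <iostream>

    int main() {
        // -d: directory of the main dictionary, -u: compiled user dictionary.
        MeCab::Tagger *tagger =
            MeCab::createTagger("-d /path/to/ipadic -u /path/to/mydic.dic");
        if (!tagger) {
            std::cerr << "createTagger failed: " << MeCab::getTaggerError() << "\n";
            return 1;
        }

        // With the user dictionary loaded, the compound should split into its parts.
        std::cout << tagger->parse("吉本興業");

        delete tagger;
        return 0;
    }

The equivalent with the C API used by the wrapper is to include -u in the options string given to mecab_new2().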

If you then run mecab against your input, it will split the compound into its parts (I tested this using mecab 0.994 on a Linux system).

A more complete fix would be to get the source of the default dictionary, manually remove all nouns you want to split, and recompile the dictionary. As a general note, using a CJK tokenizer in a serious production application over a longer period of time usually involves a certain amount of regular dictionary maintenance (adding and removing entries).
