Options for Japanese MeCab tokenizer on iOS?

I am using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having trouble getting it to tokenize all possible words. Specifically, I cannot get it to split "吉本興業" into its two parts, "吉本" and "興業". Are there any options I could use to fix this? The iPhone library doesn't expose any options, but it uses C++ under the Objective-C wrapper. I assume there must be some setting I could change to get finer control, but I have no idea where to start.

By the way, if anyone wants to tag this "mecab" that would probably be appropriate. I am not yet allowed to create new tags.

UPDATE: The iOS library calls mecab_sparse_tonode2(), defined in libmecab.cpp. If anyone can point me to any English documentation on this function or file, that might be enough.
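
From what I can tell, mecab_sparse_tonode2() is part of the standard MeCab C API. Below is a minimal sketch of how that API is typically driven; it is not the wrapper's actual code, and the dictionary paths are placeholders. The options string passed to mecab_new2() accepts the same flags as the mecab command line, which looks like the place where finer control (such as a user dictionary) would be plugged in:

    // Minimal sketch of the MeCab C API (not the wrapper's code).
    // The -d / -u paths are placeholders.
    #include <mecab.h>
    #include <stdio.h>
    #include <string.h>

    int main() {
        // Options are given exactly as on the mecab command line.
        mecab_t *m = mecab_new2("-d /path/to/ipadic -u /path/to/mydic.dic");
        if (!m) {
            fprintf(stderr, "mecab_new2 failed: %s\n", mecab_strerror(NULL));
            return 1;
        }

        const char *text = "吉本興業";
        const mecab_node_t *node = mecab_sparse_tonode2(m, text, strlen(text));
        for (; node; node = node->next) {
            if (node->stat == MECAB_BOS_NODE || node->stat == MECAB_EOS_NODE)
                continue;  // skip the begin/end-of-sentence pseudo-nodes
            // node->surface is not null-terminated; node->length gives its size.
            printf("%.*s\t%s\n", (int)node->length, node->surface, node->feature);
        }

        mecab_destroy(m);
        return 0;
    }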

1 answer


There is nothing iOS-specific about this. The dictionary you are using with mecab (possibly ipadic) has an entry for the company name 吉本興業. Although both parts of the name are also listed as separate nouns, mecab has a strong preference for labeling the compound name as a single token.

Mecab does not have a feature that lets the user choose whether or not to split compounds. Note that such a feature is generally difficult to implement, because not everyone agrees on which compounds can be split and which cannot. For example: is 容疑者 a compound of 容疑 and 者? From a purely morphological point of view, perhaps yes, but for most practical applications, probably not.

If you have a list of compounds that you want to split, a quick fix is to create a user dictionary for the parts they are composed of, and have mecab use it in addition to the main dictionary.

There is Japanese documentation on how to do this in the MeCab manual. For your specific example, the steps are as follows.



  • Make a user dictionary with two entries, one for 吉本 and one for 興業:

    ๅ‰ๆœฌ,,,100,ๅ่ฉž,ๅ›บๆœ‰ๅ่ฉž,ไบบๅ,ๅ,*,*,ใ‚ˆใ—ใ‚‚ใจ,ใƒจใ‚ทใƒขใƒˆ,ใƒจใ‚ทใƒขใƒˆ
    ่ˆˆๆฅญ,,,100,ๅ่ฉž,ไธ€่ˆฌ,*,*,*,*,ใ“ใ†ใŽใ‚‡ใ†,ใ‚ณใ‚ฆใ‚ฎใƒงใ‚ฆ,ใ‚ณใ‚ฆใ‚ฎใƒงใ‚ฆ
    
          

    I suspect both entries already exist in the default dictionary, but by adding them to a user dictionary and giving them a relatively low cost (I used 100 for both; the lower the cost, the more likely the entry is to be used), you can get mecab to prefer the parts over the compound.

  • Compile the user dictionary:

    $> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic

    You may need to customize the command. The above assumes:

    • Mecab was installed from source into $MECAB. If you are using a mecab installed by your package manager, you may have a hard time locating the mecab-dict-index tool; it is easiest to install from source.

    • The default dictionary is in /usr/lib64/mecab/dic/ipadic. It is not part of the mecab package itself; it comes as a separate package (e.g. ipadic), which may also be hard to track down.

    • mydic is the name of the user dictionary file you created in step 1; mydic.dic is the name of the compiled dictionary you will get as output (the names are your choice).

    • Both the system dictionary (the -t option) and the user dictionary (the -f option) are encoded in UTF-8. This may not be true for your setup, in which case you will get errors later when you use mecab.

  • Change the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or something similar; in your case it may be somewhere else. Add the following line to the end of the config file:

    userdic = /home/myhome/mydic.dic

    Make sure the absolute path to the dictionary you compiled above is correct. If you cannot edit a system-wide config file, for example inside an iOS app bundle, the user dictionary can instead be passed as a runtime option when the tagger is created; see the sketch below this list.
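
If editing a system-wide configuration file is not practical, as inside an iOS app bundle, the same effect can be achieved by passing the compiled user dictionary as a runtime option when the tagger is created, since the tagger accepts the same flags as the mecab command line. Here is a minimal sketch using the MeCab C++ API; the paths are placeholders and would need to point at the dictionary files shipped with the app:

    // Sketch: load a user dictionary at runtime via tagger options
    // instead of editing dicrc. Paths are placeholders.
    #include <mecab.h>
    #include <iostream>

    int main() {
        // -d: directory of the main dictionary, -u: compiled user dictionary.
        MeCab::Tagger *tagger =
            MeCab::createTagger("-d /path/to/ipadic -u /path/to/mydic.dic");
        if (!tagger) {
            std::cerr << "createTagger failed: " << MeCab::getTaggerError() << "\n";
            return 1;
        }

        // With the user dictionary loaded, the compound should split into its parts.
        std::cout << tagger->parse("吉本興業");

        delete tagger;
        return 0;
    }

The equivalent with the C API used by the wrapper is to include -u in the options string given to mecab_new2().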

If you then run mecab against your input, it will split the compound into its parts (I tested this using mecab 0.994 on a Linux system).

A more complete fix would be to get the source of the default dictionary, manually remove all nouns you want to split, and recompile the dictionary. As a general note, using a CJK tokenizer in a serious production application over a longer period of time usually involves a certain amount of regular dictionary maintenance (adding and removing entries).
