Can unidic entries be weighted above unidic-neologd entries?

Given the sentence "場所は多少わかりづらいんですけど、感じのいいところでした。" (that is, "It's a little hard to find, but it's a good place."), running

mecab -d mecab-unidic-neologd

gives this as the first line of output:

場所  バショ バショ 場所  名詞-固有名詞-人名-姓

      

That is, it says "場所" is a person's family name. With the usual mecab-unidic, it more accurately says that "場所" is just a common noun:

場所  バショ バショ 場所  名詞-普通名詞-一般      

      

My first question is, did unidic-neologd replace all entries in unidic, or did it just add its own 3 million nouns?

Second, assuming it is a merge, is it possible to re-weight the entries so that the plain unidic entries are preferred a little more? That is, I would like 中居正広のミになる図書館 and SMAP each to be recognized as a single proper noun, but I also need 場所 to always mean "place" (unless it is accompanied by a name suffix such as さん or 様, of course).

Links: unidic-neologd

1 answer


Neologd is merged with unidic (or ipadic), which is why it keeps "unidic" in the name. If a word has entries with several parts of speech, as 場所 does, the entry to use is chosen by minimizing the cost of the sentence as a whole, which combines the part-of-speech transition costs and, for words in the dictionary, the cost of each token.
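To make the cost minimization concrete, here is a minimal sketch for a single token. The word costs and context IDs are taken from the 場所 entries quoted in this answer, but every connection cost, the `BOS` ID, and the `NEXT_LEFT` ID are invented for illustration. Real MeCab runs this kind of minimization over the whole sentence lattice, not one token at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    label: str
    left_id: int
    right_id: int
    cost: int


# Context IDs and word costs copied from the 場所 entries quoted in this answer.
CANDIDATES = [
    Entry("proper noun (general)", 4786, 4786, 4329),
    Entry("proper noun (surname)", 4790, 4790, 4329),
    Entry("common noun", 5145, 5145, 4193),
]

BOS = 0          # HYPOTHETICAL context ID for the start of the sentence
NEXT_LEFT = 100  # HYPOTHETICAL left context ID of the following token

# HYPOTHETICAL connection costs, keyed by (right ID of previous, left ID of next).
CONNECTION = {
    (BOS, 4786): 500, (BOS, 4790): -900, (BOS, 5145): 300,
    (4786, NEXT_LEFT): 200, (4790, NEXT_LEFT): -400, (5145, NEXT_LEFT): 100,
}


def path_cost(entry):
    """Cost of reading the token as `entry`: connection in + word cost + connection out."""
    return (CONNECTION[(BOS, entry.left_id)]
            + entry.cost
            + CONNECTION[(entry.right_id, NEXT_LEFT)])


best = min(CANDIDATES, key=path_cost)
for entry in CANDIDATES:
    print(entry.label, path_cost(entry))
print("chosen:", best.label)
```

With these made-up connection costs the surname reading wins even though the common noun has the lowest word cost, which is one way the behavior in the question could arise.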

If you look in the CSV file that contains the neologd dictionary entries, you will see two entries for 場所:

場所,4786,4786,4329,名詞,固有名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*                              
場所,4790,4790,4329,名詞,固有名詞,人名,姓,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*

      

And in lex.csv, the standard unidic dictionary:

場所,5145,5145,4193,名詞,普通名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,混,*,*,*,*

      

The fourth column is the cost. An entry with a lower cost is more likely to be chosen, so in this case you could raise the cost of 場所 as a proper noun, although, to be honest, I would just delete that entry. You can read more about how to adjust costs here (Japanese).
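Deleting the entry can be done by hand, but if you would rather script it, something like the following sketch might work. The helper name `drop_entries` is hypothetical, and the column layout (surface, left ID, right ID, cost, then part-of-speech fields) is assumed from the rows quoted above.

```python
import csv


def drop_entries(lines, surface, pos_prefix):
    """Return parsed CSV rows, skipping those matching the surface and leading POS fields."""
    kept = []
    for row in csv.reader(lines):
        # Assumed columns: 0 surface, 1 left ID, 2 right ID, 3 cost, 4+ POS fields.
        if row and row[0] == surface and row[4:4 + len(pos_prefix)] == list(pos_prefix):
            continue
        kept.append(row)
    return kept


# The two neologd rows for 場所 quoted above.
SAMPLE = [
    "場所,4786,4786,4329,名詞,固有名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*",
    "場所,4790,4790,4329,名詞,固有名詞,人名,姓,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*",
]

rows = drop_entries(SAMPLE, "場所", ("名詞", "固有名詞", "人名", "姓"))
for row in rows:
    print(",".join(row[:8]))
```

Passing a shorter prefix such as `("名詞", "固有名詞")` would drop both proper-noun rows for 場所 instead of only the surname one.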

If you want to favor the plain unidic entries across the board, you can modify the neologd CSV file to increase all of its costs. This is one way to create such a file:

awk -F, 'BEGIN{OFS=FS}{$4 = $4 * 100; print $0}' neolog.csv > neolog.fix.csv
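A rough Python equivalent is sketched below in case it is useful. It differs from the awk one-liner in two ways: it parses the file with the `csv` module, so a quoted surface form containing a comma does not shift the columns, and it clamps the result to the signed 16-bit range. The clamp rests on an assumption: as far as I know MeCab stores word costs as 16-bit signed integers, but check this against your version. The function name `scale_costs` is hypothetical.

```python
import csv
import io

# ASSUMPTION: word costs must fit in a signed 16-bit integer.
COST_MIN, COST_MAX = -32768, 32767


def scale_costs(src, dst, factor):
    """Copy CSV rows from src to dst, multiplying the cost column (index 3) by factor."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        if len(row) > 3:
            scaled = round(int(row[3]) * factor)
            row[3] = str(max(COST_MIN, min(COST_MAX, scaled)))
        writer.writerow(row)


# Demo on one of the rows quoted above (truncated to its first five fields).
demo_in = io.StringIO("場所,4790,4790,4329,名詞\n")
demo_out = io.StringIO()
scale_costs(demo_in, demo_out, 2)
print(demo_out.getvalue(), end="")
```

For real files, open both with `encoding="utf-8"` and `newline=""`. Note that under the 16-bit assumption a factor of 100 would push almost every positive cost to the same ceiling, so a smaller factor is probably more useful.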

      



You will need to delete the original csv file before building (see note 2 below).

In this particular case, I think you should report it as a bug to the neologd project.


Note 1: As mentioned above, the selected entry depends on the sentence as a whole, so it is possible to get a non-proper-noun tag even with the default configuration. Sample sentence:

お店の場所知っている?

      


Note 2: The way the neologd dictionary is combined with the standard unidic dictionary relies on a subtle aspect of how MeCab dictionaries are built. Specifically, all CSV files in the dictionary build directory are used when creating the system dictionary. The order is not specified, so it is unclear what happens when entries collide.

This behavior is mentioned in the MeCab documentation here (Japanese).
