Can unidic be balanced against unidic-neologd?
With the sentence "場所は多少わかりづらいんですけど、感じのいいところでした。" (that is, "It's a little hard to find, but it's a good place."), running
mecab -d mecab-unidic-neologd
gives this as the first line of output:
場所 バショ バショ 場所 名詞-固有名詞-人名-姓
That is, it says "場所" is a person's surname. With the usual mecab-unidic, it more accurately says that "場所" is just a plain noun:
場所 バショ バショ 場所 名詞-普通名詞-一般
My first question is: did unidic-neologd replace all the entries in unidic, or did it just append its own 3 million nouns?
Then, secondly, assuming it is a merge, is it possible to re-weight the entries so that plain unidic entries are preferred a little more strongly? That is, I would like 中居正広のミになる図書館 and SMAP to each be recognized as a single proper noun, but I also need 場所 to always mean "place" (unless it is followed by a name suffix such as さん or 様, of course).
Links: unidic-neologd
Neologd is merged with unidic (or ipadic), which is why it keeps "unidic" in the name. If a word has entries with several parts of speech, as 場所 does, the entry to use is chosen by minimizing the cost over the whole sentence. That cost combines the part-of-speech transition costs with, for words in the dictionary, the cost of each token.
If you look in the CSV file that contains the neologd dictionary entries, you will see two entries for 場所:
場所,4786,4786,4329,名詞,固有名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
場所,4790,4790,4329,名詞,固有名詞,人名,姓,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
And in lex.csv, the standard unidic dictionary:
場所,5145,5145,4193,名詞,普通名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,混,*,*,*,*
The fourth column is the cost. An entry with a lower cost is more likely to be chosen, so in this case you can raise the cost of 場所 as a proper noun, although, to be honest, I would just delete it. You can read more about how to tinker with the cost here (Japanese).
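Raising the cost of only the offending entries could look something like the sketch below. It assumes the column layout of the lines quoted above (surface form in column 1, cost in column 4, second-level part of speech in column 6). The helper name raise_cost, the file names and the penalty of 5000 are placeholders chosen for illustration, not recommended values. Like any plain comma split, it would miscount columns on entries whose surface form itself contains a quoted comma.

```shell
# Sketch: add a penalty to the cost of one surface form, but only on
# entries tagged as proper nouns (固有名詞). Reads dictionary CSV on stdin.
# raise_cost is a hypothetical helper, not part of MeCab or neologd.
raise_cost() {
  awk -F, -v surface="$1" -v bump="$2" 'BEGIN { OFS = FS }
    $1 == surface && $6 == "固有名詞" { $4 = $4 + bump }  # column 4 = cost
    { print }'
}

# Usage (file names are placeholders):
# raise_cost 場所 5000 < neolog.csv > neolog.fix.csv
```

Entries for other words, and the common-noun entry for 場所, pass through unchanged.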
If you want to weight all the default unidic entries more strongly, you can modify the neologd CSV file to increase all of its costs. This is one way to create such a file:
awk -F, 'BEGIN{OFS=FS}{$4 = $4 * 100; print $0}' neolog.csv > neolog.fix.csv
You will need to delete the original CSV file before building (see Note 2 below).
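A possible variant adds a flat penalty instead of multiplying and caps the result. This rests on an assumption: as far as I know, MeCab keeps the word cost in a signed 16-bit field, so a cost such as 4329 multiplied by 100 may be larger than the dictionary compiler can store. How mecab-dict-index reacts to that has not been verified here. The helper name penalize_all, the penalty of 2000 and the file names are placeholders.

```shell
# Sketch: add the same penalty to every entry's cost (column 4),
# capped at 32767 (assumed upper bound of a signed 16-bit cost field).
# penalize_all is a hypothetical helper; it reads dictionary CSV on stdin.
penalize_all() {
  awk -F, -v bump="$1" 'BEGIN { OFS = FS }
    {
      c = $4 + bump
      if (c > 32767) c = 32767
      $4 = c
      print
    }'
}

# Usage (file names are placeholders):
# penalize_all 2000 < neolog.csv > neolog.fix.csv
```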
In this particular case, I think you should report it as a bug to the Neologd project.
Note 1: As mentioned above, since the selected entry depends on the sentence as a whole, it is possible to get a non-proper-noun tag even with the default configuration. Sample sentence:
お店の場所知っている？
Note 2: The way the neologd dictionary is combined with the standard unidic dictionary relies on a subtle aspect of how MeCab dictionaries are built. Specifically, all CSV files in the dictionary build directory are used when creating the system dictionary. The order is not specified, so it is unclear what happens in the event of collisions.
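To get a feel for how many collisions there are, one could list the surface forms that appear in both CSV files. The sketch below assumes the surface form is the first comma-separated field and ignores quoted commas; common_surfaces and the file names in the usage line are placeholders.

```shell
# Sketch: print each surface form (column 1) that occurs in both files, once.
# common_surfaces is a hypothetical helper, not a MeCab tool.
common_surfaces() {
  awk -F, '
    NR == FNR { seen[$1] = 1; next }             # first file: remember surfaces
    ($1 in seen) && !printed[$1]++ { print $1 }  # second file: report overlaps once
  ' "$1" "$2"
}

# Usage (file names are placeholders):
# common_surfaces lex.csv neolog.csv
```

This only reports that both dictionaries define the same surface form; it says nothing about which entry MeCab ends up choosing.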
This feature is mentioned in the Mecab documentation here (Japanese).