Why is Stanford CoreNLP gender identification non-deterministic?
I get the following results, and as you can see the name edward gets different results (null and MALE). This happens with several names.
edward, Gender: null
james, Gender: MALE
karla, Gender: null
edward, Gender: MALE
Also, how can I set up complete dictionaries? I want to add Spanish and Chinese names.
You raised several issues!
1.) Karla is not in the default gender mappings file, so it gets null.
2.) If you want to create your own file, it must be in this format:
JOHN\tMALE
Each line must contain exactly one entry in the form NAME\tGENDER (name, tab character, gender).
The GenderAnnotator can only accept one mappings file, so you need to create a new file containing the names you want to add.
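Since a stray space instead of a tab is an easy mistake to make, here is a minimal sketch of a format check you could run on your file before handing it to the pipeline. The class name `GenderMapFormatCheck` and the sample entries are made up for illustration; the check only verifies the NAME, tab, GENDER shape described above and does not know which gender labels CoreNLP accepts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class GenderMapFormatCheck {

    /** True if the line is exactly NAME<TAB>GENDER with both fields non-empty. */
    static boolean isValidEntry(String line) {
        String[] fields = line.split("\t", -1);
        return fields.length == 2
                && !fields[0].trim().isEmpty()
                && !fields[1].trim().isEmpty();
    }

    /** Returns the 1-based numbers of the lines that do not match the format. */
    static List<Integer> findBadLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!isValidEntry(lines.get(i))) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Illustrative sample: the second line uses a space instead of a tab.
        List<String> sample = Arrays.asList("JOHN\tMALE", "KARLA FEMALE", "张伟\tMALE");
        System.out.println("Malformed lines: " + findBadLines(sample));
    }
}
```

With the sample above, only the second line is reported, because it has no tab separator.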
The default gender mapping file is in stanford-corenlp-3.5.2-models.jar.
You can extract the default gender mappings file from this jar like this:
- mkdir tmp-stanford-models-expand
- cp /path/to/stanford-corenlp-3.5.2-models.jar tmp-stanford-models-expand
- cd tmp-stanford-models-expand
- jar xf stanford-corenlp-3.5.2-models.jar
- there should now be a directory tmp-stanford-models-expand/edu
- the file you want is tmp-stanford-models-expand/edu/stanford/nlp/models/gender/first_name_map_small
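Because only one mappings file can be used, one option is to merge the extracted default file with your own names into a single new file. The following is a rough sketch of that idea, not something CoreNLP ships: the class `GenderMapMerger`, the command-line arguments, and the example names are all hypothetical. It also upper-cases the added entries to mirror the JOHN example above; whether case actually matters to the annotator is an assumption I have not verified.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of CoreNLP.
class GenderMapMerger {

    /**
     * Merges default NAME<TAB>GENDER lines with extra entries.
     * Extras override defaults with the same name; malformed default lines are skipped.
     */
    static List<String> merge(List<String> defaultLines, Map<String, String> extras) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String line : defaultLines) {
            String[] fields = line.split("\t", -1);
            if (fields.length == 2) {
                merged.put(fields[0].trim(), fields[1].trim());
            }
        }
        for (Map.Entry<String, String> e : extras.entrySet()) {
            // Assumption: upper-case entries, as in the JOHN\tMALE example.
            merged.put(e.getKey().trim().toUpperCase(Locale.ROOT),
                       e.getValue().trim().toUpperCase(Locale.ROOT));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : merged.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: GenderMapMerger <first_name_map_small> <output file>");
            return;
        }
        List<String> defaults = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Map<String, String> extras = new LinkedHashMap<>();
        extras.put("Karla", "FEMALE"); // illustrative entries only
        extras.put("José", "MALE");
        Files.write(Paths.get(args[1]), merge(defaults, extras), StandardCharsets.UTF_8);
    }
}
```

You would run it with the path to the extracted first_name_map_small and an output path, then use that output path as your custom mappings file.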
3.) Build your pipeline this way to use your custom gender dictionary:
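import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// gender is listed before ner on purpose (see point 4)
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, gender, ner");
// path to your custom NAME<TAB>GENDER mappings file
props.setProperty("gender.firstnames", "/path/to/your/gender_dictionary.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);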
4.) Try running the gender annotator before ner in your pipeline (see my annotator order above). It is possible that the RegexNERSequenceClassifier (the class that adds the gender tags) is blocked when the tokens already have NER tags. It seems to me that running the gender annotator first will fix the problem. So when you build your pipeline, make sure gender comes before ner.
The sequence "edward james karla edward" is tagged "O O PERSON PERSON" by the NER tagger. I'm not really sure why the first two tokens get "O" as their NER tags. Note that "Edward James Karla Edward" is tagged "PERSON PERSON PERSON PERSON". Keep in mind that the NER tagger factors in position in the sentence, so maybe the lowercase at the beginning of the sentence causes the first "edward" token to be marked as "O"?
If you have any problems with this please let me know and I'll be happy to help more!
TL;DR
1.) Karla gets null because the name is not in the default gender dictionary.
2.) You can create your own gender mapping file with NAME\tGENDER lines. Make sure the "gender.firstnames" property is set to the path of your new gender mapping file.
3.) Make sure the gender annotator comes before the ner annotator. This should fix the problem!