Mallet - Topic Modeling - Stop Word Error
Although I supply an extra stop word list in addition to the default stop word list, some stop words still appear in the topics when I use MALLET for topic modeling. For example, "ın", "ıf", "ıt". How can I ensure that these stop words don't appear in the topic models? The topics are below.
0 5 ı ıt time room door house people eyes thing night woman day make girl face mother voice car house
1 5 ıt ın fact meaning point experience order form human action general general religious law part change number case proof
2 5 time place work water long make a cut ın area large upper part of the house built-in machine building clay piece design
3 5 school people in development national American members social program system economic group problems education class students work politics children
4 5 year york week home music american city house president day school club william show white ın days family night
5 5 ıt time fire legs river long road side mile game earth run hit military pistol big ball started weapon
6 5 hands water white hand ın black food eyes face slow sun cold ıt life red head hot long body
7 5 ın number system data surface temperature high low type volume information material pressure feed shape fine results method shown
8 5 worldly life church god war time great death book english ın history of the century england french west soviet spirit of love
9 5 year state unified government general business federal department judicial tax value million company secretary to act publicly ın service industry
Thanks for the advice.
Check the spelling of your stop words. Mallet lowercases your corpus by default, but it doesn't lowercase your stop words!
Also check the format of the stop file: Mallet expects it to be one word per line.
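The two checks above (lowercase entries, one word per line) can be applied to a stop word list before handing it to Mallet. The following is only a minimal sketch: the class name StopListNormalizer and the sample words are made up for illustration, and it is not part of Mallet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Mallet): cleans up raw stop list lines.
class StopListNormalizer {

    // Returns lowercase, de-duplicated stop words, one entry per word.
    static List<String> normalize(List<String> rawLines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : rawLines) {
            // Split on whitespace in case several words share one line.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Locale.ROOT keeps the result independent of the system locale.
                    seen.add(word.toLowerCase(Locale.ROOT));
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("The  And", " IT ", "", "it");
        // Print one stop word per line, the format Mallet expects.
        normalize(raw).forEach(System.out::println);
    }
}
```

Writing the printed lines to a text file gives a stop list in the one-word-per-line format described above.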
And don't forget to pass the option --stoplist-file yourstopwordfile.txt to the mallet import-dir command.
EDIT: Beware of OCR errors in your original file: I see that in the topics, words like "ın" are spelled with a dotless ı (as used in Turkish spelling) and not with the usual dotted i. So either apply some OCR correction before topic modeling, or create a stop list with extra stop words spelled with the dotless ı.
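For the second option, the dotless spellings can be generated from an existing stop list instead of being typed by hand. This is a rough sketch under the assumption that every "i" in a stop word may show up as the dotless ı (U+0131); the class name DotlessVariants is hypothetical and not part of Mallet.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Mallet): adds dotless-i spellings to a stop list.
class DotlessVariants {

    // Keeps each stop word and adds a copy with every 'i' replaced by U+0131.
    static Set<String> withDotlessVariants(List<String> stopWords) {
        Set<String> out = new LinkedHashSet<>();
        for (String word : stopWords) {
            out.add(word);
            // Words without an 'i' are unchanged, and the set drops the duplicate.
            out.add(word.replace('i', '\u0131'));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample words taken from the question; replace with your own stop list.
        withDotlessVariants(Arrays.asList("in", "if", "it", "the"))
                .forEach(System.out::println);
    }
}
```

Whether the dotless characters print correctly depends on the console encoding, so it is safer to write the result to a UTF-8 file and pass that file as the stop list.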
EDIT2: There is another possible source for the dotless "ı", "ıf", "ıt": Mallet lowercases all words in the corpus. When your locale is set to Turkish, Java lowercases the capital I to a dotless ı. Check your Java locale settings and build your topic model again from scratch.
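The locale effect is easy to reproduce in plain Java, independently of Mallet. The sketch below only demonstrates how String.toLowerCase behaves under a Turkish locale; the class name TurkishLowercaseDemo is made up, and it makes no claim about how Mallet calls this method internally.

```java
import java.util.Locale;

// Standalone demonstration of locale-sensitive lowercasing in Java.
class TurkishLowercaseDemo {

    // Lowercases a word using the rules of the given locale.
    static String lower(String word, Locale locale) {
        return word.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Under Turkish rules, capital I becomes dotless U+0131.
        System.out.println(lower("IT", turkish).equals("\u0131t")); // true

        // Under locale-neutral rules, capital I becomes the usual dotted i.
        System.out.println(lower("IT", Locale.ROOT).equals("it")); // true

        // The locale the JVM falls back to when none is given.
        System.out.println("Default locale: " + Locale.getDefault());
    }
}
```

If the last line reports a Turkish locale, the JVM default can be changed at startup through the standard user.language and user.country system properties (for example -Duser.language=en -Duser.country=US). How to pass those to the mallet launcher depends on your installation.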