Mallet - Topic Modeling - Stop Word Error

Although I supply an extra stop word list in addition to the default stoplist, some stop words still appear in the topics when I use MALLET for topic modeling, for example "ın", "ıf", "ıt". How can I make sure these stop words don't appear in the topic models? The topics are below.

0 5 ı ıt time room door house people eyes thing night woman day make girl face mother voice car house

1 5 ıt ın fact meaning point experience order form human action general general religious law part change number case proof

2 5 time place work water long make a cut ın area large upper part of the house built-in machine building clay piece design

3 5 school people in development national American members social program system economic group problems education class students work politics children

4 5 year york week home music american city house president day school club william show white ın days family night

5 5 ıt time fire legs river long road side mile game earth run hit military pistol big ball started weapon

6 5 hands water white hand ın black food eyes face slow sun cold ıt life red head hot long body

7 5 ın number system data surface temperature high low type volume information material pressure feed shape fine results method shown

8 5 worldly life church god war time great death book english ın history of the century england french west soviet spirit of love

9 5 year state unified government general business federal department judicial tax value million company secretary to act publicly ın service industry

Thanks for any advice.

1 answer


Check the spelling of your stop words. Mallet lowercases your corpus by default, but it doesn't lowercase your stopwords!

Also check the format of the stoplist file: Mallet expects one word per line.

And don't forget to pass the option --stoplist-file yourstopwordfile.txt to the mallet import-dir command.



EDIT: Beware of OCR errors in your original file: I see that in the topics, words like "ın" are spelled with a dotless ı (as used in Turkish spelling) and not with the usual dotted i. So either apply some OCR correction before topic modeling, or create an extra stopword list containing the dotless misspellings.
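If you go the extra-stoplist route, the dotless variants can be generated rather than typed by hand. The following is only a rough sketch: the class name DotlessStopwords and the helper dotlessVariants are made up for illustration and are not part of Mallet, and it assumes that the only misspelling you need to cover is the dotted i turned into the dotless ı (U+0131). It lowercases each stop word and prints one variant per line, which is the format Mallet expects for a stoplist file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Mallet: builds dotless-i variants of stop words.
public class DotlessStopwords {

    // Returns, for every stop word containing a dotted 'i', the same word with
    // each 'i' replaced by the dotless '\u0131'. Words without 'i' are skipped.
    static List<String> dotlessVariants(List<String> stopwords) {
        List<String> variants = new ArrayList<>();
        for (String word : stopwords) {
            // Locale.ROOT keeps the lowercasing independent of the system locale.
            String lower = word.trim().toLowerCase(Locale.ROOT);
            if (lower.indexOf('i') >= 0) {
                variants.add(lower.replace('i', '\u0131'));
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // Sample input only; in practice you would read your own stoplist file.
        List<String> stopwords = Arrays.asList("In", "if", "it", "the");
        for (String variant : dotlessVariants(stopwords)) {
            System.out.println(variant); // one word per line
        }
    }
}
```

Redirect the output to a file and pass it to Mallet together with your normal stop words. This only hides the symptom, so fixing the source text or the locale is still the cleaner solution.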

EDIT2: There is another possible source for the dotless "ın", "ıf", "ıt": Mallet lowercases all words in the corpus. When your locale is set to Turkish, Java lowercases the capital I to a dotless ı. Check your Java locale settings and build your topic model again from scratch.
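To see the locale effect in isolation, here is a small stand-alone sketch. It does not call Mallet at all; the class name TurkishLowercaseDemo and the helper lower are made up for illustration. It only uses the standard String.toLowerCase(Locale) method to compare lowercasing under a Turkish locale with lowercasing under the locale-neutral Locale.ROOT.

```java
import java.util.Locale;

// Hypothetical demo class, independent of Mallet.
public class TurkishLowercaseDemo {

    // Lowercases a token under an explicitly given locale.
    static String lower(String token, Locale locale) {
        return token.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Under the Turkish locale, capital 'I' maps to dotless '\u0131',
        // so "In", "If", "It" no longer match the stop words "in", "if", "it".
        for (String token : new String[] {"In", "If", "It"}) {
            System.out.println(token
                    + " | tr: " + lower(token, turkish)
                    + " | root: " + lower(token, Locale.ROOT));
        }

        // The locale that a plain toLowerCase() call would use on this machine.
        System.out.println("Default locale: " + Locale.getDefault());
    }
}
```

If the default locale printed on your machine is Turkish, that would explain the topics above. The standard JVM system properties user.language and user.country can override the default locale. How to pass them to the Mallet launcher depends on your setup.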
