How to tokenize a set of documents into a unigram + bigram bag-of-words using gensim?
I think you could take a look at simple_preprocess from gensim.utils:
gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)
Convert a document into a list of tokens.
This lowercases, tokenizes, and optionally de-accents the input. The output is a list of final tokens (unicode strings) that won't be processed any further.