How to tokenize a set of documents into a unigram + bigram bag-of-words representation using gensim?
I know how to do this with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), norm='l2')
corpus = vectorizer.fit_transform(text)

But how can I do the same with gensim?
Nipun Alahakoon
1 answer
I think you could take a look at simple_preprocess from gensim.utils:

gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)

It converts a document into a list of tokens: the text is lowercased, tokenized and (optionally) de-accented. The output is a list of final tokens (unicode strings) that are not processed any further.
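Note that simple_preprocess only gives you unigram tokens, so to mirror scikit-learn's ngram_range=(1, 2) you still have to add the bigrams and build the bag-of-words yourself. A minimal sketch of one way to do that (my own example, not from gensim's docs) uses Dictionary, doc2bow and TfidfModel, with filter_extremes standing in for min_df=2:

from gensim import corpora, models, utils

texts = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time"]

def uni_bigrams(doc):
    # simple_preprocess: lowercase, tokenize, optionally de-accent
    tokens = utils.simple_preprocess(doc, deacc=True)
    # join consecutive tokens with "_" to get explicit bigrams,
    # roughly mimicking sklearn's ngram_range=(1, 2)
    return tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]

tokenized = [uni_bigrams(doc) for doc in texts]

dictionary = corpora.Dictionary(tokenized)                 # token -> id mapping
dictionary.filter_extremes(no_below=2, no_above=1.0,       # drop terms in fewer
                           keep_n=None)                    # than 2 documents (min_df=2)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]  # bag-of-words counts
tfidf = models.TfidfModel(bow_corpus)                      # l2-normalized tf-idf by default
tfidf_corpus = tfidf[bow_corpus]

If you only want statistically significant bigrams rather than all of them, gensim's Phrases model is another option, but that behaves differently from scikit-learn's ngram_range.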
Peter Krejzl