How to tokenize a set of documents into a unigram + bigram bag-of-words representation using gensim?
I know how to do this with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), norm='l2')
corpus = vectorizer.fit_transform(text)

But how can I do the same with gensim?
Nipun Alahakoon
1 answer
I think you could take a look at simple_preprocess from gensim.utils:

gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)

It converts a document into a list of tokens: the text is lowercased, tokenized and (optionally) de-accented. The output is a list of final tokens (unicode strings) that are not processed any further.
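Note that simple_preprocess only gives you unigram tokens, so to mirror scikit-learn's ngram_range=(1, 2) you still have to add the bigrams and build the bag-of-words yourself. A minimal sketch of one way to do that (my own example, not from gensim's docs) uses Dictionary, doc2bow and TfidfModel, with filter_extremes standing in for min_df=2:

from gensim import corpora, models, utils

texts = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time"]

def uni_bigrams(doc):
    # simple_preprocess: lowercase, tokenize, optionally de-accent
    tokens = utils.simple_preprocess(doc, deacc=True)
    # join consecutive tokens with "_" to get explicit bigrams,
    # roughly mimicking sklearn's ngram_range=(1, 2)
    return tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]

tokenized = [uni_bigrams(doc) for doc in texts]

dictionary = corpora.Dictionary(tokenized)                 # token -> id mapping
dictionary.filter_extremes(no_below=2, no_above=1.0,       # drop terms in fewer
                           keep_n=None)                    # than 2 documents (min_df=2)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]  # bag-of-words counts
tfidf = models.TfidfModel(bow_corpus)                      # l2-normalized tf-idf by default
tfidf_corpus = tfidf[bow_corpus]

If you only want statistically significant bigrams rather than all of them, gensim's Phrases model is another option, but that behaves differently from scikit-learn's ngram_range.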
Peter Krejzl