How to tokenize a set of documents into a unigram + bigram bag-of-words using gensim?

I know how to do this with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), norm='l2')
corpus = vectorizer.fit_transform(text)

How can I do the same thing with gensim?



1 answer


I think you could take a look at simple_preprocess from gensim.utils:



gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)
Convert a document into a list of tokens.

This lowercases, tokenizes, and optionally de-accents the text; the output is a list of final tokens (unicode strings) that won't be processed any further.
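
To also get bigrams on top of the unigram tokens and turn the result into a bag-of-words / TF-IDF corpus, you can combine simple_preprocess with gensim's Dictionary and TfidfModel. Here is a minimal sketch; the uni_bigrams helper is my own addition (not part of gensim) and mimics scikit-learn's exhaustive ngram_range=(1, 2) by joining every adjacent token pair:

from gensim import corpora, models, utils

text = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
]

def uni_bigrams(tokens):
    # Hypothetical helper: unigrams plus every adjacent bigram,
    # similar to sklearn's ngram_range=(1, 2).
    return tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]

# simple_preprocess lowercases, tokenizes, and optionally de-accents each document.
tokenized = [uni_bigrams(utils.simple_preprocess(doc)) for doc in text]

# Map tokens to integer ids; no_below=2 roughly matches sklearn's min_df=2.
dictionary = corpora.Dictionary(tokenized)
dictionary.filter_extremes(no_below=2, no_above=1.0)

# Sparse bag-of-words vectors, then TF-IDF (normalize=True gives L2-normalized vectors).
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
tfidf = models.TfidfModel(bow_corpus, normalize=True)
corpus = tfidf[bow_corpus]

If you only want collocations that are statistically frequent rather than every adjacent pair, gensim.models.Phrases can be trained on the tokenized documents instead; note that it merges such pairs into single tokens rather than adding all bigrams.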