How to vectorize bigrams with hashing in scikit-learn?

I have a few bigrams, say: [('word','word'),('word','word'),...,('word','word')].

How can I use scikit-learn's HashingVectorizer to create feature vectors that can later be fed to some classification algorithm, e.g. SVC or Naive Bayes or any other type of classifier?



2 answers


First, you need to understand what the various vectorizers do. Most vectorizers are based on bag-of-words approaches, where documents are tokenized and the tokens are mapped to a matrix.

From the sklearn CountVectorizer and HashingVectorizer documentation:

Convert a collection of text documents to a matrix of token counts

For example, vectorizing these two sentences (the first two sentences of the Brown corpus):

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.

The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.

with this crude vectorizer:

from collections import Counter
from itertools import chain
from string import punctuation

from nltk.corpus import brown, stopwords

# Let's say the training/testing data is the first two Brown corpus sentences.
sentences = brown.sents()[:2]

# Extract the content words as features, i.e. columns.
vocabulary = list(chain(*sentences))
stops = stopwords.words('english') + list(punctuation)
vocab_nostop = [i.lower() for i in vocabulary if i not in stops]

# Create a matrix from the sentences: one Counter of content words per sentence.
matrix = [Counter([w for w in words if w in vocab_nostop]) for words in sentences]

print(matrix)

will become:

[Counter({u"''": 1, u'``': 1, u'said': 1, u'took': 1, u'primary': 1, u'evidence': 1, u'produced': 1, u'investigation': 1, u'place': 1, u'election': 1, u'irregularities': 1, u'recent': 1}), Counter({u'the': 6, u'election': 2, u'presentments': 1, u'``': 1, u'said': 1, u'jury': 1, u'conducted': 1, u"''": 1, u'deserves': 1, u'charge': 1, u'over-all': 1, u'praise': 1, u'manner': 1, u'term-end': 1, u'thanks': 1})]
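For comparison, CountVectorizer does this tokenization and counting for you. A minimal sketch on two made-up toy documents (hypothetical strings, just to show the shape of the output):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, standing in for the Brown sentences above.
docs = ["the jury said the election was conducted fairly",
        "the jury praised the election committee"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # sparse document-term count matrix

print(vectorizer.get_feature_names())   # learned vocabulary = matrix columns
print(X.toarray())                      # one row of token counts per document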


Doing this manually can be quite inefficient on very large datasets, so the sklearn developers have written more efficient code. One of the most important features of sklearn is that you don't even need to load the whole dataset into memory before vectorizing it.
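This is also why HashingVectorizer is useful here: it is stateless (there is no vocabulary to fit), so you can stream documents from disk in batches and train incrementally. A hedged sketch, assuming a classifier with partial_fit such as SGDClassifier; iter_batches is a hypothetical helper that yields (documents, labels) batches:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(analyzer='word')  # stateless, no fit required
clf = SGDClassifier()

def iter_batches():
    # Hypothetical: read chunks of documents and their labels from disk.
    yield (['some document', 'another document'], ['es', 'pt'])

classes = ['bs', 'pt', 'es', 'sr']
for docs, labels in iter_batches():
    X = vectorizer.transform(docs)                # hash one batch at a time
    clf.partial_fit(X, labels, classes=classes)   # incremental training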

Since it is not clear what your task is, I'll assume general usage. Let's say you are using it for language identification.

Let's say your training data input file is train.txt:

Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.


And the respective labels are Bosnian, Portuguese, Spanish and Slovak, i.e.:

[bs, pt, es, sr]

Here's one way to use CountVectorizer with a Naive Bayes classifier. The following example is from the DSL shared task, https://github.com/alvations/bayesline.

Let's start with the vectorizer. The vectorizer takes the input file, converts the training set into a vectorized matrix, and initializes the vectorizer (i.e. the features):



import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data: each line of the training file is one document.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']
print(word_vectorizer.get_feature_names())

[output]:

[u'acuerdo', u'aj', u'ajudou', u'al', u'alex', u'algo', u'alpsk\xfdmi', u'alpy', u'andaba', u'andrea', u'ao', u'apresenta', u'as', u'bien', u'bl\xedzko', u'buscando', u'come\xe7o', u'como', u'con', u'conseguido', u'da', u'de', u'decepcionantes', u'deti', u'dificuldades', u'dif\xedcil', u'distancia', u'do', u'doprinese', u'druh', u'd\xe1', u'ela', u'encontrar', u'enfrentar', u'es', u'est\xe1', u'eulex', u'excusa', u'fama', u'foi', u'for\xe7as', u'furiosa', u'golf', u'golfistami', u'golfov\xfdch', u'guasch', u'ha', u'hotelmi', u'hra\u0165', u'ide', u'ihr\xedsk', u'incident', u'intranspon\xedveis', u'in\xedcio', u'in\xfd', u'ispit', u'istragu', u'izbijanju', u'ja\u010danju', u'je', u'jedan', u'jo\u0161', u'kapaciteta', u'kde', u'kombin\xe1cie', u'komplex', u'kon\u010diarmi', u'kosova', u'la', u'lado', u'lequio', u'lete', u'llevar', u'lo', u'longo', u'ly\u017eova\u0165', u'mais', u'man\u017eelky', u'mas', u'me', u'mesmo', u'meu', u'minha', u'misije', u'mo\u017enos\u0165ami', u'muy', u'm\xe1s', u'm\xe3e', u'na', u'nada', u'nad\u0161en\xfdmi', u'nasilja', u'negativas', u'nie', u'nieko\u013ek\xfdch', u'no', u'obaviti', u'obe\u0107ao', u'para', u'parecem', u'parecer', u'pod', u'pone', u'pon\xfakaj\xfa', u'por', u'potrebuj\xfa', u'po\u0161to', u'prava', u'predstavlja', u'pri', u'prova\xe7\xf5es', u'pro\u0161losedmi\u010dnom', u'punham', u'qual', u'qualquer', u'que', u'quem', u'rak\xfaske', u'relaci\xf3n', u'rezortov', u'sa', u'sebe', u'sempre', u'situa\xe7\xf5es', u'sjeveru', u'spojen\xfdch', u'suplantar', u's\xfa', u'taj', u'tak', u'talianske', u'teve', u'tive', u'todas', u'tr\xe1venia', u'una', u've\u013ek\xfd', u'vida', u'visto', u'vladavine', u'vo', u'vo\u013en\xe9ho', u'vysok\xfdmi', u'vy\u017eitia', u'v\xe4\u010d\u0161ine', u'v\u017edy', u'ya', u'zauj\xedmav\xe9', u'zime', u'\u0107e', u'\u010dasu', u'\u010di', u'\u010fal\u0161\xedmi', u'\u0161vaj\u010diarske']


Let's say your test documents are in test.txt; they are Spanish (es) and Portuguese (pt):

Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros


You can now label the test documents with the trained classifier, as follows:

import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data: fit the vectorizer on the training file.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']

# Training the Naive Bayes classifier.
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Tagging the test documents (transform only; do not refit the vectorizer).
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = mnb.predict(testset)

print(results)

[output]:

['es' 'pt']
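If you have the gold labels of the test documents (here es and pt), a minimal sketch of scoring the predictions:

from sklearn.metrics import accuracy_score

gold = ['es', 'pt']                    # known labels of the two test documents
print(accuracy_score(gold, results))   # 1.0 here, since both are correct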


For more information on text classification, you may find this related NLTK question/answer helpful: nltk NaiveBayesClassifier training for sentiment analysis.

To use the HashingVectorizer, note that by default it produces vectors with negative values, and the MultinomialNB classifier cannot handle negative values, so you will have to use a different classifier, as follows:

import codecs

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data: HashingVectorizer needs no vocabulary fitting.
word_vectorizer = HashingVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']

# Training the Perceptron.
pct = Perceptron(n_iter=100)
pct.fit(trainset, tags)

# Tagging the test documents.
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = pct.predict(testset)

print(results)

[output]:

['es' 'es']


But note that the Perceptron's results are worse in this small example. Different classifiers are suitable for different tasks, different features are suitable for different vectorizers, and different classifiers accept different kinds of vectors.

There is no perfect model, just better or worse ones.
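That said, if you would rather keep MultinomialNB with hashed features, HashingVectorizer can be told not to alternate the sign of the hashed values, so everything stays non-negative. A hedged sketch; older sklearn versions use non_negative=True instead of alternate_sign=False, so check your version:

import codecs

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps all hashed feature values >= 0
# (on older sklearn versions the parameter is non_negative=True),
# which is what MultinomialNB requires.
word_vectorizer = HashingVectorizer(analyzer='word', alternate_sign=False)
trainset = word_vectorizer.fit_transform(codecs.open('train.txt', 'r', 'utf8'))

mnb = MultinomialNB()
mnb.fit(trainset, ['bs', 'pt', 'es', 'sr'])

testset = word_vectorizer.transform(codecs.open('test.txt', 'r', 'utf8'))
print(mnb.predict(testset))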



Since you have already extracted the bigrams yourself, you can vectorize them with FeatureHasher. The main thing you need to do is squash each bigram into a single string. For example:



>>> data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
...         [('and', 'one'), ('one', 'more')]]
>>> from sklearn.feature_extraction import FeatureHasher
>>> fh = FeatureHasher(input_type='string')
>>> X = fh.transform(((' '.join(x) for x in sample) for sample in data))
>>> X
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
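The resulting sparse matrix can be fed directly to a classifier. A minimal sketch with made-up labels, just to show the wiring:

from sklearn.linear_model import Perceptron

y = ['a', 'b']                  # hypothetical labels, one per sample in data
clf = Perceptron()
clf.fit(X, y)                   # train on the hashed bigram features

# New samples must go through the same hasher (same joined-bigram format).
print(clf.predict(fh.transform([['this is', 'is a']])))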

