Scikit-learn classification: binomial logistic regression?

I have texts that are graded on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.

How can I perform binomial logistic regression to get the probability that a test document belongs to the -100 or the +100 class?

The closest I got was SGDClassifier(penalty='l2', alpha=1e-05, n_iter=10), but that doesn't give the same results as SPSS when I use binomial logistic regression to predict the probabilities of -100 and +100. So I am assuming this is not the correct function?

+3


2 answers


SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine unless you call it with a different loss function; loss='log' gives you logistic regression, which provides probability estimates.

See documentation at:   http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

Alternatively, you can use sklearn.linear_model.LogisticRegression to classify your texts with logistic regression.
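
For concreteness, a minimal sketch (the names X_train, y_train and X_test are placeholders for your own vectorized texts and binary labels, not something from your post); with a logistic loss, both estimators expose predict_proba:

from sklearn.linear_model import SGDClassifier, LogisticRegression

# Stochastic gradient descent with logistic loss
# (newer scikit-learn versions spell this loss "log_loss" instead of "log")
sgd = SGDClassifier(loss='log', penalty='l2', alpha=1e-5)
sgd.fit(X_train, y_train)
print(sgd.predict_proba(X_test))    # columns follow sgd.classes_ (sorted label order)

# Plain batch logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(logreg.predict_proba(X_test))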

It is not clear to me that you will get exactly the same results as with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.

Edited to add:



My suspicion is that the 99% accuracy you get with SPSS logistic regression is training accuracy, while the 87% you see with scikit-learn logistic regression is test accuracy. I found this question on Data Science Stack Exchange where someone is working on a very similar problem and getting ~99% accuracy on the training set and 90% accuracy on the test set:

https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features

My recommended way forward is this: try a few different basic classifiers in scikit-learn, including standard logistic regression and a linear SVM, rerun the SPSS logistic regression multiple times with different train/test subsets of your data, and compare the results. If you still see a large discrepancy between classifiers that cannot be accounted for by using the same train/test splits, post the results in your question and we can go from there.
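
One way to set up that comparison, as a rough sketch (assuming X is your document-term matrix and y your binary labels; in older scikit-learn versions train_test_split lives in sklearn.cross_validation rather than sklearn.model_selection):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for clf in (LogisticRegression(), LinearSVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__,
          "train accuracy:", clf.score(X_train, y_train),
          "test accuracy:", clf.score(X_test, y_test))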

Good luck!

+2




If pos/neg labels or the probability of pos is the only thing you need to output, then you can derive binary labels y from your scores, e.g. y = score > 0 if you have the scores in a NumPy array score. You can then fit a LogisticRegression instance on those labels, using the continuous scores to derive relative weights for the samples:



from sklearn.linear_model import LogisticRegression
import numpy as np

clf = LogisticRegression()
sample_weight = np.abs(score)          # weight each sample by the magnitude of its score
sample_weight /= sample_weight.sum()   # normalize so the weights sum to 1
clf.fit(X, y, sample_weight)

      

This gives maximum weight to texts with scores of ±100 and a weight of zero to texts with a score of exactly 0 (neutral), ramping linearly in between.
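
To get probabilities back out for new texts, something like the following should work (X_new is a placeholder for unseen, vectorized texts; the column order follows clf.classes_, so with boolean labels column 1 is the positive class):

proba_pos = clf.predict_proba(X_new)[:, 1]   # probability that each new text is positive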

If the dataset is very large, then as @brentlance showed you can use SGDClassifier, but you have to give it loss="log" if you want a logistic regression model; otherwise you will end up with a linear SVM.
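
A rough sketch of that variant, reusing the X, y and sample_weight from above; note that with the default hinge loss SGDClassifier has no predict_proba, which is another reason to set loss="log":

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="log")              # "log_loss" in newer scikit-learn versions
sgd.fit(X, y, sample_weight=sample_weight)
proba_pos = sgd.predict_proba(X_new)[:, 1]   # probability of the positive class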

0








