Scikit-learn categorization: binomial log regression?
I have texts that are graded on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.
How can I perform binomial logistic regression to get the probability that a test text belongs to the -100 or the +100 class?
The closest I got was SGDClassifier(penalty='l2', alpha=1e-05, n_iter=10), but that doesn't give the same results as SPSS when I use binomial logistic regression to predict the probabilities of -100 and +100. So I am assuming this is not the correct function?
SGDClassifier provides access to several linear classifiers, all trained by stochastic gradient descent. By default it fits a linear support vector machine; calling it with loss='log' instead gives probabilistic logistic regression.
See documentation at: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Alternatively, you can use sklearn.linear_model.LogisticRegression to classify your texts with logistic regression.
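For example (again with stand-in toy features rather than real vectorized texts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for vectorized texts (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X[:, 0] > 0).astype(int)  # binary target: negative (0) vs. positive (1)

clf = LogisticRegression()
clf.fit(X, y)

# Probability of each class for new samples; columns follow clf.classes_
proba = clf.predict_proba(X[:3])
```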
It is not clear to me that you will get exactly the same results as with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.
Edited to add:
My suspicion is that the 99% accuracy you get with SPSS logistic regression is training accuracy, while the 87% you see with scikit-learn logistic regression is test accuracy. I found a question on Data Science Stack Exchange where someone ran into a very similar problem, getting ~99% accuracy on the training set and 90% on the test set.
My recommended way forward is this: try a few different basic classifiers in scikit-learn, including standard logistic regression and a linear SVM; rerun the SPSS logistic regression multiple times with different train/test splits of your data; and compare the results. If you still see a large discrepancy between classifiers that cannot be accounted for by using the same train/test splits, post the results you see in your question and we can move forward from there.
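One way to run such a comparison in scikit-learn (the classifier choices and split ratio here are just one reasonable setup, with random stand-in features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy stand-in for vectorized text features (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = (X[:, 0] - X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for name, clf in [("logreg", LogisticRegression()), ("linear_svm", LinearSVC())]:
    clf.fit(X_train, y_train)
    # Record both scores: a large train/test gap suggests overfitting,
    # which would explain a 99% (train) vs. 87% (test) discrepancy.
    results[name] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
```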
Good luck!
If pos/neg labels or the probability of pos are the only things you need to output, you can derive binary labels y from the grades. Assuming you have the grades in a NumPy array score:

```python
y = score > 0
```
You can then pass this to a LogisticRegression instance, using the continuous scores as relative weights for the samples:

```python
clf = LogisticRegression()
sample_weight = np.abs(score)
sample_weight /= sample_weight.sum()
clf.fit(X, y, sample_weight=sample_weight)
```
This gives maximum weight to tweets with scores of ±100 and zero weight to tweets graded exactly neutral, ramping linearly in between.
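Putting the pieces together, a minimal end-to-end sketch (the feature matrix here is random stand-in data; with real texts you would build X with a vectorizer such as TfidfVectorizer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_samples, n_features = 300, 20
X = rng.randn(n_samples, n_features)            # stand-in for vectorized texts
score = rng.uniform(-100, 100, size=n_samples)  # continuous grades in [-100, 100]

y = score > 0                        # binary labels: positive vs. negative
sample_weight = np.abs(score)        # |score|: extreme texts count most
sample_weight /= sample_weight.sum() # neutral texts get ~zero weight

clf = LogisticRegression()
clf.fit(X, y, sample_weight=sample_weight)

# Probability of the "positive" class for each sample
proba_pos = clf.predict_proba(X)[:, list(clf.classes_).index(True)]
```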
If the dataset is very large, then as @brentlance showed, you can use SGDClassifier, but give it loss="log" if you want a logistic regression model; otherwise you will end up with a linear SVM.