Logistic regression on a huge dataset

I need to run logistic regression on a huge dataset (many GB of data). I am currently using the Julia GLM package for this. Although the regression works on subsets of the data, I run out of memory when I try to fit it on the full dataset.

Is there a way to compute logistic regressions on huge, non-sparse datasets without using a prohibitive amount of memory? I've been thinking about splitting the data into chunks, fitting a regression on each chunk and somehow aggregating the results, but I'm not sure whether that would give reliable estimates.

+3




5 answers


Vowpal Wabbit is designed for exactly this: linear models where the data (or even the model) doesn't fit in memory.

You can do the same thing manually using stochastic gradient descent (SGD): write down the loss function of your logistic regression (the negative log-likelihood), minimize it a little on one chunk of the data (take a single gradient-descent step), do the same on the next chunk, and continue. After a few passes over the data you should have a good solution. It works best if the data arrives in random order.
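A minimal sketch of that loop in Julia (n_predictors, chunk_iterator and the learning rate are illustrative placeholders, not part of any package):

sigmoid(z) = 1 / (1 + exp(-z))

# One SGD step on a chunk: X is an n×p matrix, y holds 0/1 labels,
# β is the coefficient vector, lr is the learning rate.
function sgd_step!(β, X, y, lr)
    p = sigmoid.(X * β)                   # predicted probabilities
    grad = X' * (p .- y) ./ length(y)     # gradient of the negative log-likelihood
    β .-= lr .* grad
    return β
end

β = zeros(n_predictors)
for (Xchunk, ychunk) in chunk_iterator    # read the data one chunk at a time
    sgd_step!(β, Xchunk, ychunk, 0.01)
end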



Another idea (ADMM, I think), similar to what you propose, is to split the data into chunks and minimize the loss function on each chunk separately. Of course, the solutions on the different chunks will not agree. To fix this, you change the objective functions by adding a small penalty for the difference between a chunk's solution and the average solution, then re-optimize everything. After several iterations the solutions get closer and closer and eventually converge. This has the added benefit of being parallelizable.
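A rough sketch of that consensus scheme, assuming the chunks can be loaded one at a time (the penalty ρ, the crude gradient-based inner solver, and the chunks collection are illustrative assumptions, not a reference implementation):

sigmoid(z) = 1 / (1 + exp(-z))

# Approximately minimize the chunk's loss plus the consensus penalty
# (ρ/2)·‖β − z + u‖² with a few gradient steps.
function local_solve(X, y, z, u, ρ; steps = 50, lr = 0.01)
    β = copy(z)
    for _ in 1:steps
        grad = X' * (sigmoid.(X * β) .- y) ./ length(y) .+ ρ .* (β .- z .+ u)
        β .-= lr .* grad
    end
    return β
end

# chunks is a collection of (X, y) pairs; p is the number of predictors.
function consensus_admm(chunks, p; ρ = 1.0, iters = 20)
    K = length(chunks)
    βs = [zeros(p) for _ in 1:K]    # per-chunk solutions
    us = [zeros(p) for _ in 1:K]    # per-chunk dual variables
    z  = zeros(p)                   # consensus (average) solution
    for _ in 1:iters
        for k in 1:K
            X, y = chunks[k]                 # only this chunk needs to be in memory
            βs[k] = local_solve(X, y, z, us[k], ρ)
        end
        z = sum(βs .+ us) / K                # update the consensus variable
        for k in 1:K
            us[k] .+= βs[k] .- z             # dual update
        end
    end
    return z
end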

+4




I haven't personally used it, but the StreamStats.jl package is built for this use case. It supports linear and logistic regression as well as other streaming statistics.



+2




Keep an eye on the excellent OnlineStats package by Josh Day. Besides a ton of online algorithms for various statistics, regression, classification, dimensionality reduction, and distribution estimation, we are also actively working on bringing over all the missing features from StreamStats and merging the two packages.

Also, I've been working on a very experimental package, OnlineAI (extending OnlineStats), which extends some of these online algorithms into the machine-learning space.

+1




To add to Tom's answer, OnlineStats.jl has a statistical learning type (StatLearn) that relies on stochastic approximation algorithms, each of which uses O(1) memory. Logistic regression and support vector machines are available for binary response data. The model can be updated with new batches of data, so you don't need to load the entire dataset at once. It's also very fast. Here's a basic example:

using OnlineStats, StatsBase
o = StatLearn(n_predictors, LogisticRegression())

# load batch 1
fit!(o, x1, y1)

# load batch 2
fit!(o, x2, y2)

# load batch 3
fit!(o, x3, y3)
...

coef(o)


0




Several scikit-learn estimators, including logistic regression, implement partial_fit, which allows incremental, batch-wise training on large datasets.

Such models can be used for classification with an out-of-core approach: learning from data that doesn't fit into main memory.

pseudocode:

from sklearn.linear_model import SGDClassifier
import numpy as np

clf = SGDClassifier(loss='log')   # logistic loss; named 'log_loss' in newer scikit-learn
classes = np.array([0, 1])        # partial_fit must see the full set of classes up front
                                  # (assuming binary 0/1 labels)

for batch_x, batch_y in some_generator:  # lazily read data in chunks
    clf.partial_fit(batch_x, batch_y, classes=classes)


0








