How to preprocess very large data in Python

I have several files of 100 MB each. The format of these files is as follows:

0  1  2  5  8  67  9  122
1  4  5  2  5  8
0  2  1  5  6
.....

      

(note: in the actual file only a single space separates the elements; the alignment above was added for readability)

The first element on each line is the binary class label, and the rest of the line lists the indices of the features whose value is 1. For example, the third line says that features 2, 1, 5 and 6 of that row are 1 and the rest are zeros.
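To make the format concrete, here is a minimal sketch (plain Python, no external libraries) of how one such line splits into a label and feature indices:

```python
# Minimal sketch: decomposing one line of this format
line = "0 2 1 5 6"
parts = line.split()
label = int(parts[0])                   # first element: the binary class label
features = [int(x) for x in parts[1:]]  # the rest: indices of features equal to 1
print(label, features)
```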

I tried to read every line from every file and use sparse.coo_matrix to create a sparse matrix like this:

import numpy as np
from scipy import sparse
from scipy.io import mmwrite

for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row + [index] * (len(record) - 1)
            col = col + [int(x) for x in record[1:]]  # skip the label
        row = np.array(row)
        col = np.array(col)
        data = np.ones(len(row), dtype=int)
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train + 'trans', mtx)

      

but it took forever. I started reading the data at night, left the computer running while I slept, and when I woke up it still hadn't finished the first file!

What are the best ways to handle this kind of data?



1 answer


I think this will be somewhat faster than your method because it doesn't read the file line by line. You can try this code on a small chunk of one file and compare it with yours.
This code also requires you to know the number of features in advance. If you don't know it, you can use the commented-out line instead.



import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial


def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1
    # (cast to int: dropna leaves a float array because of the NaN padding)
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1


def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0,col_n+2)),sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number by the line below
    # But it would not be the same across different files
    # col_n = df.max().max()
    # Number of row
    row_n = len(label)
    # Generate feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save features in matrix
    # DataFrame.apply() is usually faster than normal looping
    df.apply(partial(writeMx, result), axis=1)
    return result

for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of matrix and number of nonzero values
    # ((420, 136), 15)
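If you want to write each matrix back out in Matrix Market format like the mmwrite call in the question, the lil_matrix can be converted to COO first. A small self-contained sketch (the `demo_trans` filename and the tiny matrix are just illustrations, standing in for the `result` returned by fileToMx):

```python
from scipy.io import mmwrite
from scipy.sparse import lil_matrix

# Hypothetical stand-in for the `result` returned by fileToMx
result = lil_matrix((3, 5))
result[0, [0, 4]] = 1
# mmwrite appends the .mtx extension if it is missing
mmwrite('demo_trans', result.tocoo())
```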

      
