Vectorizing Pandas Dataframe into Numpy array

Question

I have a problem where I need to convert a pandas DataFrame into a list of arrays, each of which holds a list of lists.

Example:

import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])

I know there is an as_matrix() function, which returns the following:

df.as_matrix()
# result: array([[1, 2, 3],
#                [2, 2, 4],
#                [3, 2, 4]])

However, I need something in this format

  [array([[1], [2], [3]]),
   array([[2], [2], [4]]),
   array([[3], [2], [4]])]

I.e., I need a list of arrays, where each array contains a list of lists: each inner list holds one element, and the outer list in each array is one row of data. The effect is that each row of the DataFrame is turned into a column vector of dimension 3.

This is especially useful when I need to do matrix/vector operations in NumPy. Currently my data source is in .csv format, and I am struggling to find a way to convert the data into vectors.

Any help would be greatly appreciated.

python numpy pandas matrix dataframe

SeekingAlpha 06 June 17 at 12:07

2 answers

First convert your DataFrame to a matrix. Then add a dimension and convert the result to a list.

Try:

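import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])
# df.values replaces as_matrix(), which was removed in pandas 1.0
my_matrix = df.values
my_list_of_arrays_of_list_lists = list(np.expand_dims(my_matrix, axis=2))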

my_list_of_arrays_of_list_lists

represents what you are looking for and gives you:

Out[42]: [array([[1],[2],[3]]),
          array([[2],[2],[4]]),
          array([[3],[2],[4]])]
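
As a quick sanity check of the layout, the sketch below inspects the shape of each element in the resulting list. It uses `df.values` to get the underlying array; the variable names `columns` and `shapes` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Add a trailing axis, then let list() split along the first axis
columns = list(np.expand_dims(df.values, axis=2))

# Each DataFrame row becomes one (3, 1) column vector
shapes = [c.shape for c in columns]
print(len(columns), shapes)
```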

Franz 06 June 17 at 12:21

Divakar · Accepted Answer · 2017-06-06T12:13:12+0000

Extract the underlying array data, add a new axis as the last dimension, and then split along the first axis with np.vsplit -

np.vsplit(df.values[...,None],df.shape[0])

Example run -

In [327]: df
Out[327]: 
   0  1  2
0  1  2  3
1  2  2  4
2  3  2  4

In [328]: expected_output = [np.array([[1], [2], [3]]),
     ...: np.array([[2], [2], [4]]),
     ...: np.array([[3], [2], [4]])]

In [329]: expected_output
Out[329]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

In [330]: np.vsplit(df.values[...,None],df.shape[0])
Out[330]: 
[array([[[1],
         [2],
         [3]]]), array([[[2],
         [2],
         [4]]]), array([[[3],
         [2],
         [4]]])]

If you are working with NumPy functions, then in most scenarios you should be able to skip the split and use the expanded-dimension version of the array directly.
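
As a rough sketch of what using the expanded array directly can look like: `np.matmul` treats a 3-D array as a stack of matrices, so a single call can multiply every column vector by the same matrix. The matrix `W` below is made up purely for illustration and is not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Shape (3, 3, 1): a stack of three column vectors, one per row of df
vectors = df.values[..., None]

# Hypothetical 2x3 matrix, chosen only for this example
W = np.array([[1, 0, 0],
              [0, 1, 1]])

# matmul broadcasts W over the stack, giving shape (3, 2, 1)
stacked = np.matmul(W, vectors)

# Same computation done one vector at a time, for comparison
looped = [W @ v for v in vectors]
print(stacked.shape, np.array_equal(stacked, np.array(looped)))
```

Whether this fits depends on the operation, but many NumPy routines accept stacked inputs like this, which avoids building a Python list at all.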

Under the hood, np.vsplit uses np.array_split, which is basically a loop. So a more performant way is to avoid the extra function-call overhead and call np.array_split directly:

np.array_split(df.values[...,None],df.shape[0])
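
A small check, using the same example data, that both calls return the same pieces and what shape each piece has. The names `via_vsplit` and `via_array_split` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])
expanded = df.values[..., None]  # shape (3, 3, 1)

via_vsplit = np.vsplit(expanded, df.shape[0])
via_array_split = np.array_split(expanded, df.shape[0])

# Compare the two lists piece by piece
same = all(np.array_equal(a, b) for a, b in zip(via_vsplit, via_array_split))
print(same, [p.shape for p in via_array_split])
```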

Note that this has one more dimension than the expected output. If you want a squeezed version, we could use a list comprehension over the array with the new axis already added, for example:

In [357]: [i for i in df.values[...,None]]
Out[357]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

Thus, another way would be to add the new axis inside the loop -

[i[...,None] for i in df.values]
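
The two list comprehensions can be checked against each other with a short sketch like the following, which assumes the same example DataFrame; `first` and `second` are just illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Add the axis once, then iterate over the rows
first = [i for i in df.values[..., None]]
# Iterate over the rows, adding the axis to each one
second = [i[..., None] for i in df.values]

match = all(np.array_equal(a, b) for a, b in zip(first, second))
print(match, first[0].shape, second[0].shape)
```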
