How do I efficiently get a numpy array for a subset of columns from my dataframe?

Motivation

I often answer questions where I advocate converting dataframe values to a bare numpy array for faster computation. However, there are some caveats to doing this, and some ways of doing it are better than others.

I'll provide my own answer as a way of giving back to the community. I hope you find it useful.

Problem
Consider a dataframe df

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))
print(df)

   A  B  C  D
0  1  x  9  4
1  2  y  8  5
2  3  z  7  6


with dtypes

print(df.dtypes)

A     int64
B    object
C     int64
D     int64
dtype: object


I want to create a numpy array a consisting of the values from columns A and C. Assume there can be many columns, and that I am targeting the two specific columns A and C.

What I tried

I could do:

df[['A', 'C']].values

array([[1, 9],
       [2, 8],
       [3, 7]])


That certainly works!

However, I can do it faster with numpy:

p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]

array([[1, 9],
       [2, 8],
       [3, 7]], dtype=object)


It's faster, but inaccurate. Notice the dtype=object. I want integers!

p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)

array([[1, 9],
       [2, 8],
       [3, 7]])


This is now correct, but I might not have known ahead of time that those columns held integers.
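To illustrate the risk, here is a minimal sketch with a hypothetical variant df_f whose A column holds floats; the blind cast to int silently truncates them.

df_f = pd.DataFrame(dict(A=[1.5, 2.5, 3.5], C=[9, 8, 7]))  # hypothetical float variant
p = [df_f.columns.get_loc(i) for i in ['A', 'C']]
print(df_f.values[:, p].astype(int))  # [[1 9] [2 8] [3 7]] -- 1.5 was silently truncated to 1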

Timing

# Clear and accurate, but slower
%%timeit 
df[['A', 'C']].values
1000 loops, best of 3: 347 µs per loop

# Not accurate, but close and fast
%%timeit 
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
10000 loops, best of 3: 59.2 µs per loop

# Accurate for this test case and fast, needs to be more generalized.
%%timeit 
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
10000 loops, best of 3: 59.3 µs per loop






2 answers


pandas does not store one single array for the entire dataframe in the values attribute. When you access the values attribute on a dataframe, it builds an array from the underlying objects that are actually stored, namely pd.Series objects. It is useful to think of a dataframe as a pd.Series of pd.Series, where each column is one of the pd.Series contained in the dataframe. Each column can have a dtype that differs from the rest; this is part of what makes dataframes so useful. However, a numpy array must have a single dtype. So when we access the values attribute on the dataframe, pandas goes into each column, fetches the data from each of the corresponding values attributes, and concatenates them together. If the columns' dtypes are not compatible, the dtype of the resulting array is forced to object.
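A quick demonstration of that coercion, using the df from the question:

print(df.values.dtype)                   # object -- column B drags the whole array to object
print(df[['A', 'C', 'D']].values.dtype)  # int64  -- a homogeneous subset keeps its native dtype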

Option 1
Slow but accurate

a = df[['A', 'C']].values


The reason this is slow is that you are asking pandas to construct a new dataframe, df[['A', 'C']], and then build the array a by hitting each of the new dataframe's column values attributes.
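A minimal sketch of those two steps, assuming the df from the question (sub is just an illustrative name):

sub = df[['A', 'C']]   # step 1: pandas constructs a brand-new DataFrame
a = sub.values         # step 2: the new dataframe's column arrays are concatenated
print(type(sub).__name__, a.dtype)  # DataFrame int64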

Option 2
Find the column positions, then slice values

c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])


This is better because we build the array of values without reconstructing a new dataframe. However, I'm assuming consistent dtypes across the target columns, which is why I cast with the first column's dtype; if any other casting needs to happen, this approach is not well equipped to handle it.
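Here is a sketch of that caveat with the df from the question: df.values is already an object array, and casting fails outright if one of the selected columns is not numeric.

p = [df.columns.get_loc(i) for i in ['A', 'B']]  # B holds strings
try:
    df.values[:, p].astype(np.int64)
except ValueError as err:
    print(err)  # e.g. invalid literal for int() with base 10: 'x'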

Option 3
My preferred approach
Only access the column values that matter to me

a = np.column_stack([df[col].values for col in ['A', 'C']])




This uses the pandas dataframe as a container of pd.Series, in which I access the values attribute only for the columns I care about, then build a new array from those per-column arrays. If any casting needs to happen, numpy will handle it.
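A small check of how numpy resolves the dtype when stacking, using the df from the question:

print(np.column_stack([df[c].values for c in ['A', 'C']]).dtype)  # int64
print(np.column_stack([df[c].values for c in ['A', 'B']]).dtype)  # object -- mixing in B upcasts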


All approaches give the same result

array([[1, 9],
       [2, 8],
       [3, 7]])



Timing
small data

%%timeit 
a = df[['A', 'C']].values
1000 loops, best of 3: 338 µs per loop

%%timeit 
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
10000 loops, best of 3: 166 µs per loop

%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 7.36 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.97 µs per loop


big data

from string import ascii_uppercase

df = pd.concat(
    [df.join(pd.DataFrame(
                np.random.randint(10, size=(3, 22)),
                columns=list(ascii_uppercase[4:])
            ))] * 10000, ignore_index=True
)


%%timeit 
a = df[['A', 'C']].values
The slowest run took 23.28 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 371 µs per loop

%%timeit 
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
100 loops, best of 3: 9.62 ms per loop

%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 55.6 µs per loop






try this:

np.array(list(zip(df['A'].values, df['C'].values)))  # list() is needed on Python 3, where zip is lazy

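A quick sanity check, assuming the 3-row df from the question:

a = np.array(list(zip(df['A'].values, df['C'].values)))
print(a)        # [[1 9] [2 8] [3 7]]
print(a.dtype)  # int64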

timeit:



%%timeit
np.array(list(zip(df['A'].values, df['C'].values)))


The slowest run took 5.51 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 17.8 µs per loop



