How do I efficiently get a numpy array for a subset of columns from my dataframe?
Motivation
I often answer questions where I advocate converting dataframe values to a plain numpy array for faster computation. However, there are some caveats to this, and some ways of doing it are better than others.
I will post my own answer to give back to the community. I hope you find this useful.
Problem
Consider a dataframe df
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))
print(df)
A B C D
0 1 x 9 4
1 2 y 8 5
2 3 z 7 6
with these dtypes
print(df.dtypes)
A int64
B object
C int64
D int64
dtype: object
I want to create a numpy array a consisting of the values from columns A and C. Suppose there can be many columns and that I am targeting the two specific columns A and C.
What I tried
I could do:
df[['A', 'C']].values
array([[1, 9],
[2, 8],
[3, 7]])
That certainly works! However, I can do it faster with numpy:
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
array([[1, 9],
[2, 8],
[3, 7]], dtype=object)
It's faster, but imprecise. Note the dtype=object: df.values is built from all of the columns, including the object-dtyped column B, so the slice inherits that dtype. I want integers!
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
array([[1, 9],
[2, 8],
[3, 7]])
This is correct now, but I may not have known that I had integers.
Timing
# Clear and accurate, but slower
%%timeit
df[['A', 'C']].values
1000 loops, best of 3: 347 µs per loop
# Not accurate, but close and fast
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
10000 loops, best of 3: 59.2 µs per loop
# Accurate for this test case and fast, needs to be more generalized.
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
10000 loops, best of 3: 59.3 µs per loop
Answer
pandas does not store a single array for the entire dataframe in the values attribute. When you access the values attribute of a dataframe, it builds an array from the underlying objects that are actually stored, namely pd.Series objects. It is useful to think of a dataframe as a pd.Series of pd.Series, where each column is one of those pd.Series contained in the dataframe. Each column can have a dtype that is different from the rest; this is part of what makes dataframes so useful. However, a numpy array must have a single dtype. So when we access the values attribute of the dataframe, pandas goes into each column, fetches the data from each column's own values attribute, and concatenates the results together. If the dtypes of the relevant columns are not compatible, the dtype of the resulting array is forced to object.
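A quick check with the example df above makes this concrete: each column's own values array keeps its dtype, while the frame-wide array falls back to object because of column B.

# Each column's values array keeps that column's dtype...
print(df['A'].values.dtype)  # int64
print(df['B'].values.dtype)  # object
# ...but the concatenated frame-wide array must pick a single dtype,
# and the object column forces everything to object.
print(df.values.dtype)       # object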
Option 1
Slow but accurate
a = df[['A', 'C']].values
The reason this is slow is that you are asking pandas to build a new dataframe df[['A', 'C']], and then build the array a by hitting the values attribute of each of the new dataframe's columns.
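One rough way to see where the time goes is to time the two steps separately; this is a sketch of my own, and the exact numbers will vary.

# Step 1: construct the two-column subset frame
%timeit df[['A', 'C']]

# Step 2: pull the array off an already-built subset
sub = df[['A', 'C']]
%timeit sub.values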
Option 2
Find the column positions, then slice values
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
This is better because we build the array of values without constructing a new dataframe first. I'm reasonably confident we get an array with a consistent dtype. But if any casting needs to happen, this approach is not well equipped to handle it: casting to the dtype of the first selected column is only correct when all the selected columns share that dtype.
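If the selected columns might not all share the first column's dtype, one possible generalization (a sketch of mine, not covered by the timings below) is to let numpy's type-promotion rules pick the common dtype:

import numpy as np

c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
# np.result_type applies numpy's promotion rules to all the selected
# columns' dtypes instead of trusting the first column alone.
common = np.result_type(*df.dtypes[c])  # int64 for this example
a = df.values[:, p].astype(common)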
Option 3
My preferred approach
Only access the column values that matter to me
a = np.column_stack([df[col].values for col in ['A', 'C']])
This uses the dataframe as a container of pd.Series, in which I access the values attribute only for the columns I care about, then build a new array from those arrays. If any casting needs to happen, numpy will handle it.
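To illustrate that last point with a toy example of my own: stacking columns of different numeric dtypes triggers numpy's usual promotion rules rather than an object fallback.

# int64 stacked with float64 promotes to float64, not object
a = np.column_stack([df['A'].values, df['D'].values.astype(float)])
print(a.dtype)  # float64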
All approaches give the same result
array([[1, 9],
[2, 8],
[3, 7]])
Timing
small data
%%timeit
a = df[['A', 'C']].values
1000 loops, best of 3: 338 µs per loop
%%timeit
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
10000 loops, best of 3: 166 µs per loop
%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 7.36 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.97 µs per loop
big data
from string import ascii_uppercase

df = pd.concat(
    [df.join(pd.DataFrame(
        np.random.randint(10, size=(3, 22)),
        columns=list(ascii_uppercase[4:])
    ))] * 10000, ignore_index=True
)
%%timeit
a = df[['A', 'C']].values
The slowest run took 23.28 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 371 µs per loop
%%timeit
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
100 loops, best of 3: 9.62 ms per loop
%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 55.6 µs per loop