Seemingly inconsistent syntax for column references when chaining methods on pandas dataframe
I'm a little confused as to why the syntax for referring to a column in a pandas dataframe differs depending on which method is called. Take the following method chain:
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
(iris
.loc[:, ['SepalLength', 'PetalWidth', 'Species']]
.where(iris['SepalLength'] > 4.6)
.assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)
.groupby('Species')
.agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'}))
There are three different types of syntax used to refer to columns in the iris data frame:
- loc, groupby, and agg all understand that a string refers to a column in the data frame.
- where requires the data frame to be referenced explicitly.
- Explicitly referencing the data frame in the assign method would cause the operation to be performed on the original iris data frame, not on the copy that has been modified by the calls to loc and where. A lambda is required here to refer to the current state of the modified data frame copy.
In addition to the above, there is also query, which takes the entire expression as a string: iris.query('SepalLength > 4.6'), but here the pandas documentation explicitly indicates that this is for special use cases:
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying.
To give an example of what I mean by a consistent syntax for column references, compare the R package dplyr, where columns in the data frame are referred to with the same syntax throughout the entire chain of piped function calls.
library(dplyr)
# The iris data set is preloaded in R
colnames(iris) = c('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species')
iris %>%
    select(SepalLength, PetalWidth, Species) %>%
    filter(SepalLength > 4.6) %>%
    mutate(PetalWidth2x = PetalWidth * 2) %>%
    group_by(Species) %>%
    summarise(SepalLength = mean(SepalLength), PetalWidth2x = sd(PetalWidth2x))
Does pandas gain any advantages from these different ways of referencing data frame columns, instead of using the simple string syntax of loc, groupby, and agg for all methods (and if so, what are those advantages)? Or is it more of a workaround for some underlying problem with using strings for column names in the assign and where methods?
To quote Marius's comment:
I think the biggest difference between pandas and dplyr is that pandas works within the existing syntax rules of Python, which are pretty strict about what unquoted symbols (essentially objects in the current scope) can represent ...
I think this is correct, so let me expand on it a little.
loc, groupby, and agg all understand that a string refers to a column in the data frame.
.loc[:, ['SepalLength', 'PetalWidth', 'Species']]
.groupby('Species')
.agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'}))
In all three cases, the string is a valid object in this context. That is, the string itself provides enough information to carry out the operation. Unlike...
where requires the data frame to be referenced explicitly.
.where(iris['SepalLength'] > 4.6)
In the case of where, Python requires the > operator to operate on something. By selecting a specific column of the dataframe, we obtain an object representing that column, and that object's __gt__ method is what gets called.
If we wanted the syntax to look like this:
.where('SepalLength' > 4.6)
we would need to tell Python in some way what the > operator means in this context. The expression is evaluated before it is passed to where. The existing language feature for this is to provide our own object with the appropriate methods, and that is what the pandas designers have done. The default > operation for strings is simply not useful in this context.
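To make the evaluation order concrete, here is a minimal sketch. It assumes Python 3 and uses a small made-up frame named df in place of the real iris data; the variable names are only for illustration.

```python
import pandas as pd

# Small stand-in for the iris data (values are made up for illustration).
df = pd.DataFrame({"SepalLength": [4.4, 4.7, 5.1]})

# A plain string knows nothing about data frames, so Python evaluates the
# comparison on its own, before `where` would ever be called.
try:
    "SepalLength" > 4.6
    string_comparison_error = None
except TypeError as exc:
    string_comparison_error = exc

# Selecting the column yields a Series, and the Series defines __gt__,
# so the same operator now produces an element-wise boolean Series.
mask = df["SepalLength"] > 4.6
same_mask = df["SepalLength"].__gt__(4.6)

# `where` only ever sees the finished boolean Series.
filtered = df.where(mask)
```

In Python 3 the string comparison does not even produce a (useless) result; it raises a TypeError, whereas the column comparison yields a mask that where can consume.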
Explicitly referencing the data frame in the assign method would cause the operation to be performed on the original iris data frame, not on the copy that has been modified by the calls to loc and where. A lambda is required here to refer to the current state of the modified data frame copy.
.assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)
If .assign were used as the first method on the dataframe, before any filtering, we could simply write it as
.assign(PetalWidthx2 = iris['PetalWidth'] * 2)
since the variable iris already exists and is identical to the dataframe we want to operate on.
However, since the preceding calls to .loc and .where change the dataframe that we want to call .assign on, it is no longer identical to the iris dataframe, and there is no variable that references the changed dataframe. Since pandas works within existing Python syntax rules, it can use a lambda, which in this context essentially permits operations on self: the current state of the object that .assign is called on. There is an example of this in the docs.
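The difference between the two forms can be seen in a small sketch. The frame below is a made-up stand-in for iris, and with_lambda and with_original are names chosen here for illustration.

```python
import pandas as pd

# Made-up stand-in for the iris data.
iris = pd.DataFrame({
    "SepalLength": [4.4, 4.7, 5.1],
    "PetalWidth": [0.2, 0.3, 0.4],
})

# The lambda receives the frame that .assign is called on, i.e. the result
# of .where, in which the first row has already been replaced by NaN.
with_lambda = (iris
               .where(iris["SepalLength"] > 4.6)
               .assign(PetalWidthx2=lambda current: current["PetalWidth"] * 2))

# Referring to `iris` directly computes the new column from the original,
# unfiltered data, so the masked-out row still gets a value.
with_original = (iris
                 .where(iris["SepalLength"] > 4.6)
                 .assign(PetalWidthx2=iris["PetalWidth"] * 2))
```

With the lambda, the new column is NaN in the row that where masked out; with the direct reference, that row is filled from the original data, which is usually not what the chain intends.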
The assign method accepts **kwargs, which allows an arbitrary number of parameters (the names of the new columns) and their arguments (the values for the new columns) to be specified. The parameter=argument pairs in **kwargs are interpreted internally as key: value pairs of a dictionary, as can be seen in the source.
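As a rough illustration of that mechanism, the following is a hypothetical, heavily simplified stand-in written in plain Python. The function assign_sketch and its dict-of-lists "frame" are assumptions made for this sketch; it is not the pandas source.

```python
def assign_sketch(frame, **kwargs):
    """Hypothetical, simplified imitation of an assign-style method."""
    result = dict(frame)  # shallow copy, so the input mapping is left untouched
    # Inside the function, the keyword arguments are an ordinary dict:
    # parameter names are the keys, arguments are the values.
    for name, value in kwargs.items():
        # A callable is given the current state of the copy;
        # any other value is stored as is.
        result[name] = value(result) if callable(value) else value
    return result


columns = {"PetalWidth": [0.2, 0.3]}
out = assign_sketch(
    columns,
    PetalWidthx2=lambda current: [w * 2 for w in current["PetalWidth"]],
    Constant=1,
)
```

This also shows why the new column name is written without quotes in assign: it is a Python keyword argument name, not a string looked up in the data frame.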
In addition to the above, there is also query, which accepts the entire expression as a string: iris.query('SepalLength > 4.6'), but here the pandas documentation explicitly states that this is for special use cases.
In the case of query, the string passed in is an expression that will be compiled and executed by a backend, which is generally much faster than executing Python code. This is a special case, as the available operations are quite limited and the setup time for the backend is long, so it is really only useful for fairly large datasets.
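For comparison, here is a small sketch of query next to the equivalent boolean mask. The two frames and their values are made up, and the sketch says nothing about speed; it only shows that the same expression string can be reused across frames that share a column name.

```python
import pandas as pd

# Two made-up frames that share the column name "SepalLength".
df = pd.DataFrame({
    "SepalLength": [4.4, 4.7, 5.1],
    "Species": ["setosa", "setosa", "virginica"],
})
other = pd.DataFrame({"SepalLength": [4.0, 6.3]})

expr = "SepalLength > 4.6"

# The string is parsed and evaluated by pandas against the frame it is
# called on, so no frame has to be named inside the expression.
by_query = df.query(expr)
other_by_query = other.query(expr)

# The equivalent selection with an explicit boolean mask.
by_mask = df[df["SepalLength"] > 4.6]
```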