Seemingly inconsistent syntax for column references when chaining methods on pandas dataframe

I'm a little confused as to why the syntax for referring to a column in a pandas dataframe differs depending on which method is called. Consider the following method chain:

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
(iris
    .loc[:, ['SepalLength', 'PetalWidth', 'Species']]
    .where(iris['SepalLength'] > 4.6)
    .assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)
    .groupby('Species')
    .agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'}))

There are three different types of syntax used to refer to columns of the iris data frame:

  • loc, groupby and agg all understand that a string refers to a column of the data frame.
  • where requires the data frame to be referenced explicitly.
  • Explicitly referencing the data frame inside assign performs the operation on the original iris data frame, not on the copy modified by the calls to loc and where. A lambda is required here to refer to the current state of the modified copy.
  • In addition to the above, there is also query, which takes the whole expression as a string: iris.query('SepalLength > 4.6'). But here the pandas documentation explicitly states that this is for special use cases:

    A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying.

To show what I mean by consistent column syntax, compare with the R package dplyr, where columns of the data frame are referred to with the same syntax throughout the whole chain of function calls:

library(dplyr)

# The iris data set is preloaded in R
colnames(iris) = c('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species')
iris %>% 
    select(SepalLength, PetalWidth, Species) %>% 
    filter(SepalLength > 4.6) %>%  
    mutate(PetalWidth2x = PetalWidth * 2) %>% 
    group_by(Species) %>% 
    summarise(SepalLength = mean(SepalLength), PetalWidth2x = sd(PetalWidth2x))

Does pandas gain any advantages from these different ways of referring to the columns of a data frame, instead of extending the simpler string syntax of loc, groupby and agg to all methods (and if so, what are those advantages)? Or is this rather a workaround for some underlying problem with using strings for column names in the assign and where methods?



1 answer


To quote Marius's comment:

I think the biggest difference between pandas and dplyr is that pandas works within existing Python syntax rules, which are pretty strict about what bare names (mostly: objects in the current scope) can represent...

I think this is correct, so let me expand on it a little.


loc, groupby and agg all understand that a string refers to a column of the data frame.

.loc[:, ['SepalLength', 'PetalWidth', 'Species']]
.groupby('Species')
.agg({'SepalLength': 'mean', 'PetalWidthx2': 'std'})

In all three cases, the string alone is valid in this context: the string itself provides enough information to complete the operation. Unlike...


where requires the data frame to be referenced explicitly.

.where(iris['SepalLength'] > 4.6)

In the case of where, Python requires the > operator to operate on something. By selecting a specific column of the data frame, we provide an object for the comparison, and the __gt__ method of that object is called.
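As a minimal sketch (using a small hand-built frame rather than the full iris data), the comparison is evaluated first and produces a boolean Series, and that Series is all where ever sees:

```python
import pandas as pd

df = pd.DataFrame({'SepalLength': [4.5, 5.0, 4.6, 5.5]})

# The comparison runs first: Series.__gt__ returns a boolean mask.
mask = df['SepalLength'] > 4.6
print(mask.tolist())                               # [False, True, False, True]

# Calling __gt__ explicitly gives the identical result.
print(mask.equals(df['SepalLength'].__gt__(4.6)))  # True

# .where() only receives that mask; rows where it is False become NaN.
print(df.where(mask)['SepalLength'].tolist())
```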

If we wanted the syntax to look like this:

.where('SepalLength' > 4.6)



we would need some way of telling Python what the > operator should mean in this context, but the expression is evaluated before it is ever passed to where. The existing language feature for this is to provide our own object that defines the comparison methods, and that is what the pandas designers did for Series. The default > operation on a plain string is simply not useful in this context (in Python 3 it is not even defined against a float).
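This can be checked directly: the string comparison is evaluated before where is ever called, and in Python 3 it simply raises:

```python
# Evaluated before any pandas code could run; in Python 3 a str
# cannot be ordered against a float, so this raises TypeError.
try:
    'SepalLength' > 4.6
except TypeError as exc:
    print(exc)   # '>' not supported between instances of 'str' and 'float'
```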


Explicitly referencing the data frame inside assign performs the operation on the original iris data frame, not on the copy modified by the calls to loc and where. A lambda is required here to refer to the current state of the modified copy.

.assign(PetalWidthx2 = lambda x_iris: x_iris['PetalWidth'] * 2)

If .assign were used as the first method on the data frame, before any filtering, we could simply write it as

.assign(PetalWidthx2 = iris['PetalWidth'] * 2)

since the variable iris already exists and is identical to the data frame we want to operate on.

However, since the previous calls to .loc and .where modify the data frame we want to call .assign on, it is no longer identical to iris, and no variable refers to the modified copy. Because pandas works within existing Python syntax rules, it can use a lambda, which in this context essentially allows operations on self: the current state of the object that .assign is called on. There is an example of this in the docs.
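A small sketch (with a hypothetical toy frame, not the iris data) makes the difference visible: referencing the original variable inside assign ignores the masking done earlier in the chain, while the lambda receives the intermediate result:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})

# Referencing df directly computes the new column from the ORIGINAL,
# unmasked data, even though .where() has already blanked out rows.
direct = df.where(df['x'] > 2).assign(x2=df['x'] * 2)
print(direct['x2'].tolist())       # [2.0, 4.0, 6.0, 8.0]

# The lambda is called with the intermediate (already-masked) frame,
# so the masked rows stay NaN in the new column as well.
chained = df.where(df['x'] > 2).assign(x2=lambda d: d['x'] * 2)
print(chained['x2'].tolist())      # [nan, nan, 6.0, 8.0]
```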

This relies on the **kwargs mechanism, which allows passing an arbitrary number of keyword arguments (the names of the new columns) with their values (the contents of the new columns). The parameter=argument pairs are interpreted internally as key: value pairs of a dict, as can be seen from the source.
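In plain Python, the mechanism looks like this (a toy stand-in to illustrate **kwargs, not the actual pandas source):

```python
def assign_like(**kwargs):
    # Keyword arguments arrive as an ordinary dict: each
    # parameter=argument pair becomes a key: value entry.
    return {name: value for name, value in kwargs.items()}

print(assign_like(PetalWidthx2=2, Note='demo'))
# {'PetalWidthx2': 2, 'Note': 'demo'}
```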


In addition to the above, there is also query, which takes the whole expression as a string: iris.query('SepalLength > 4.6'). But here the pandas documentation explicitly states that this is for special use cases

In the case of query, the string passed in is an expression that is compiled and executed by a backend, which is generally much faster than executing the equivalent Python code. It is a special case because the available operations are quite limited and the setup time for the backend is long, so it is really only useful for fairly large data sets.
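For completeness, a sketch showing that query is just another spelling of a boolean-mask filter (toy data; whether the faster numexpr backend is actually used depends on the installation and the size of the frame):

```python
import pandas as pd

df = pd.DataFrame({'SepalLength': [4.5, 5.0, 4.6, 5.5]})

# The whole condition travels as a string and is parsed by pandas.
by_query = df.query('SepalLength > 4.6')

# Equivalent ordinary boolean indexing.
by_mask = df[df['SepalLength'] > 4.6]

print(by_query.equals(by_mask))   # True
```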
