How can I remove rows from a Pandas frame based on data across multiple columns?

Question


I know how to delete rows based on simple criteria, as in this question, but I need to delete rows using more complex criteria.

My situation: I have rows of data where each row has four columns of numeric codes. I need to discard every row that does not have at least one code with a leading digit less than 5. Currently I have a function that I use with dataframe.apply to create a new column, "hold", and populate it with 1 if the row should be kept. Then I do a second pass using this simple "hold" column to remove the unneeded rows. I am looking for a way to do this in one pass without creating a new column.

Sample data:

     a    b    c    d
0  145  567  999  876
1  999  876  543  543

In this data, I would like to keep the first row, because in column "a" the leading digit is less than 5. The second row has no columns with a leading digit less than 5, so the row must be discarded.


python pandas

Gregory Arenius May 21 '15 at 18:17


1 answer

EdChum · Accepted Answer · 2015-05-21T18:25:19+0000

This should work, assuming the columns hold the codes as strings:

In [31]:
df[(df.apply(lambda x: x.str[0].astype(int))).lt(5).any(axis=1)]

Out[31]:
     a    b    c    d
0  145  567  999  876

So basically this takes the first character of each value using the vectorised str accessor, casts it to int, then calls lt (less than) element-wise to create a boolean df, then calls any with axis=1 to reduce each row to a single boolean. The result is a boolean mask on the index, which is used to mask df. Breaking the above down:

In [34]:
df.apply(lambda x: x.str[0].astype(int))

Out[34]:
   a  b  c  d
0  1  5  9  8
1  9  8  5  5

In [35]:    
df.apply(lambda x: x.str[0].astype(int)).lt(5)

Out[35]:
       a      b      c      d
0   True  False  False  False
1  False  False  False  False

In [37]:    
df.apply(lambda x: x.str[0].astype(int)).lt(5).any(axis=1)

Out[37]:
0     True
1    False
dtype: bool
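
The breakdown above can be reproduced end to end with the following minimal sketch. It assumes the sample frame from the question with the codes stored as strings; the names leading, mask and kept are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data from the question, stored as strings so that .str works.
df = pd.DataFrame(
    {
        "a": ["145", "999"],
        "b": ["567", "876"],
        "c": ["999", "543"],
        "d": ["876", "543"],
    }
)

# First character of every code, converted to an integer.
leading = df.apply(lambda col: col.str[0].astype(int))

# True for rows where at least one leading digit is below 5.
mask = leading.lt(5).any(axis=1)

# Boolean indexing keeps only the rows flagged True; no helper column is added.
kept = df[mask]
print(kept)
```

If the codes are stored as integers rather than strings, converting first with df.astype(str) should make the same expression applicable.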

EDIT

To handle NaN values, add a call to dropna:

In [39]:
import io
import pandas as pd

t="""a,b,c,d
0,145,567,999,876
1,999,876,543,543
2,,324,344"""
df = pd.read_csv(io.StringIO(t),dtype=str)
df

Out[39]:
     a    b    c    d
0  145  567  999  876
1  999  876  543  543
2  NaN  324  344  NaN

In [44]:
df[(df.apply(lambda x: x.dropna().str[0].astype(int))).lt(5,axis=0).any(axis=1)]

Out[44]:
     a    b    c    d
0  145  567  999  876
2  NaN  324  344  NaN
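
As a possible alternative to calling dropna inside apply, the leading character can be converted with pd.to_numeric and errors="coerce", so that missing values stay NaN and compare as False. This is a sketch of a different technique, not the method from the answer above; the frame is built directly instead of through read_csv, and the names leading, mask and kept are illustrative.

```python
import pandas as pd

# Same data as the NaN example, built directly; None marks the missing codes.
df = pd.DataFrame(
    {
        "a": ["145", "999", None],
        "b": ["567", "876", "324"],
        "c": ["999", "543", "344"],
        "d": ["876", "543", None],
    }
)

# Leading character per cell; missing cells become NaN instead of raising.
leading = df.apply(lambda col: pd.to_numeric(col.str[0], errors="coerce"))

# NaN < 5 evaluates to False, so missing cells never cause a row to be kept.
mask = leading.lt(5).any(axis=1)

kept = df[mask]
print(kept)
```

Rows 0 and 2 are kept here, matching the output of the dropna version.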
