Pandas: map values to a new column, with a default for missing codes

I have a dictionary of keys and values. I want to map the numbers in a dataframe column, where the original column values are the keys and the new column gets the corresponding values.

However, any values that are not included in the dictionary must be coded as 999.

Original data frame:

     Col1
0    02
1    03
2    02
3    02
4    04
5    88
6    77


Vocabulary:

codes = {'01':'05',
         '02':'06',
         '03':'07',
         '04':'08'}


Expected Result:

>>> df['ColNew'] = df['Col1'].map(codes)

     ColNew
0    06
1    07
2    06
3    06
4    08
5    999
6    999


I'm not sure how to do this other than adding the 999 entries to the dictionary in the first place, which is tedious when there are more than a hundred codes involved and only a few of them should map to anything other than 999.



3 answers


Use map with dict.get, which lets you pass a default value for when a key doesn't exist:
df['ColNew'] = df['Col1'].map(lambda x: codes.get(x, 999))

df

  Col1 ColNew
0   02     06
1   03     07
2   02     06
3   02     06
4   04     08
5   88    999
6   77    999




This also preserves the dtype. In this case it doesn't matter, since the column's dtype is object. However, if it were int, map would convert it to float when NaN is returned for missing keys. With the default, we avoid that type conversion.
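To illustrate the dtype point, here is a minimal sketch with a hypothetical integer column (names are made up for the example):

```python
import pandas as pd

# Hypothetical integer column and a mapping that misses key 3
s = pd.Series([1, 2, 3], dtype='int64')
mapping = {1: 10, 2: 20}

# Plain map: key 3 is missing, NaN appears, dtype becomes float64
print(s.map(mapping).dtype)                         # float64

# map with dict.get and an int default: no NaN, dtype stays int64
print(s.map(lambda x: mapping.get(x, 999)).dtype)   # int64
```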



Note: this answer falls short of piRSquared's because of the type conversion it incurs.

You can simply fill in the NaNs afterwards:

df['ColNew'] = df.Col1.map(codes).fillna('999')


Result:



     ColNew
0    06
1    07
2    06
3    06
4    08
5    999
6    999



Interestingly, the na_action parameter of Series.map is not a default mapping argument, as I was initially tempted to think.

Its purpose is to control whether NaN values are passed to the mapping function; if you don't need to map them anyway, you may see a performance gain by setting na_action='ignore'.
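A small sketch of that difference, using a hypothetical column that already contains a NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series(['02', np.nan, '04'])
codes = {'02': '06', '04': '08'}

# Without na_action, NaN is passed to the function and gets the default
print(s.map(lambda x: codes.get(x, '999')).tolist())  # ['06', '999', '08']

# With na_action='ignore', NaN rows are skipped and stay NaN
print(s.map(lambda x: codes.get(x, '999'), na_action='ignore').tolist())
```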



Another way to skin this cat:

new_codes = {k: codes[k] if k in codes else '999' for k in set(df['Col1'])}
df['ColNew'] = df['Col1'].map(new_codes)

df
Out[126]: 
  Col1 ColNew
0   02     06
1   03     07
2   02     06
3   02     06
4   04     08
5   88    999
6   77    999


Some timings:

%timeit df['Col1'].map({k: codes[k] if k in codes else '999' for k in set(df['Col1'])})
1000 loops, best of 3: 373 µs per loop

%timeit df['Col1'].map(lambda x: codes.get(x, 999))
10000 loops, best of 3: 133 µs per loop

%timeit df.Col1.map(codes).fillna('999')
The slowest run took 92.77 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 575 µs per loop


Looks like piRSquared's answer is about 64% faster than mine!







