Pandas: display new column except for some codes
I have a dictionary of keys and values. I want to "match" numbers in a dataframe column where the original column are the keys and the new column are the values.
However, any values that are not included in the dictionary must be encoded as 999.
Original data frame:
Col1
0 02
1 03
2 02
3 02
4 04
5 88
6 77
Vocabulary:
codes = {'01':'05',
'02':'06',
'03':'07',
'04':'08'}
Expected Result:
>>> df['ColNew'] = df['Col1'].map(codes)
ColNew
0 06
1 07
2 06
3 06
4 08
5 999
6 999
I'm not sure how to do this other than including the 999 codes in the dictionary in the first place. It's frustrating when there are more than a hundred codes involved and only a few of them should be anything other than 999.
source to share
use map
and lets you pass a default value in case it doesn't exist. dict.get
dict.get
key
df['ColNew'] = df['Col1'].map(lambda x: codes.get(x, 999))
df
Col1 ColNew
0 02 06
1 03 07
2 02 06
3 02 06
4 04 08
5 88 999
6 77 999
This will also save dtypes
. In this case, it doesn't matter since the dtype
column object
.
However, if he was int
, map
turned him in float
when he NaN
returns. With the default, we avoid type conversion.
source to share
Note. This is an incomplete answer to piRSquared's request due to type conversion:
After that, you can just fill in the NaN.
df['ColNew'] = df.Col1.map(codes).fillna('999')
Result:
ColNew
0 06
1 07
2 06
3 06
4 05
5 999
6 999
Interestingly, the parameter is na_action
Series.map
not used as the default display argument since I was initially tempted to think.
Its purpose is to control whether NaN values affect the matching function - if you haven't matched them in any way, you will see a potential performance gain by setting na_action='ignore'
.
source to share
Another way to trick this cat:
new_codes = {k: codes[k] if k in codes else '999' for k in set(df['Col1'])}
df['ColNew'] = df['Col1'].map(new_codes)
df
Out[126]:
Col1 ColNew
0 02 06
1 03 07
2 02 06
3 02 06
4 04 08
5 88 999
6 77 999
Some timings:
%timeit df['Col1'].map({k: codes[k] if k in codes else '999' for k in set(df['Col1'])})
1000 loops, best of 3: 373 µs per loop
%timeit df['Col1'].map(lambda x: codes.get(x, 999))
10000 loops, best of 3: 133 µs per loop
%timeit df.Col1.map(codes).fillna('999')
The slowest run took 92.77 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 575 µs per loop
Looks like piRSquared's answer is about 64% faster than mine!
source to share