Factorize with pandas on two columns
I have some data in a pandas frame that looks like this:
CAR_TYPE MILEAGE
FORD 100
FORD 100
FORD 200
FORD 300
VW 100
VW 150
VW 150
VW 300
I want to "parse" the data to return a unique ID for each pair. However, I want the unique id "reset" to be zero for individual cars. Currently my factorization uses the following:
df['CAR_ID']=pd.factorize(pd.lib.fast_zip([df.CAR_TYPE.values, df.MILEAGE.values]))[0]
df.CAR_ID=df[['CAR_ID', 'CAR_TYPE']].astype(str).apply(lambda x: ''.join(x), axis=1)
Gives me something like
CAR_TYPE MILEAGE CAR_ID
FORD 100 FORD0
FORD 100 FORD0
FORD 200 FORD1
FORD 300 FORD2
VW 100 VW3
VW 150 VW4
VW 150 VW4
VW 300 VW5
Ideally I would like
CAR_TYPE MILEAGE IDEAL_CAR_ID
FORD 100 FORD0
FORD 100 FORD0
FORD 200 FORD1
FORD 300 FORD2
VW 100 VW0
VW 150 VW1
VW 150 VW1
VW 300 VW2
Apologies for the relatively stupid question in minds after a long day. I know its something that could be solved with a stack / stack, reset_index / set_index.
source to share
You can use groupby
with rank
if the values ββin MILEAGE
are sorted by each group:
a = df.groupby(['CAR_TYPE'])['MILEAGE'].rank(method='dense') \
.sub(1).astype(int).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
CAR_TYPE MILEAGE IDEAL_CAR_ID
0 FORD 100 FORD0
1 FORD 100 FORD0
2 FORD 200 FORD1
3 FORD 300 FORD2
4 VW 100 VW0
5 VW 150 VW1
6 VW 150 VW1
7 VW 300 VW2
Another solution with factorize
:
a = df.groupby(['CAR_TYPE'])['MILEAGE'] \
.transform(lambda x: pd.factorize(x)[0]).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
CAR_TYPE MILEAGE IDEAL_CAR_ID
0 FORD 100 FORD0
1 FORD 100 FORD0
2 FORD 200 FORD1
3 FORD 300 FORD2
4 VW 100 VW0
5 VW 150 VW1
6 VW 150 VW1
7 VW 300 VW2
Various outputs if the column is not sorted:
print (df)
CAR_TYPE MILEAGE
0 FORD 500
1 FORD 500
2 FORD 200
3 FORD 300
4 VW 100
5 VW 150
6 VW 150
7 VW 300
a = df.groupby(['CAR_TYPE'])['MILEAGE'].rank(method='dense') \
.sub(1).astype(int).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
CAR_TYPE MILEAGE IDEAL_CAR_ID
0 FORD 500 FORD2
1 FORD 500 FORD2
2 FORD 200 FORD0
3 FORD 300 FORD1
4 VW 100 VW0
5 VW 150 VW1
6 VW 150 VW1
7 VW 300 VW2
a = df.groupby(['CAR_TYPE'])['MILEAGE'] \
.transform(lambda x: pd.factorize(x)[0]).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
CAR_TYPE MILEAGE IDEAL_CAR_ID
0 FORD 500 FORD0
1 FORD 500 FORD0
2 FORD 200 FORD1
3 FORD 300 FORD2
4 VW 100 VW0
5 VW 150 VW1
6 VW 150 VW1
7 VW 300 VW2
source to share
Or you can try this.
df['CAR_ID']=df.CAR_TYPE+df1.groupby('CAR_TYPE')['MILEAGE'].apply(lambda x:x.astype('category').cat.codes).astype(str)
Out[21]:
CAR_TYPE MILEAGE CAR_ID
0 FORD 100 FORD0
1 FORD 100 FORD0
2 FORD 200 FORD1
3 FORD 300 FORD2
4 VW 100 VW0
5 VW 150 VW1
6 VW 150 VW1
7 VW 300 VW2
source to share
f1 = pd.factorize(list(zip(df.CAR_TYPE.values, df.MILEAGE.values)))[0]
f0 = pd.factorize(df.CAR_TYPE.values)[0] \
* (df.groupby('CAR_TYPE').CAR_TYPE.transform('count').values - 1)
df.assign(CAR_ID=df.CAR_TYPE.add((f1 - f0).astype(str)))
CAR_TYPE MILEAGE CAR_ID
0 FORD 100 FORD0
1 FORD 100 FORD0
2 FORD 200 FORD1
3 FORD 300 FORD2
4 VW 100 VW0
5 VW 150 VW1
6 VW 150 VW1
7 VW 300 VW2
source to share