Factorize with pandas on two columns

I have some data in a pandas frame that looks like this:

CAR_TYPE   MILEAGE
FORD       100     
FORD       100    
FORD       200    
FORD       300
VW         100     
VW         150
VW         150
VW         300

      

I want to "parse" the data to return a unique ID for each pair. However, I want the unique id "reset" to be zero for individual cars. Currently my factorization uses the following:

df['CAR_ID']=pd.factorize(pd.lib.fast_zip([df.CAR_TYPE.values, df.MILEAGE.values]))[0] 
df.CAR_ID=df[['CAR_ID', 'CAR_TYPE']].astype(str).apply(lambda x: ''.join(x), axis=1)

      

Gives me something like

CAR_TYPE   MILEAGE     CAR_ID  
FORD       100         FORD0
FORD       100         FORD0
FORD       200         FORD1
FORD       300         FORD2
VW         100         VW3
VW         150         VW4
VW         150         VW4
VW         300         VW5

      

Ideally I would like

CAR_TYPE   MILEAGE     IDEAL_CAR_ID  
FORD       100         FORD0
FORD       100         FORD0
FORD       200         FORD1
FORD       300         FORD2
VW         100         VW0
VW         150         VW1
VW         150         VW1
VW         300         VW2

      

Apologies for the relatively stupid question in minds after a long day. I know its something that could be solved with a stack / stack, reset_index / set_index.

+3


source to share


3 answers


You can use groupby

with rank

if the values ​​in MILEAGE

are sorted by each group:

a = df.groupby(['CAR_TYPE'])['MILEAGE'].rank(method='dense') \
      .sub(1).astype(int).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)

  CAR_TYPE  MILEAGE IDEAL_CAR_ID
0     FORD      100        FORD0
1     FORD      100        FORD0
2     FORD      200        FORD1
3     FORD      300        FORD2
4       VW      100          VW0
5       VW      150          VW1
6       VW      150          VW1
7       VW      300          VW2

      

Another solution with factorize

:

a = df.groupby(['CAR_TYPE'])['MILEAGE'] \
       .transform(lambda x: pd.factorize(x)[0]).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
  CAR_TYPE  MILEAGE IDEAL_CAR_ID
0     FORD      100        FORD0
1     FORD      100        FORD0
2     FORD      200        FORD1
3     FORD      300        FORD2
4       VW      100          VW0
5       VW      150          VW1
6       VW      150          VW1
7       VW      300          VW2

      




Various outputs if the column is not sorted:

print (df)
  CAR_TYPE  MILEAGE
0     FORD      500
1     FORD      500
2     FORD      200
3     FORD      300
4       VW      100
5       VW      150
6       VW      150
7       VW      300

a = df.groupby(['CAR_TYPE'])['MILEAGE'].rank(method='dense') \
      .sub(1).astype(int).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
  CAR_TYPE  MILEAGE IDEAL_CAR_ID
0     FORD      500        FORD2
1     FORD      500        FORD2
2     FORD      200        FORD0
3     FORD      300        FORD1
4       VW      100          VW0
5       VW      150          VW1
6       VW      150          VW1
7       VW      300          VW2

      




a = df.groupby(['CAR_TYPE'])['MILEAGE'] \
       .transform(lambda x: pd.factorize(x)[0]).astype(str)
df['IDEAL_CAR_ID'] = df['CAR_TYPE'].add(a)
print (df)
  CAR_TYPE  MILEAGE IDEAL_CAR_ID
0     FORD      500        FORD0
1     FORD      500        FORD0
2     FORD      200        FORD1
3     FORD      300        FORD2
4       VW      100          VW0
5       VW      150          VW1
6       VW      150          VW1
7       VW      300          VW2

      

+3


source


Or you can try this.



df['CAR_ID']=df.CAR_TYPE+df1.groupby('CAR_TYPE')['MILEAGE'].apply(lambda x:x.astype('category').cat.codes).astype(str)


Out[21]: 
  CAR_TYPE  MILEAGE CAR_ID 
0     FORD      100   FORD0
1     FORD      100   FORD0
2     FORD      200   FORD1
3     FORD      300   FORD2
4       VW      100     VW0
5       VW      150     VW1
6       VW      150     VW1
7       VW      300     VW2

      

+2


source


f1 = pd.factorize(list(zip(df.CAR_TYPE.values, df.MILEAGE.values)))[0]
f0 = pd.factorize(df.CAR_TYPE.values)[0] \
    * (df.groupby('CAR_TYPE').CAR_TYPE.transform('count').values - 1)

df.assign(CAR_ID=df.CAR_TYPE.add((f1 - f0).astype(str)))

  CAR_TYPE  MILEAGE CAR_ID
0     FORD      100  FORD0
1     FORD      100  FORD0
2     FORD      200  FORD1
3     FORD      300  FORD2
4       VW      100    VW0
5       VW      150    VW1
6       VW      150    VW1
7       VW      300    VW2

      

+2


source







All Articles