Python pandas: concatenate dynamic columns while removing duplicates

We have a case where we need to concatenate all the column values in a row while removing duplicates. The data is stored in a pandas DataFrame. For example, consider the DataFrame df below with columns A, B, C:

A   B   C   
X1  AX  X1
X2  X2  X1
X3  X3  X3
X4  XX  XX


I would like to add a new column that concatenates A, B and C, removing any duplicates found while keeping the order. The desired output would look like

A   B   C   Newcol

X1  AX  X1  X1_AX
X2  X2  X1  X2_X1
X3  X3  X3  X3
X4  XX  XX  X4_XX


Note that the number of columns is dynamic. At the moment I am doing this with the command

df.apply(lambda x: '_'.join(x.dropna().astype(str).drop_duplicates()), axis=1)


But this is very slow, taking about 150 seconds on my data. Since 90% of my data frames have only 2 columns, I added an if statement and run the command below for the two-column case:

t1 = pd.Series(np.where(df.iloc[:, 0].dropna().astype(str) != df.iloc[:, 1].dropna().astype(str),
                        df.iloc[:, 0].dropna().astype(str) + "_" + df.iloc[:, 1].dropna().astype(str),
                        df.iloc[:, 1].dropna().astype(str)))


which takes about 55.3 milliseconds

or even

t1 = df.iloc[:, 0].dropna().astype(str).where(df.iloc[:, 0].dropna().astype(str) == df.iloc[:, 1].dropna().astype(str),
                                              df.iloc[:, 0].dropna().astype(str) + "_" + df.iloc[:, 1].dropna().astype(str))


Both take roughly the same time (about 55 ms, versus 150 seconds for apply), but they only work for 2 columns. I would like a generalized version that can handle n columns. I tried folding the trick above with functools.reduce, but it raised an error when I tried 3 columns:

reduce((lambda x, y: pd.Series(np.where(df.iloc[:, x].dropna().astype(str) != df.iloc[:, y].dropna().astype(str),
                                        df.iloc[:, x].dropna().astype(str) + "_" + df.iloc[:, y].dropna().astype(str),
                                        df.iloc[:, y].dropna().astype(str)))),
       list(range(df.shape[1])))


TypeError: '>=' not supported between instances of 'str' and 'int'

Note that df is actually a chunk of a larger multi-core parallel task, so proposals that avoid further parallelism would be ideal.
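
For reference (my own sketch, not from the answers below): the TypeError comes from reduce feeding the accumulated result Series back in as x, which the lambda then uses as an integer column index in df.iloc[:, x]. A vectorized n-column version also has to compare each column against every earlier column, not just its neighbour (row 0 has X1 in both A and C). A sketch assuming string columns with no NaN, using the example frame:

```python
import numpy as np
import pandas as pd

# example frame from the question
df = pd.DataFrame({'A': ['X1', 'X2', 'X3', 'X4'],
                   'B': ['AX', 'X2', 'X3', 'XX'],
                   'C': ['X1', 'X1', 'X3', 'XX']})

cols = [df.iloc[:, i].astype(str) for i in range(df.shape[1])]

# start from the first column; append each later column only on rows
# where its value does not repeat any earlier column's value
result = cols[0].copy()
seen = [cols[0]]
for col in cols[1:]:
    dup = np.zeros(len(df), dtype=bool)
    for prev in seen:
        dup |= (col.values == prev.values)
    result = pd.Series(np.where(dup, result, result + '_' + col))
    seen.append(col)

print(result.tolist())  # ['X1_AX', 'X2_X1', 'X3', 'X4_XX']
```

This does O(n²) vectorized column comparisons but never loops over rows, so for small n it should stay close to the two-column timings.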



2 answers


Try

df['new'] = df.astype('str').apply(lambda x: '_'.join(set(x)), axis = 1)

    A   B   C   new
0   X1  AX  X1  AX_X1
1   X2  X2  X1  X1_X2
2   X3  X3  X3  X3
3   X4  XX  XX  X4_XX


EDIT: Preserve order of column values

def my_append(x):
    l = []
    for elm in x:
        if elm not in l:
            l.append(elm)
    return '_'.join(l)


df['New col']=df.astype('str').apply(my_append, axis = 1)

1000 loops, best of 3: 871 µs per loop


Returns



    A   B   C   New col
0   X1  AX  X1  X1_AX
1   X2  X2  X1  X2_X1
2   X3  X3  X3  X3
3   X4  XX  XX  X4_XX
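
As a side note (my addition, not part of the original answer): dict.fromkeys gives the same order-preserving deduplication in one line, since dict keys keep insertion order in Python 3.7+:

```python
import pandas as pd

df = pd.DataFrame({'A': ['X1', 'X2', 'X3', 'X4'],
                   'B': ['AX', 'X2', 'X3', 'XX'],
                   'C': ['X1', 'X1', 'X3', 'XX']})

# dict.fromkeys deduplicates while keeping the first occurrence of
# each value, matching the explicit loop in my_append
df['New col'] = df.astype(str).apply(lambda x: '_'.join(dict.fromkeys(x)), axis=1)
print(df['New col'].tolist())  # ['X1_AX', 'X2_X1', 'X3', 'X4_XX']
```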


EDIT 1: If you have NaN in any column, like this

    A   B   C
0   X1  AX  X1
1   X2  X2  X1
2   X3  X3  X3
3   NaN XX  XX


Handle this in a function and then apply

def my_append(x):
    l = []
    for elm in x:
        if elm not in l:
            l.append(elm)
    l = [x for x in l if str(x) != 'nan']
    return '_'.join(l)

df['New col']=df.astype('str').apply(my_append, axis = 1)


    A   B   C   New col
0   X1  AX  X1  X1_AX
1   X2  X2  X1  X2_X1
2   X3  X3  X3  X3
3   NaN XX  XX  XX




pd.unique does not sort, so it preserves first-occurrence order. Use it wrapped in a list comprehension:

df.assign(new_col=['_'.join(pd.unique(row)) for row in df.values])

    A   B   C new_col
0  X1  AX  X1   X1_AX
1  X2  X2  X1   X2_X1
2  X3  X3  X3      X3
3  X4  XX  XX   X4_XX
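
A quick check (my addition) that pd.unique keeps first-occurrence order, which is what makes the comprehension above match the desired output:

```python
import numpy as np
import pandas as pd

# the values of row 0 from the example: X1 appears in both A and C
row = np.array(['X1', 'AX', 'X1'], dtype=object)
print(pd.unique(row).tolist())  # ['X1', 'AX']
```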





Handle NaN

df.assign(new_col=[
        '_'.join(pd.unique([i for i in row if pd.notnull(i)])) for row in df.values
    ])



