Python pandas: concatenate dynamic columns, removing duplicates
We have a case where we need to concatenate all the column values in a row while removing duplicates. The data is stored in a pandas DataFrame. For example, consider the DataFrame df below with columns A, B, C:
A B C
X1 AX X1
X2 X2 X1
X3 X3 X3
X4 XX XX
I would like to add a new column that concatenates A, B, and C, removing any duplicates while keeping the order. The desired output would be:
A B C Newcol
X1 AX X1 X1_AX
X2 X2 X1 X2_X1
X3 X3 X3 X3
X4 XX XX X4_XX
Note that the number of columns is dynamic. At the moment I am doing this with the command
df.apply(lambda x: '-'.join(x.dropna().astype(str).drop_duplicates()),axis=1)
But this is very slow: it takes about 150 seconds on my data. Since 90% of my data frames have only 2 columns, I added an if statement and run the command below for the 2-column case:
t1=pd.Series(np.where(df.iloc[:,0].dropna().astype(str) != df.iloc[:,1].dropna().astype(str), df.iloc[:,0].dropna().astype(str)+"-"+df.iloc[:,1].dropna().astype(str),df.iloc[:,1].dropna().astype(str)))
which takes about 55.3 milliseconds
or even
t1=df.iloc[:,0].dropna().astype(str).where(df.iloc[:,0].dropna().astype(str) == df.iloc[:,1].dropna().astype(str), df.iloc[:,0].dropna().astype(str)+"-"+df.iloc[:,1].dropna().astype(str))
Both take nearly the same amount of time (about 55 ms, versus 150 seconds for the apply version), but the problem is that they only work for 2 columns. I would like to generalize this so it can handle n columns. I tried the following with reduce, but it raised an error when I used 3 columns:
reduce((lambda x,y:pd.Series(np.where(df.iloc[:,x].dropna().astype(str) != df.iloc[:,y].dropna().astype(str), df.iloc[:,x].dropna().astype(str)+"-"+df.iloc[:,y].dropna().astype(str),df.iloc[:,y].dropna().astype(str)))),list(range(df.shape[1])))
TypeError: '>=' not supported between instances of 'str' and 'int'
Note that df is actually a chunk of a multi-core parallel task, so it would be great if answers avoid parallelism.
Try
df['new'] = df.astype('str').apply(lambda x: '_'.join(set(x)), axis = 1)
A B C new
0 X1 AX X1 AX_X1
1 X2 X2 X1 X1_X2
2 X3 X3 X3 X3
3 X4 XX XX X4_XX
EDIT: Preserve order of column values
def my_append(x):
    l = []
    for elm in x:
        if elm not in l:
            l.append(elm)
    return '_'.join(l)
df['New col']=df.astype('str').apply(my_append, axis = 1)
1000 loops, best of 3: 871 Β΅s per loop
Returns
A B C New col
0 X1 AX X1 X1_AX
1 X2 X2 X1 X2_X1
2 X3 X3 X3 X3
3 X4 XX XX X4_XX
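On Python 3.7+, plain dicts preserve insertion order, so the loop above can be collapsed into a one-liner. A minimal sketch of the same order-preserving dedup using dict.fromkeys (not the answer's original code, but equivalent on this data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['X1', 'X2', 'X3', 'X4'],
                   'B': ['AX', 'X2', 'X3', 'XX'],
                   'C': ['X1', 'X1', 'X3', 'XX']})

# dict.fromkeys keeps first-seen order while dropping duplicate keys
df['New col'] = df.astype(str).apply(
    lambda row: '_'.join(dict.fromkeys(row)), axis=1)
print(df['New col'].tolist())  # ['X1_AX', 'X2_X1', 'X3', 'X4_XX']
```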
EDIT 1: If you have NaN in any column, like this:
A B C
0 X1 AX X1
1 X2 X2 X1
2 X3 X3 X3
3 NaN XX XX
Handle this in a function and then apply
def my_append(x):
    l = []
    for elm in x:
        if elm not in l:
            l.append(elm)
    l = [x for x in l if str(x) != 'nan']
    return '_'.join(l)
df['New col']=df.astype('str').apply(my_append, axis = 1)
A B C New col
0 X1 AX X1 X1_AX
1 X2 X2 X1 X2_X1
2 X3 X3 X3 X3
3 NaN XX XX XX
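An alternative to string-matching 'nan' is to drop missing values before converting to strings. A hedged sketch of that variant, using Series.dropna on each row (my own rewrite, not the answer's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 'X2'],
                   'B': ['XX', 'X2'],
                   'C': ['XX', 'X1']})

# dropna() removes the NaN before astype(str) can turn it into the string 'nan'
df['New col'] = df.apply(
    lambda row: '_'.join(dict.fromkeys(row.dropna().astype(str))), axis=1)
print(df['New col'].tolist())  # ['XX', 'X2_X1']
```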
pd.unique does not sort, so it preserves the order of first appearance. Use it inside a list comprehension:
df.assign(new_col=['_'.join(pd.unique(row)) for row in df.values])
A B C new_col
0 X1 AX X1 X1_AX
1 X2 X2 X1 X2_X1
2 X3 X3 X3 X3
3 X4 XX XX X4_XX
Handle NaN
df.assign(new_col=[
'_'.join(pd.unique([i for i in row if pd.notnull(i)])) for row in df.values
])
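Putting this answer together as a runnable check on the example data (with the assign typo fixed; assumes a recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['X1', 'X2', 'X3', np.nan],
                   'B': ['AX', 'X2', 'X3', 'XX'],
                   'C': ['X1', 'X1', 'X3', 'XX']})

# pd.unique preserves first-seen order, unlike set(); pd.notnull filters NaN
out = df.assign(new_col=[
    '_'.join(pd.unique([i for i in row if pd.notnull(i)])) for row in df.values
])
print(out['new_col'].tolist())  # ['X1_AX', 'X2_X1', 'X3', 'XX']
```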