Pandas duplicated vs groupby to mark all duplicate values

Question

I have a fairly simple need that has come up in several other posts, but I'm not sure whether the best way to approach it is with groupby or with the duplicated method.

I have what I need below with duplicated, except that the first duplicate is marked as FALSE instead of TRUE. I want all duplicates to be TRUE.

My goal is to combine data from two columns together when it is a duplicate, otherwise leave the data as it is.

Input example:

ID  File Name
1   Text.csv
2   TEXT.csv
3   unique.csv
4   unique2.csv
5   text.csv

Desired output:

ID  File Name   LowerFileName   Duplicate   UniqueFileName
1   Text.csv    text.csv    TRUE    1Text.csv
2   TEXT.csv    text.csv    TRUE    2TEXT.csv
3   unique.csv  unique.csv  FALSE   unique.csv
4   unique2.csv unique2.csv FALSE   unique2.csv
5   text.csv    text.csv    TRUE    5text.csv


df_attachment = pd.read_csv("Attachment.csv")
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName')
#This syntax is incorrect 
df_attachment['UniqueFileName'] = np.where(df_attachment['Duplicate']=='TRUE', pd.concat(df_attachment['ID'],df_attachment['File Name']), df_attachment['File Name'))

+3

python pandas

EMC June 24. '15 at 3:38


4 answers

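The easiest way to get around this odd Pandas behaviour is to create a mask with df.duplicated(col_name) | df.duplicated(col_name, take_last=True) (from pandas 0.17.0 onwards, take_last=True is spelled keep='last'). The first call flags every occurrence except the first, the second flags every occurrence except the last, so the bitwise or of the two gives a series that is True for all duplicates.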

Then use that mask for indexing to set each value to either the original name or the new name with the ID number in front.

In your case:

import pandas as pd

# Generating your DataFrame
df_attachment = pd.DataFrame(index=range(5))
df_attachment['ID'] = [1, 2, 3, 4, 5]
df_attachment['File Name'] = ['Text.csv', 'TEXT.csv', 'unique.csv',
                             'unique2.csv', 'text.csv']
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()


# Answer from here, mask generation over two lines for readability
# (keep='last' was spelled take_last=True before pandas 0.17.0)
mask = df_attachment.duplicated('LowerFileName')
mask = mask | df_attachment.duplicated('LowerFileName', keep='last')
df_attachment['Duplicate'] = mask

# New column names if possible
df_attachment['number_name'] = df_attachment['ID'].astype(str) + df_attachment['File Name']

# Set the final unique name column using the mask already generated
df_attachment.loc[mask, 'UniqueFileName'] = df_attachment.loc[mask, 'number_name']
df_attachment.loc[~mask, 'UniqueFileName'] = df_attachment.loc[~mask, 'File Name']

# Drop the intermediate column used
del df_attachment['number_name']

And the final df_attachment:

    ID  File Name   LowerFileName   Duplicate   UniqueFileName
0   1   Text.csv    text.csv    True    1Text.csv
1   2   TEXT.csv    text.csv    True    2TEXT.csv
2   3   unique.csv  unique.csv  False   unique.csv
3   4   unique2.csv unique2.csv False   unique2.csv
4   5   text.csv    text.csv    True    5text.csv

This method uses Pandas vectorized operations and indexing, so it should be fast for any DataFrame size.

EDIT: 2017-03-28

Someone upvoted this yesterday, so I decided to edit it to say that this has been supported natively by Pandas since 0.17.0; see the changes here: http://pandas.pydata.org/pandas-docs/version/0.19.2/whatsnew.html#v0-17-0-october-9-2015

You can now use the keep argument of drop_duplicates and duplicated and set it to False to flag all duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html

So the lines above that generate the Duplicate column become:

df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
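Putting the pieces together, here is a minimal sketch of the whole task using keep=False and np.where, assuming pandas 0.17.0 or later. The DataFrame is rebuilt inline from the question's sample data rather than read from Attachment.csv, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built inline instead of read from a CSV
df_attachment = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv',
                  'unique2.csv', 'text.csv'],
})
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()

# keep=False marks every member of a duplicate group, including the first
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)

# Prefix the ID only where the lowercase name is duplicated
df_attachment['UniqueFileName'] = np.where(
    df_attachment['Duplicate'],
    df_attachment['ID'].astype(str) + df_attachment['File Name'],
    df_attachment['File Name'],
)
```

The condition passed to np.where is the boolean column itself, not a comparison against the string 'TRUE', and the ID is converted with astype(str) so it can be concatenated with the file name.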

+2

bastewart June 24. 15 at 8:49


In your use case, you would need to use groupby:

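# Flag each lowercase name that occurs more than once
# (uses the LowerFileName column from the question)
dupes = df_attachment.groupby('LowerFileName').ID.count() > 1
dupes.name = 'Duplicate'
# Merge the duplicate flag into the original DataFrame on the common column 'LowerFileName'
df_attachment = pd.merge(df_attachment, dupes.reset_index(), on='LowerFileName')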
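As a variation on the same idea, groupby can also flag the rows directly with transform, which broadcasts each group's count back onto its rows and so avoids the merge. This is only a sketch on the question's sample data and is not part of the original answer.

```python
import pandas as pd

# Sample data from the question
df_attachment = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv',
                  'unique2.csv', 'text.csv'],
})
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()

# transform('count') returns one value per row:
# the number of non-null IDs in that row's group
group_count = df_attachment.groupby('LowerFileName')['ID'].transform('count')
df_attachment['Duplicate'] = group_count > 1
```

Because transform keeps the original index, the result can be assigned straight to a new column, and every member of a duplicate group is flagged.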

0

maxymoo June 24. 15 at 3:53


Inspired by this answer, you could do the following (assuming your column File Name is renamed to file_name):

df['unique_name'] = df.file_name
dupes = df.file_name[df.file_name.str.lower().duplicated()]
unique_names = df.ID.astype(str) + df.file_name
df.loc[df.file_name.isin(dupes), 'unique_name'] = unique_names

Which gives you:

   ID    File Name  unique_name
0   1     Text.csv     Text.csv
1   2     TEXT.csv    2TEXT.csv
2   3   unique.csv   unique.csv
3   4  unique2.csv  unique2.csv
4   5     text.csv    5text.csv

0

LondonRob June 24. 15 at 19:34


Alexander · Accepted Answer · 2015-06-24T04:42:42+0000

Perhaps using groupby together with a lambda expression can achieve your goal:

gb = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = gb[gb > 1].index.tolist()
df['UniqueFileName'] = \
    df.apply(lambda x: '{0}{1}'.format(x.ID if x['Lower File Name'] in duplicates
                                       else "", x['File Name']), axis=1)

>>> df
   ID    File Name Lower File Name Duplicate   UniqueFileName
0   1     Text.csv        text.csv     False        1Text.csv
1   2     TEXT.csv        text.csv      True        2TEXT.csv
2   3   unique.csv      unique.csv     False      3unique.csv
3   4  unique2.csv     unique2.csv     False      unique2.csv
4   5     text.csv        text.csv      True        5text.csv
5   6   uniquE.csv      unique.csv      True      6uniquE.csv

The lambda expression generates a unique filename per the OP's requirements, prepending the matching ID to File Name only if Lower File Name is duplicated (i.e. more than one file exists with the same lowercase filename). Otherwise, it just uses File Name without the ID.

Please note that this solution does not use the Duplicate column in the DataFrame above.
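For larger frames, the same logic can probably be expressed without a row-wise apply. The sketch below is an assumption about how one might vectorize it, not the answer's own code; the six-row sample is reconstructed from the output table shown above.

```python
import pandas as pd

# Six-row sample reconstructed from the output shown in this answer
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv',
                  'unique2.csv', 'text.csv', 'uniquE.csv'],
})
df['Lower File Name'] = df['File Name'].str.lower()

# Lowercase names that occur more than once
counts = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = counts[counts > 1].index

# A boolean mask replaces the row-wise lambda
is_dupe = df['Lower File Name'].isin(duplicates)

# Keep File Name where the row is not a duplicate, otherwise prefix the ID
df['UniqueFileName'] = df['File Name'].where(
    ~is_dupe, df['ID'].astype(str) + df['File Name'])
```

Series.where keeps the original value wherever the condition is True and takes the replacement elsewhere, which is why the mask is negated.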

Also, wouldn't it be easier to just prepend the ID to Lower File Name to create a unique name? You wouldn't need the solution above, and you wouldn't even need to check for duplicates, assuming the ID is unique.
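That simpler alternative might look like the sketch below, which skips the duplicate check entirely and assumes ID is unique. The sample data is taken from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv',
                  'unique2.csv', 'text.csv'],
})
df['Lower File Name'] = df['File Name'].str.lower()

# Prefix every lowercase name with its ID, with no duplicate check
df['UniqueFileName'] = df['ID'].astype(str) + df['Lower File Name']
```

One caveat: plain concatenation can still collide in rare cases (ID 1 with '1x.csv' and ID 11 with 'x.csv' both give '11x.csv'), so a separator such as an underscore between the ID and the name may be worth adding.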
