Pandas duplicated vs groupby to mark all duplicate values
I have a fairly simple need that has come up in several other posts, but I'm not sure whether the best way to approach it is with groupby or with the duplicated method.
I have what I need below with duplicated, except that the first duplicate is marked as FALSE instead of TRUE. I want all duplicates to be TRUE.
My goal is to combine data from two columns together when it is a duplicate, otherwise leave the data as it is.
Input example:
ID File Name
1 Text.csv
2 TEXT.csv
3 unique.csv
4 unique2.csv
5 text.csv
Desired output:
ID File Name LowerFileName Duplicate UniqueFileName
1 Text.csv text.csv TRUE 1Text.csv
2 TEXT.csv text.csv TRUE 2TEXT.csv
3 unique.csv unique.csv FALSE unique.csv
4 unique2.csv unique2.csv FALSE unique2.csv
5 text.csv text.csv TRUE 5text.csv
df_attachment = pd.read_csv("Attachment.csv")
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName')
#This syntax is incorrect
df_attachment['UniqueFileName'] = np.where(df_attachment['Duplicate']=='TRUE', pd.concat(df_attachment['ID'],df_attachment['File Name']), df_attachment['File Name'))
Perhaps using groupby together with a lambda expression can achieve your goal:
gb = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = gb[gb > 1].index.tolist()
df['UniqueFileName'] = df.apply(
    lambda x: '{0}{1}'.format(
        x.ID if x['Lower File Name'] in duplicates else "",
        x['File Name']),
    axis=1)
>>> df
ID File Name Lower File Name Duplicate UniqueFileName
0 1 Text.csv text.csv False 1Text.csv
1 2 TEXT.csv text.csv True 2TEXT.csv
2 3 unique.csv unique.csv False 3unique.csv
3 4 unique2.csv unique2.csv False unique2.csv
4 5 text.csv text.csv True 5text.csv
5 6 uniquE.csv unique.csv True 6uniquE.csv
The lambda expression generates a unique filename per the OP's requirements, prepending the ID to File Name only if Lower File Name is duplicated (i.e. more than one file exists with the same lowercase filename). Otherwise, it just uses the original File Name without the ID.
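The same flag can also be computed without a row-wise apply. The following is a minimal sketch, assuming the column names used in this answer and a small hand-built DataFrame that mirrors the question's example (the variable names group_size and is_dup are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data, mirroring the question's example
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv',
                  'unique2.csv', 'text.csv'],
})
df['Lower File Name'] = df['File Name'].str.lower()

# Size of each group, broadcast back to every row of that group
group_size = df.groupby('Lower File Name')['Lower File Name'].transform('size')
is_dup = group_size > 1

# Prepend the ID only where the lowercase name occurs more than once
df['UniqueFileName'] = np.where(
    is_dup, df['ID'].astype(str) + df['File Name'], df['File Name'])
```

Because transform returns a Series aligned with the original rows, the comparison yields a boolean mask directly, so no intermediate list of duplicate names is needed.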
Please note that this solution does not use the Duplicate column in the above DataFrame.
Also, would it be easier to just prepend the ID to Lower File Name to create a unique name? You wouldn't need the solution above, and you wouldn't even need to check for duplicates, assuming the ID is unique.
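A minimal sketch of that simpler idea, assuming ID is unique and using a small illustrative DataFrame (swap in File Name instead of the lowercase name if the original casing should be preserved):

```python
import pandas as pd

# Illustrative data; ID is assumed to be unique
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv'],
})
df['Lower File Name'] = df['File Name'].str.lower()

# Prefix every name with its ID; no duplicate check is needed
df['UniqueFileName'] = df['ID'].astype(str) + df['Lower File Name']
```

The trade-off is that every file gets renamed, including those whose names were already unique.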
The easiest way to get around this odd Pandas functionality is to create a mask with df.duplicated(col_name) | df.duplicated(col_name, take_last=True) (in pandas 0.17.0 and later, take_last=True is spelled keep='last'). The bitwise or means the series you create is True for all duplicates.
Then use the mask as an index to set the values you want, either the original name or the new name with the ID number in front.
For your case, see below:
import pandas as pd

# Generating your DataFrame
df_attachment = pd.DataFrame(index=range(5))
df_attachment['ID'] = [1, 2, 3, 4, 5]
df_attachment['File Name'] = ['Text.csv', 'TEXT.csv', 'unique.csv',
'unique2.csv', 'text.csv']
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
# Answer from here, mask generation over two lines for readability
mask = df_attachment.duplicated('LowerFileName')
# keep='last' replaces take_last=True, which was deprecated in pandas 0.17.0
mask = mask | df_attachment.duplicated('LowerFileName', keep='last')
df_attachment['Duplicate'] = mask
# Candidate new names: the ID prepended to the file name
df_attachment['number_name'] = df_attachment['ID'].astype(str) + df_attachment['File Name']
# Set the final unique name column using the mask already generated
df_attachment.loc[mask, 'UniqueFileName'] = df_attachment.loc[mask, 'number_name']
df_attachment.loc[~mask, 'UniqueFileName'] = df_attachment.loc[~mask, 'File Name']
# Drop the intermediate column used
del df_attachment['number_name']
And the final df_attachment:
ID File Name LowerFileName Duplicate UniqueFileName
0 1 Text.csv text.csv True 1Text.csv
1 2 TEXT.csv text.csv True 2TEXT.csv
2 3 unique.csv unique.csv False unique.csv
3 4 unique2.csv unique2.csv False unique2.csv
4 5 text.csv text.csv True 5text.csv
This method uses Pandas vectorized operations and indexing, so it should be fast for any DataFrame size.
EDIT: 2017-03-28
Someone voted on this yesterday, so I decided to update it to say that this has been supported natively by Pandas since 0.17.0; see the changes here: http://pandas.pydata.org/pandas-docs/version/0.19.2/whatsnew.html#v0-17-0-october-9-2015
Now you can use the keep argument of drop_duplicates and duplicated and set it to False to flag all duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
So the lines above that generate the Duplicate column become:
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
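With keep=False, the line marked as incorrect in the question can be written as a single np.where call. This is a sketch under the assumption that the data matches the question's example; the DataFrame is built by hand here rather than read from Attachment.csv:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the question's Attachment.csv
df_attachment = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'File Name': ['Text.csv', 'TEXT.csv', 'unique.csv',
                  'unique2.csv', 'text.csv'],
})
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()

# keep=False flags every member of a duplicate group, including the first
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)

# Prepend the ID where the row is a duplicate, otherwise keep the name as-is
df_attachment['UniqueFileName'] = np.where(
    df_attachment['Duplicate'],
    df_attachment['ID'].astype(str) + df_attachment['File Name'],
    df_attachment['File Name'],
)
```

Note that the condition is the boolean column itself; comparing it to the string 'TRUE', as in the question, would never match.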
Inspired by this answer, you could do (assuming your File Name column is renamed to file_name):
df['unique_name'] = df.file_name
dupes = df.file_name[df.file_name.str.lower().duplicated()]
unique_names = df.ID.astype(str) + df.file_name
df.loc[df.file_name.isin(dupes), 'unique_name'] = unique_names
Which gives you:
ID File Name unique_name
0 1 Text.csv Text.csv
1 2 TEXT.csv 2TEXT.csv
2 3 unique.csv unique.csv
3 4 unique2.csv unique2.csv
4 5 text.csv 5text.csv