Pandas DataFrame - concatenating values of one column with the same index into a list
I've been stuck on this for a while. This is almost a duplicate of at least one other question here, but I can't figure out how to do what I'm looking for from the linked answers.
I have a Pandas DataFrame (we'll call it df) that looks something like this:
Name Value Value2
'A' '8.8.8.8' 'x'
'B' '6.6.6.6' 'y'
'A' '6.6.6.6' 'x'
'A' '8.8.8.8' 'x'
where Name is the index. I want to convert this to something like this:
Name Value Value2
'A' ['8.8.8.8', '6.6.6.6'] 'x'
'B' ['6.6.6.6'] 'y'
So basically, every Value corresponding to the same index should be concatenated into a list (or set or tuple), and that list should become the Value for that index. As shown, Value2 is the same for all rows with the same index, so in the end it should remain unchanged.
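For reference, the transformation described above can be sketched with groupby plus a list aggregation (this is my own sketch, assuming the sample data shown above, not code from the linked question):

```python
import pandas as pd

# Rebuild the sample DataFrame with 'Name' as the index.
df = pd.DataFrame(
    {"Name": ["A", "B", "A", "A"],
     "Value": ["8.8.8.8", "6.6.6.6", "6.6.6.6", "8.8.8.8"],
     "Value2": ["x", "y", "x", "x"]}
).set_index("Name")

# Group on the index level plus Value2, then collect the unique
# Values of each group into a plain Python list.
result = (df.groupby(["Name", "Value2"])["Value"]
            .agg(lambda s: list(s.unique()))
            .reset_index())
print(result)
```

Grouping on Value2 as well keeps it intact in the output, since it is constant per index.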
All I've managed to do so far is figure out how to wrap each item in the Value column in a list, with:
df['Value'] = pd.Series([[val] for val in df['Value']])
In the question I linked at the beginning of this post, the recommended way to join columns with duplicate indexes is df.groupby(df.index).sum(). I know I need something besides df.index as the argument to groupby, since the Value column has to be treated specially, and I'm not sure what to use in place of sum(), since summing is not what I'm looking for.
Hopefully it's clear what I'm looking for; let me know if there's anything I can elaborate on. I also tried just looping through the DataFrame itself, finding rows with the same index, concatenating the Values into a list and updating df accordingly. After a little work on that approach, I decided there must be a more Pandas-friendly way of solving this problem.
Edit: In response to the answer below, this kind of solution worked. It got the Values into lists correctly. I found out that unique returns a Series, not a DataFrame. Also, in my actual setup I have more columns than just Name, Value, and Value2. But I think I managed to get around both issues with the following:
gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)
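A self-contained version of the idea in those two lines, grouping by every column except Value and collecting the unique Values per group (a sketch using the sample data from this post; in newer pandas, groupby wants a list of column names rather than a tuple):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["A", "B", "A", "A"],
     "Value": ["8.8.8.8", "6.6.6.6", "6.6.6.6", "8.8.8.8"],
     "Value2": ["x", "y", "x", "x"]})

# Group by every column except 'Value'; this generalizes to any
# number of extra columns like Value2.
group_cols = [c for c in df.columns if c != "Value"]

# unique() returns one array of distinct Values per group;
# reset_index turns the group keys back into regular columns.
result = df.groupby(group_cols)["Value"].unique().reset_index()
print(result)
```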
The first line calls groupby with the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.
With all of this in place (unless someone sees a problem with it), almost everything works as intended. However, something still looks a little off. When I write this to a file with to_csv, there are duplicate headers at the top (only some headers are duplicated, and there's no real pattern as far as I can tell). Also, the Value lists are truncated, which is probably an easier problem to fix. The output csv looks like this:
Name Value Value2 Name Value2
'A' ['8.8.8.8' '7.7.7.7' 'x'
'B' ['6.6.6.6'] 'y'
The above looks strange, but that is exactly what the output looks like. Note that unlike the example at the beginning of this post, A here is assumed to have more than two Values (so I can illustrate this point). With my actual data, the Value lists are truncated after the first 4 items.
I think you want to use pandas.Series.unique. First, turn the index back into a regular 'Name' column:
df
# Value2 Value
#Name
#A x 8.8
#B y 6.6
#A x 6.6
#A x 8.8
df.reset_index(inplace=True)
# Name Value2 Value
#0 A x 8.8
#1 B y 6.6
#2 A x 6.6
#3 A x 8.8
Next, call groupby and then call unique on the 'Value' series:
gb = df.groupby(['Name', 'Value2'])
result = gb['Value'].unique()
result = result.reset_index() #lastly, reset the index
# Name Value2 Value
#0 A x [8.8, 6.6]
#1 B y [6.6]
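The steps above can be run end to end as one script (a sketch using the abbreviated sample values from this answer):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "A", "A"],
                   "Value2": ["x", "y", "x", "x"],
                   "Value": [8.8, 6.6, 6.6, 8.8]}).set_index("Name")

df = df.reset_index()                        # make 'Name' a regular column
gb = df.groupby(["Name", "Value2"])          # group on both key columns
result = gb["Value"].unique().reset_index()  # unique Values per group
print(result)
```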
Finally, if you want 'Name' to be the index again, just do
result.set_index('Name', inplace=True)
# Value2 Value
#Name
#A x [8.8, 6.6]
#B y [6.6]
UPDATE
As a follow up, make sure you reassign the result after resetting the index
result = gb['Value'].unique()
type(result)
#pandas.core.series.Series
result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame
Saving as CSV (or rather, TSV)
You don't want to use CSV here, because the Value column contains commas. Instead, save as TSV; you still use the same to_csv method, just change the sep argument:
result.to_csv('result.txt', sep='\t')
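Alternatively (my own suggestion, not part of the answer above), you can sidestep the delimiter clash entirely by joining each list into a single delimited string before writing:

```python
import pandas as pd

# Example result with list-valued 'Value' cells, as produced above.
result = pd.DataFrame(
    {"Name": ["A", "B"],
     "Value2": ["x", "y"],
     "Value": [["8.8.8.8", "6.6.6.6"], ["6.6.6.6"]]})

# Join each list into one semicolon-delimited string so the cell
# no longer contains the CSV field separator.
result["Value"] = result["Value"].map(";".join)
result.to_csv("result.txt", sep="\t", index=False)
print(result["Value"].tolist())
```

The semicolon is an arbitrary choice; anything that cannot occur inside a Value works.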
If I load result.txt into Excel as a TSV I get: (screenshot omitted)