Pandas DataFrame: concatenating the values of one column that share an index into a list

I have been stuck on this issue for a while. This is almost a duplicate of at least one other question here, but I can't figure out how to do what I'm looking for from the linked answers.

I have a Pandas DataFrame (we'll call it df) that looks something like this:

Name    Value        Value2
'A'     '8.8.8.8'    'x'
'B'     '6.6.6.6'    'y'
'A'     '6.6.6.6'    'x'
'A'     '8.8.8.8'    'x'

      

where Name is the index. I want to convert it to something like this:

Name    Value                     Value2
'A'     ['8.8.8.8', '6.6.6.6']    'x'
'B'     ['6.6.6.6']               'y'

      

So basically, every Value corresponding to the same index should be concatenated into a list (or set, or tuple), and that list should become the Value for that index. As shown, Value2 is identical across rows with the same index, so it should stay the same in the end.

All I have managed to do (successfully) so far is to turn each item in the Value column into a one-element list with:

df['Value'] = pd.Series([[val] for val in df['Value']])

      

In the question I linked at the beginning of this post, the recommended way to combine rows with duplicate indexes is a solution using df.groupby(df.index).sum(). I know I need something besides df.index as the argument to groupby, since the Value column is treated specially, and I'm not sure what to use in place of sum(), as that is not exactly what I'm looking for.

I hope it's clear what I'm looking for; let me know if there is anything I can elaborate on. I also tried simply looping through the DataFrame myself, finding rows with the same index, concatenating their Values into a list and updating df accordingly. After working on that approach for a bit, I figured I should look for a more Pandas-friendly way of solving this problem.


Edit: Following the answer below, that kind of solution worked. It seems to get the Values into a list correctly. I realized that the unique function returns a Series, not a DataFrame. Also, in my actual setup I have more columns than just Name, Value and Value2. But I think I managed to get around both issues with the following:

gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)

      

Here the first line passes groupby the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.

With all of this in place (unless someone sees a problem with it), almost everything works as intended. However, something still looks a little off. When I try to write the result to a file with to_csv, there are duplicate headers at the top (but only some headers are duplicated, and there is no real pattern as far as I can tell). Also, the Value lists are truncated, which is probably an easier problem to fix. The CSV output looks like this:

Name    Value                   Value2    Name    Value2
'A'     ['8.8.8.8' '7.7.7.7'    'x'                     
'B'     ['6.6.6.6']             'y'

      

The above looks strange, but this is exactly what the output looks like. Note that, unlike the example at the beginning of this post, A is assumed here to have more than 2 Values (so I can illustrate this point). When I do this with the actual data, the Value lists are truncated after the first 4 items.



1 answer


I think you want to use pandas.Series.unique. First, turn the index into a 'Name' column:

df
#     Value2  Value
#Name              
#A         x    8.8
#B         y    6.6
#A         x    6.6
#A         x    8.8

df.reset_index(inplace=True)
#  Name Value2  Value
#0    A      x    8.8
#1    B      y    6.6
#2    A      x    6.6
#3    A      x    8.8

      

Next, call groupby and then call unique on the 'Value' series:

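gb = df.groupby(['Name', 'Value2'])  # pass a list of keys; newer pandas treats a tuple as one key
result = gb['Value'].unique()        # Series holding the unique Values of each group
result = result.reset_index()        # lastly, reset the index (this returns a new DataFrame)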
#  Name Value2       Value
#0    A      x  [8.8, 6.6]
#1    B      y       [6.6]

      

Finally, if you want 'Name' to be the index again, just do

result.set_index( 'Name', inplace=True)
#     Value2       Value
#Name                   
#A         x  [8.8, 6.6]
#B         y       [6.6]
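
Putting the steps above together on the string data from the question, a minimal sketch might look like the following. The frame is rebuilt by hand from the question's table, and the extra .apply(list) step is my own addition (an assumption about what you want) to turn the NumPy arrays that unique returns into plain Python lists.

```python
import pandas as pd

# Sample data copied from the question; Name is the index.
df = pd.DataFrame(
    {
        "Value": ["8.8.8.8", "6.6.6.6", "6.6.6.6", "8.8.8.8"],
        "Value2": ["x", "y", "x", "x"],
    },
    index=pd.Index(["A", "B", "A", "A"], name="Name"),
)

# Move Name into a column, group on Name and Value2, and collect the
# distinct Values of each group. unique() yields arrays, so each one is
# converted to a plain list (optional step, added for illustration).
result = (
    df.reset_index()
    .groupby(["Name", "Value2"])["Value"]
    .unique()
    .apply(list)
    .reset_index()
    .set_index("Name")
)

print(result)
```

The values in each list keep the order in which they first appear, since unique does not sort.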

      

UPDATE



As a follow-up, make sure you reassign the result after resetting the index, because unique returns a Series and reset_index turns it into a DataFrame:

result = gb['Value'].unique()
type(result)
#pandas.core.series.Series

result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame
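
The same pattern should carry over to the wider frame mentioned in the question, where there are more columns than Name, Value and Value2. Below is a rough sketch under some assumptions: Name has already been turned into a regular column, every column other than Value is constant within a group, and Value3 is a made-up column used only to stand in for the extra ones.

```python
import pandas as pd

# Hypothetical frame: "Value3" is invented here to mimic the extra columns.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "A", "A"],
        "Value": ["8.8.8.8", "6.6.6.6", "6.6.6.6", "8.8.8.8"],
        "Value2": ["x", "y", "x", "x"],
        "Value3": [1, 2, 1, 1],
    }
)

# Every column except Value becomes a grouping key; pass the keys as a list.
keys = [col for col in df.columns if col != "Value"]

grouped = df.groupby(keys)["Value"].unique()  # a Series with a MultiIndex
wide_result = grouped.reset_index()           # reassigned: now a DataFrame

print(type(grouped).__name__, type(wide_result).__name__)
print(wide_result.columns.tolist())
```

Because reset_index already restores the key columns, there is no need to wrap the Series in pd.DataFrame(..., columns=df.columns).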

      

Saving as CSV (or rather, TSV)

You don't want to use CSV here because the entries in the Value column contain commas. Save as TSV instead; you still use the same to_csv method and just change the sep argument:

result.to_csv( 'result.txt', sep='\t')

      

If I load result.txt in Excel as TSV, I get:

[image not preserved]
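
If you also need to read the file back into pandas, one option is to flatten each list into a single delimited string before writing. This is only a sketch under my own assumptions: the ';' delimiter is arbitrary (it must not occur in the values), the small frame is rebuilt by hand, and an in-memory buffer stands in for result.txt.

```python
import io

import pandas as pd

# Hypothetical grouped result, shaped like the one produced above.
result = pd.DataFrame(
    {
        "Name": ["A", "B"],
        "Value2": ["x", "y"],
        "Value": [["8.8.8.8", "6.6.6.6"], ["6.6.6.6"]],
    }
)

# Assumption: ';' never appears inside a value, so joining on it is lossless.
out = result.assign(Value=result["Value"].apply(";".join))

buffer = io.StringIO()                     # stands in for 'result.txt'
out.to_csv(buffer, sep="\t", index=False)  # same to_csv call, tab separator

# Read the TSV back and split the joined strings into lists again.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
restored["Value"] = restored["Value"].str.split(";")

print(restored)
```

Each cell is then an ordinary string in the file, so nothing depends on how a list or array happens to be printed.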
