Split different records into separate lines, allowing only certain combinations

I searched the web for an answer, but while similar questions were asked, they were not the same and I was unable to use them.

I have a pandas dataframe in which multiple columns of text strings contain, sometimes, comma separated values. I want to split each CSV field that contains these comma separated values ​​from three of them, creating a new line for each record, but in parallel, this way (with A, C and D, not using E):

In [10]: before 
Out[10]: 
    A        B    C            D         E
0  a1,a2,a3  1    c1, c2, c3   d1,d2,d3  e1,e2,e3
1  a4        2    c4           d4        e4

In [11]: after
Out[11]: 
      A    B     C    D     E
0    a1    1    c1   d1    e1,e2,e3
1    a2    1    c2   d2    e1,e2,e3
2    a3    1    c3   d3    e1,e2,e3
3    a4    2    c4   d4    e4

      

The problem is that it should work for different tables, and all of these tables can have different columns (but always these three, A, C and D, along with the same names). And in addition, other columns may have comma-separated values ​​in some cells that were not separated, as happened with E.

EDIT: These three columns will always have the same number of comma-separated values ​​on the same row, but can vary between rows (1 value, 2 comma-separated values, 3 comma-separated values ​​...).

Sorry, I would like to suggest some code, but I couldn't do anything useful.

Can anyone help me? I would really appreciate any help! :)

+3


source to share


1 answer


If you know that three columns always have the same number of comma separated values, you can do:

>>> df
          A  B         C         D         E
0  a1,a2,a3  1  c1,c2,c3  d1,d2,d3  e1,e2,e3
1        a4  2        c4        d4        e4

      

split each column:

>>> for col in ['A', 'C', 'D']:
...     df[col] = df[col].str.split(',')
... 
>>> df
              A  B             C             D         E
0  [a1, a2, a3]  1  [c1, c2, c3]  [d1, d2, d3]  e1,e2,e3
1          [a4]  2          [c4]          [d4]        e4

      

define indexers:



>>> i = df['A'].map(len)
>>> j = np.repeat(np.arange(len(df)), i)
>>> k = np.concatenate(list(map(np.arange, i)))

      

expand the frame:

>>> df = df.iloc[j]
>>> df
              A  B             C             D         E
0  [a1, a2, a3]  1  [c1, c2, c3]  [d1, d2, d3]  e1,e2,e3
0  [a1, a2, a3]  1  [c1, c2, c3]  [d1, d2, d3]  e1,e2,e3
0  [a1, a2, a3]  1  [c1, c2, c3]  [d1, d2, d3]  e1,e2,e3
1          [a4]  2          [c4]          [d4]        e4

      

take one from each list:

>>> for col in ['A', 'C', 'D']:
...     df[col] = list(map(lambda xs, i: xs[i], df[col], k))
... 
>>> df
    A  B   C   D         E
0  a1  1  c1  d1  e1,e2,e3
0  a2  1  c2  d2  e1,e2,e3
0  a3  1  c3  d3  e1,e2,e3
1  a4  2  c4  d4        e4

      

+3


source







All Articles