Split different records into separate lines, allowing only certain combinations
I searched the web for an answer, but while similar questions were asked, they were not the same and I was unable to use them.
I have a pandas dataframe in which multiple columns of text strings contain, sometimes, comma separated values. I want to split each CSV field that contains these comma separated values ββfrom three of them, creating a new line for each record, but in parallel, this way (with A, C and D, not using E):
In [10]: before
Out[10]:
A B C D E
0 a1,a2,a3 1 c1, c2, c3 d1,d2,d3 e1,e2,e3
1 a4 2 c4 d4 e4
In [11]: after
Out[11]:
A B C D E
0 a1 1 c1 d1 e1,e2,e3
1 a2 1 c2 d2 e1,e2,e3
2 a3 1 c3 d3 e1,e2,e3
3 a4 2 c4 d4 e4
The problem is that it should work for different tables, and all of these tables can have different columns (but always these three, A, C and D, along with the same names). And in addition, other columns may have comma-separated values ββin some cells that were not separated, as happened with E.
EDIT: These three columns will always have the same number of comma-separated values ββon the same row, but can vary between rows (1 value, 2 comma-separated values, 3 comma-separated values ββ...).
Sorry, I would like to suggest some code, but I couldn't do anything useful.
Can anyone help me? I would really appreciate any help! :)
source to share
If you know that three columns always have the same number of comma separated values, you can do:
>>> df
A B C D E
0 a1,a2,a3 1 c1,c2,c3 d1,d2,d3 e1,e2,e3
1 a4 2 c4 d4 e4
split each column:
>>> for col in ['A', 'C', 'D']:
... df[col] = df[col].str.split(',')
...
>>> df
A B C D E
0 [a1, a2, a3] 1 [c1, c2, c3] [d1, d2, d3] e1,e2,e3
1 [a4] 2 [c4] [d4] e4
define indexers:
>>> i = df['A'].map(len)
>>> j = np.repeat(np.arange(len(df)), i)
>>> k = np.concatenate(list(map(np.arange, i)))
expand the frame:
>>> df = df.iloc[j]
>>> df
A B C D E
0 [a1, a2, a3] 1 [c1, c2, c3] [d1, d2, d3] e1,e2,e3
0 [a1, a2, a3] 1 [c1, c2, c3] [d1, d2, d3] e1,e2,e3
0 [a1, a2, a3] 1 [c1, c2, c3] [d1, d2, d3] e1,e2,e3
1 [a4] 2 [c4] [d4] e4
take one from each list:
>>> for col in ['A', 'C', 'D']:
... df[col] = list(map(lambda xs, i: xs[i], df[col], k))
...
>>> df
A B C D E
0 a1 1 c1 d1 e1,e2,e3
0 a2 1 c2 d2 e1,e2,e3
0 a3 1 c3 d3 e1,e2,e3
1 a4 2 c4 d4 e4
source to share