How to split a column's values into rows to create long-format data
If I have the data shown below, how do I convert it to long format (for example, one term per gene in each row)? I guess I will have to apply or map split(",") to the Terms column, but what should I do after that?
import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
df = pd.read_table(StringIO("""Gene Terms
Mt-nd1 GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd GO:0016021,GO:0045202,GO:0005886
Zmiz1 GO:0005654,GO:0043231
Cdca7 GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")
P.S. The table above is simplified; the actual df will contain many more columns.
P.P.S. In case I was unclear, I want to get something like:
Mt-nd1 GO:0005739
Mt-nd1 GO:0005743
Mt-nd1 GO:0016021
...
Cdca7 GO:0005634
Cdca7 GO:0005654
You can use the str.split method to do the splitting (similar to applying split yourself, but more direct):
In [6]: splitted = df['Terms'].str.split(',', expand=True)
In [7]: splitted
Out[7]:
0 1 2 3 4 5 \
0 GO:0005739 GO:0005743 GO:0016021 GO:0030425 GO:0043025 GO:0070469
1 GO:0016021 GO:0045202 GO:0005886 NaN NaN NaN
2 GO:0005654 GO:0043231 NaN NaN NaN NaN
3 GO:0005622 GO:0005623 GO:0005737 GO:0005634 GO:0005654 NaN
6 7 8
0 GO:0005623 GO:0005622 GO:0005737
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
To get the pieces as separate columns (instead of a single column of lists), use the keyword expand=True for split; for older pandas you can do df['Terms'].str.split(',').apply(pd.Series) to get the same thing.
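As a quick check of the two spellings mentioned here, a minimal sketch on a tiny made-up Series (the values "a,b,c" and "d" are placeholders, not data from the question). It assumes a pandas version that supports expand=True; both calls should produce one column per split piece, padded with missing values where a row has fewer pieces.

```python
import pandas as pd

# Tiny placeholder input (hypothetical values, not from the question).
terms = pd.Series(["a,b,c", "d"])

# Newer spelling: split and expand into columns in one call.
wide_expand = terms.str.split(",", expand=True)

# Older spelling: split into lists, then turn each list into a row.
wide_apply = terms.str.split(",").apply(pd.Series)

print(wide_expand)
print(wide_apply)
```

The missing entries may print as None in one result and NaN in the other, but both count as missing for the stacking step that follows.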
Now, to get the desired result, we have to stack these columns into rows, but first we combine them with the Gene column so that the gene name is kept as the index of the stacked result:
In [14]: stacked = pd.concat([df['Gene'], splitted],axis=1).set_index('Gene').stack()
In [15]: stacked
Out[15]:
Gene
Mt-nd1 0 GO:0005739
1 GO:0005743
2 GO:0016021
3 GO:0030425
4 GO:0043025
5 GO:0070469
6 GO:0005623
7 GO:0005622
8 GO:0005737
Madd 0 GO:0016021
1 GO:0045202
2 GO:0005886
Zmiz1 0 GO:0005654
1 GO:0043231
Cdca7 0 GO:0005622
1 GO:0005623
2 GO:0005737
3 GO:0005634
4 GO:0005654
dtype: object
From here we can reset the index, rename the column holding the terms, and drop the integer column (left over from the auto-generated column names) that we no longer need:
In [19]: stacked.reset_index().rename(columns={0:'Term'}).drop('level_1', axis=1)
Out[19]:
Gene Term
0 Mt-nd1 GO:0005739
1 Mt-nd1 GO:0005743
2 Mt-nd1 GO:0016021
3 Mt-nd1 GO:0030425
4 Mt-nd1 GO:0043025
5 Mt-nd1 GO:0070469
6 Mt-nd1 GO:0005623
7 Mt-nd1 GO:0005622
8 Mt-nd1 GO:0005737
9 Madd GO:0016021
10 Madd GO:0045202
11 Madd GO:0005886
12 Zmiz1 GO:0005654
13 Zmiz1 GO:0043231
14 Cdca7 GO:0005622
15 Cdca7 GO:0005623
16 Cdca7 GO:0005737
17 Cdca7 GO:0005634
18 Cdca7 GO:0005654
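Putting the steps above together, here is a sketch of the whole pipeline as a standalone script rather than an interactive session. It assumes Python 3 and a pandas version that supports expand=True. Two details are my own additions: the explicit dropna(), a precaution because some pandas versions keep the padding entries when stacking, and the name long_df.

```python
from io import StringIO

import pandas as pd

# Sample data from the question (whitespace-separated, two columns).
data = """Gene Terms
Mt-nd1 GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd GO:0016021,GO:0045202,GO:0005886
Zmiz1 GO:0005654,GO:0043231
Cdca7 GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654
"""
df = pd.read_csv(StringIO(data), sep=r"\s+")

# 1. Split the comma-separated terms into one column per term.
splitted = df["Terms"].str.split(",", expand=True)

# 2. Put Gene in the index and stack the term columns into rows.
#    dropna() is a precaution (an assumption, not part of the answer):
#    it removes padding entries if this pandas version keeps them.
stacked = (
    pd.concat([df["Gene"], splitted], axis=1)
    .set_index("Gene")
    .stack()
    .dropna()
)

# 3. Drop the integer level left over from the column names, name the
#    values "Term", and move Gene back into a regular column.
long_df = stacked.reset_index(level=1, drop=True).rename("Term").reset_index()

print(long_df)
```

Dropping the integer level before the final reset_index avoids the separate rename and drop calls, at the cost of being slightly less step-by-step.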
How this can be combined or merged with the other columns you have will depend on what exactly you want to do with them.
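If the goal is simply to repeat the other columns next to each term, one alternative sketch (a different technique from the split-and-stack approach above) is DataFrame.explode, which I believe is available from pandas 0.25 onward. The Score column and its values below are hypothetical stand-ins for the extra columns mentioned in the question.

```python
import pandas as pd

# Hypothetical frame: "Score" stands in for the extra columns
# mentioned in the question; its values are made up.
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Score": [0.5, 1.5],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

# Turn each comma-separated string into a list, then emit one row
# per list element; the other columns are repeated alongside.
long_df = (
    df.assign(Term=df["Terms"].str.split(","))
    .drop(columns="Terms")
    .explode("Term")
    .reset_index(drop=True)
)

print(long_df)
```

Because explode works on the whole frame, no separate merge is needed in this case; a merge on Gene would still be an option if the long table has already been built separately.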