How to split a column's values into rows to create long-format data

If I have the data shown below, how do I convert it to long format (for example, one term per gene in each row)?

I guess I will have to apply or map split(",") on the Terms column, but what should I do after that?

import pandas as pd
from io import StringIO

df = pd.read_table(StringIO("""Gene    Terms
Mt-nd1  GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd    GO:0016021,GO:0045202,GO:0005886
Zmiz1   GO:0005654,GO:0043231
Cdca7   GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")

      

P.S. The table above is simplified; the actual df will contain many more columns.

P.P.S. In case I was unclear, I want to get something like:

Mt-nd1  GO:0005739
Mt-nd1  GO:0005743
Mt-nd1  GO:0016021
...
Cdca7   GO:0005634
Cdca7   GO:0005654

      



1 answer


You can use str.split to do the splitting (instead of applying and splitting yourself, but similar in effect):

In [6]: splitted = df['Terms'].str.split(',', expand=True)

In [7]: splitted 
Out[7]:
            0           1           2           3           4           5  \
0  GO:0005739  GO:0005743  GO:0016021  GO:0030425  GO:0043025  GO:0070469
1  GO:0016021  GO:0045202  GO:0005886         NaN         NaN         NaN
2  GO:0005654  GO:0043231         NaN         NaN         NaN         NaN
3  GO:0005622  GO:0005623  GO:0005737  GO:0005634  GO:0005654         NaN

            6           7           8
0  GO:0005623  GO:0005622  GO:0005737
1         NaN         NaN         NaN
2         NaN         NaN         NaN
3         NaN         NaN         NaN

      

To turn the result into columns (instead of a list in each row), use the keyword expand=True for split; on older pandas you can do df['Terms'].str.split(',').apply(pd.Series) to get the same thing.
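As a small illustration of the difference that expand=True makes, here is a minimal sketch on a two-row Series. The Series `s` and the variable names are made up for the example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical two-row Series standing in for df['Terms'].
s = pd.Series(["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"])

# Without expand: one list-like value per row.
as_lists = s.str.split(",")

# With expand=True: a DataFrame with one column per position,
# padded with missing values where a row has fewer terms.
as_columns = s.str.split(",", expand=True)

print(as_lists)
print(as_columns)
```

The padded missing values are what the stacking step afterwards is expected to get rid of.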

Now, to get the desired result, we have to stack these columns, but first we combine them with the Gene column so that the gene information is kept in the stacked result:



In [14]: stacked = pd.concat([df['Gene'], splitted],axis=1).set_index('Gene').stack()
In [15]: stacked
Out[15]:
Gene
Mt-nd1  0    GO:0005739
        1    GO:0005743
        2    GO:0016021
        3    GO:0030425
        4    GO:0043025
        5    GO:0070469
        6    GO:0005623
        7    GO:0005622
        8    GO:0005737
Madd    0    GO:0016021
        1    GO:0045202
        2    GO:0005886
Zmiz1   0    GO:0005654
        1    GO:0043231
Cdca7   0    GO:0005622
        1    GO:0005623
        2    GO:0005737
        3    GO:0005634
        4    GO:0005654
dtype: object

      

From here we can reset the index, rename our column with terms, and drop the integer column (from the auto-generated column names) that we no longer need:

In [19]: stacked.reset_index().rename(columns={0:'Term'}).drop('level_1', axis=1)
Out[19]:
      Gene        Term
0   Mt-nd1  GO:0005739
1   Mt-nd1  GO:0005743
2   Mt-nd1  GO:0016021
3   Mt-nd1  GO:0030425
4   Mt-nd1  GO:0043025
5   Mt-nd1  GO:0070469
6   Mt-nd1  GO:0005623
7   Mt-nd1  GO:0005622
8   Mt-nd1  GO:0005737
9     Madd  GO:0016021
10    Madd  GO:0045202
11    Madd  GO:0005886
12   Zmiz1  GO:0005654
13   Zmiz1  GO:0043231
14   Cdca7  GO:0005622
15   Cdca7  GO:0005623
16   Cdca7  GO:0005737
17   Cdca7  GO:0005634
18   Cdca7  GO:0005654

      

How this can be combined or merged with the other columns you have will depend on what exactly you want to do with it.
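One possible way to carry the other columns along, sketched under assumptions: the stacked terms keep the original row index, so they can be joined back onto the rest of the frame. The `Chrom` column and its values are hypothetical stand-ins for the "many more columns" mentioned in the question, and the explicit dropna() is there because not every pandas version drops the missing-value padding during stack().

```python
import pandas as pd

# Hypothetical frame: "Chrom" is an invented extra column for illustration.
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Chrom": ["2", "14"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

# Split and stack, keeping the original row index so rows can be matched later.
terms = (
    df["Terms"].str.split(",", expand=True)
    .stack()
    .dropna()                         # remove any padding that stack() left in place
    .reset_index(level=1, drop=True)  # discard the position-within-row level
    .rename("Term")
)

# Join on the original index: every other column is repeated once per term.
long_df = df.drop(columns="Terms").join(terms).reset_index(drop=True)
print(long_df)
```

Joining on the index avoids listing every extra column by name; whether that or a set_index on several columns is more convenient depends on the actual table.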
