Construct matrix from multiple files using pandas

Question

I have multiple files (20) in a directory, each with 2 columns, for example:

transcript_id value
ENMUST001     2
ENMUST003     3
ENMUST004     5

The number of lines is different in each file. What I would like to do is combine all 20 files into one huge matrix:

transcript_id value_file1 value_file2....value_file20
ENMUST001     2  3 
ENMUST003     3  4
ENMUST004     5  0

That is, collect all ids from the transcript_id column and the corresponding values from each file (with the filename as the column name), and use 0 where a file has no value for an id.

I tried to do this using pandas:

import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df.set_index('transcript_id')
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 

Error:
ValueError: No objects to concatenate

I wonder whether there is a better way to do this with pandas. Any pseudocode is appreciated.

Edit

The files are now found:

df.set_index('transcript_id')
print (df.shape)

    (921, 1)
    (1414, 1)
    (659, 1)
    (696, 1)
    (313, 1)
print (df.is_unique)
    (921, 1)
    False
    (1414, 1)
    False
    (659, 1)
    False
    (696, 1)
    False
    (313, 1)
    False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

New error
ValueError: All objects passed were None

Printing the shapes before and after dropping duplicates:

before:  (921, 1)
after:  (914, 1)
before:  (1414, 1)
after:  (1410, 1)
before:  (659, 1)
after:  (658, 1)
before:  (696, 1)
after:  (694, 1)
before:  (313, 1)
after:  (312, 1)

python python-2.7 pandas

sid 05 Aug 17 at 16:37

1 answer

vahndi · Accepted Answer · 2017-08-05T16:54:29+0000

The default behavior for set_index is inplace=False, so it returns a new DataFrame and leaves df unchanged. Try replacing df.set_index('transcript_id') with df = df.set_index('transcript_id'). You can also remove duplicate values in the index with df = df[~df.index.duplicated(keep='first')].
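To illustrate the two points about set_index and duplicated index labels, here is a minimal sketch on a small made-up frame. The sample ids and values are assumptions chosen for illustration, not data from the question. It also shows the pattern behind the question's second error: there, calling drop_duplicates with inplace=True and assigning the result left df as None.

```python
import pandas as pd

# Hypothetical sample data; the repeated id mimics the duplicates in the question
df = pd.DataFrame({
    "transcript_id": ["ENMUST001", "ENMUST003", "ENMUST003"],
    "value": [2, 3, 3],
})

# set_index returns a new DataFrame; without assignment, df is left unchanged
df.set_index("transcript_id")
unchanged_columns = list(df.columns)  # transcript_id is still a regular column

# Assigning the result keeps the new index
indexed = df.set_index("transcript_id")

# The question assigned the result of drop_duplicates(inplace=True) back to df;
# there the call returned None, which led to "All objects passed were None"
result = df.copy().drop_duplicates(inplace=True)

# Keep only the first row for each repeated index label
deduped = indexed[~indexed.index.duplicated(keep="first")]
```

Note that drop_duplicates compares row values, while index.duplicated looks only at the index labels, which is what pd.concat with axis=1 needs to be unique.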

import os
import glob
import pandas as pd

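path = 'pathtofiles'
# os.path.join inserts the path separator if path does not end with one;
# without it glob may match nothing, giving "No objects to concatenate"
transFiles = glob.glob(os.path.join(path, "*.tsv"))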
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.set_index('transcript_id') # set index
    df = df[~df.index.duplicated(keep='first')] # remove duplicates
    df.columns = [os.path.split(file)[-1]] # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
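The same approach can be tried end to end without the real data. The sketch below writes two small made-up .tsv files to a temporary directory and builds the matrix from them. The file names, ids and values are hypothetical, and the variable names (samples, frames, matrix) are introduced here for illustration only.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample files standing in for the real .tsv files
samples = {
    "file1.tsv": "transcript_id\tvalue\nENMUST001\t2\nENMUST003\t3\nENMUST004\t5\n",
    "file2.tsv": "transcript_id\tvalue\nENMUST001\t3\nENMUST003\t4\nENMUST003\t4\n",
}

tmpdir = tempfile.mkdtemp()
for name, text in samples.items():
    with open(os.path.join(tmpdir, name), "w") as handle:
        handle.write(text)

frames = []
for tsv_path in sorted(glob.glob(os.path.join(tmpdir, "*.tsv"))):
    df = pd.read_csv(tsv_path, sep="\t")
    df = df.set_index("transcript_id")            # ids become the row labels
    df = df[~df.index.duplicated(keep="first")]   # drop repeated ids
    df.columns = [os.path.basename(tsv_path)]     # filename as column name
    frames.append(df)

# Align on the index; ids missing from a file become NaN, then 0
matrix = pd.concat(frames, axis=1).fillna(0)
```

Here ENMUST004 is absent from the second sample file, so its entry in the file2.tsv column ends up as 0, and the repeated ENMUST003 row in that file is kept only once.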
