Construct matrix from multiple files using pandas
Has multiple files (20) in a 2 column directory, for example
transcript_id value
ENMUST001 2
ENMUST003 3
ENMUST004 5
the number of lines is different in each file, what I would like to do is combine all 20 files into one huge matrix.
transcript_id value_file1 value_file2....value_file20
ENMUST001 2 3
ENMUST003 3 4
ENMUST004 5 0
Collect all ids from the transcript_id column and corresponding values ββfrom each file (filename as column name), and if no value, use 0.
I tried to do this using pandas,
import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
df = pd.read_csv(file, sep='\t')
df.set_index('transcript_id')
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
Error:
ValueError: No objects to concatenate
I wonder if it is better to use the <pandas method? any pseudocode is evaluated.
change
found found
df.set_index('transcript_id')
print (df.shape)
(921, 1)
(1414, 1)
(659, 1)
(696, 1)
(313, 1)
print (df.is_unique)
(921, 1)
False
(1414, 1)
False
(659, 1)
False
(696, 1)
False
(313, 1)
False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
New error
ValueError: All objects passed were None
reprint
before: (921, 1)
after: (914, 1)
before: (1414, 1)
after: (1410, 1)
before: (659, 1)
after: (658, 1)
before: (696, 1)
after: (694, 1)
before: (313, 1)
after: (312, 1)
source to share
The default behavior for set_index is inplace=False
. Try replacing df.set_index('transcript_id')
with df = df.set_index('transcript_id')
. You can also remove duplicate values ββin the index with df = df[~df.index.duplicated(keep='first')]
.
import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
df = pd.read_csv(file, sep='\t')
df = df.set_index('transcript_id') # set index
df = df[~df.index.duplicated(keep='first')] # remove duplicates
df.columns = [os.path.split(file)[-1]] # set column name to filename
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)
source to share