How can I import data from columns that share the same name using np.genfromtxt?

I have data in the data.dat file of the form:

column_1    col col col col col
1   2   3   1   2   3
4   3   2   3   2   4
1   4   3   1   4   3
5   6   4   5   6   4

      

I am trying to import it with np.genfromtxt so that all the data under the column name col is stored in the variable y. I tried this code:

import numpy as np
data = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True).transpose()
y = data['col']

      

But this gives me the following error:

ValueError: two fields with the same name

      

How can this be solved in Python?

1 answer


When you use names=True, np.genfromtxt returns a structured array. Note that the columns labeled col in data.dat are given disambiguated column names of the form col_n:

In [114]: arr = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True)

In [115]: arr
Out[115]: 
array([(1, 2, 3, 1, 2, 3), (4, 3, 2, 3, 2, 4), (1, 4, 3, 1, 4, 3),
       (5, 6, 4, 5, 6, 4)], 
      dtype=[('column_1', '<i8'), ('col', '<i8'), ('col_1', '<i8'), ('col_2', '<i8'), ('col_3', '<i8'), ('col_4', '<i8')])
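
It is still possible to collect those fields from the structured array, though it takes some extra work. The following is a rough sketch rather than part of the original answer: it assumes NumPy 1.16 or later (for structured_to_unstructured), assumes the generated names follow the col, col_1, ... pattern shown above, and uses an in-memory io.StringIO as a stand-in for data.dat.

```python
import io
import re

import numpy as np
from numpy.lib import recfunctions as rfn

# In-memory stand-in for data.dat (assumed to be tab-delimited).
sample = io.StringIO(
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)
arr = np.genfromtxt(sample, comments='#', delimiter='\t', dtype=None, names=True)

# Keep 'col' and its renamed duplicates ('col_1', 'col_2', ...) but not
# 'column_1'. Caveat: a column genuinely named 'col_1' in the file would
# also match this pattern.
fields = [n for n in arr.dtype.names if re.fullmatch(r'col(_\d+)?', n)]

# Turn the selected fields into an ordinary 2-D array.
y = rfn.structured_to_unstructured(arr[fields])
print(y.shape)
```

The pattern match on generated names is the fragile part, which is why reading the header separately tends to be simpler.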

      

So, once you use names=True, it becomes more difficult to select all the data associated with the column name col. Moreover, a structured array cannot be sliced across several columns at once the way a 2-D array can. So it would be more convenient to load the data into an array of uniform dtype instead (which is what you would get without names=True):

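# Open in text mode: on Python 3, 'rb' yields bytes and split('\t') fails
with open('data.dat', 'r') as f:
    header = f.readline().strip().split('\t')
    # The header line is already consumed, so only data rows are parsed
    arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)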

      

Then you can find the numeric indices of those columns whose name is col:

idx = [i for i, col in enumerate(header) if col=='col']

      

and select all data with

y = arr[:, idx]

      




For example,

import numpy as np

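# Text mode, so the header can be split on '\t' under Python 3
with open('data.dat', 'r') as f:
    header = f.readline().strip().split('\t')
    arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)
    # Positions of every column whose header is exactly 'col'
    idx = [i for i, col in enumerate(header) if col=='col']
    y = arr[:, idx]
    print(y)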

      

gives

[[2 3 1 2 3]
 [3 2 3 2 4]
 [4 3 1 4 3]
 [6 4 5 6 4]]

      

If you want y to be one-dimensional, you can use ravel():

print(y.ravel())

      

gives

[2 3 1 2 3 3 2 3 2 4 4 3 1 4 3 6 4 5 6 4]
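
For a version that can be run without a file on disk, here is a minimal sketch of the same approach wrapped in a function. The helper name columns_named and the in-memory sample (io.StringIO standing in for data.dat) are assumptions made for illustration; the sketch also assumes a tab-delimited file whose first line is the header and which has more than one data row.

```python
import io

import numpy as np


def columns_named(source, name, delimiter='\t'):
    """Return a 2-D array of every column whose header equals `name`.

    Hypothetical helper: `source` is an open text-mode file (or file-like
    object) whose first line holds the column names.
    """
    header = source.readline().strip().split(delimiter)
    # The header line is already consumed, so genfromtxt sees only data rows.
    arr = np.genfromtxt(source, comments='#', delimiter=delimiter, dtype=None)
    idx = [i for i, col in enumerate(header) if col == name]
    return arr[:, idx]


# In-memory stand-in for data.dat (tab-delimited, same values as above).
sample = io.StringIO(
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)
y = columns_named(sample, 'col')
print(y.shape)  # (4, 5)
```

With a real file, the call would look like columns_named(open('data.dat'), 'col'), ideally inside a with block so the file is closed afterwards.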

      
