Pandas: selecting a single column from a DataFrame returns a DataFrame instead of a Series
This is really a story about two data frames and strangely different behavior.
I have two CSV files that I read into pandas. Neither file contains a header row; the headers are stored in separate files, like this:
$ ls
A.csv  A.header  B.csv  B.header
I'm using pandas to read them, but first I need to parse the header:
def make_header(flnm):
    # Read the one-line header file and split it into column names.
    with open(flnm) as f:
        return f.read().strip(' \t\n\r').split(',')

A_header = make_header('A.header')
B_header = make_header('B.header')
Now I can read in the CSVs:
from pandas import read_csv

A = read_csv('A.csv', header=0, names=A_header)
B = read_csv('B.csv', header=0, names=B_header)
Make sure this worked correctly:
print(type(A))
print(type(B))
Result:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
as expected.
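As a side note on the read step: here is a minimal Python 3 sketch of how `header=0` and `header=None` differ when `names` is passed to `read_csv`. The in-memory strings are hypothetical stand-ins for A.header and A.csv, and the helper name `parse_header` is made up for illustration; none of this is the real data.

```python
import io

import pandas as pd


def parse_header(text):
    """Split a one-line, comma-separated header into column names."""
    return text.strip(" \t\n\r").split(",")


# Hypothetical stand-ins for A.header and A.csv (not the real files).
header_text = "A_x,A_y\n"
csv_text = "1,2\n3,4\n5,6\n"

names = parse_header(header_text)

# header=0 treats the first line as a header row and replaces it with `names`,
# so the first line of the file does not end up in the data.
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=names)

# header=None treats every line as data and only attaches `names` as labels.
kept = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

print(len(replaced), len(kept))
print(type(kept.A_x))
```

If the CSV files really have no header line, `header=None` looks like the safer choice, since `header=0` would silently consume the first data row. This does not by itself explain the Series/DataFrame difference described below.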
Now there is an oddity that I cannot figure out. I want to select one column from each of these dataframes. When I do this, one of the dataframes returns a Series object (as you would expect) and the other returns a single-column DataFrame:
print(type(A.A_x))
print(type(B.B_x))
leads to:
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
As far as I can tell, I processed these files the same way from start to finish, but got different results. What could be causing this? Where is the error in my data sanitization or my understanding of pandas?
A couple of things I looked into
The two columns have the same data types:
print(A.A_x.dtype)
print(B.B_x.B_x.dtype)
gives:
int64
int64
(Of course, I need to select the column twice from dataframe B because of the strange behavior described above.)
I also checked for duplicate names in my header:
$ cat A.header | sed 's/,/\n/g' | grep A_x
> A_x
and
$ cat B.header | sed 's/,/\n/g' | grep B_x
> B_x
So, each specified name appears exactly once.
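The shell check above only inspects the header files. The same question can also be asked of the loaded frame itself. The sketch below (Python 3) uses a made-up frame with a deliberately repeated label, invented purely for illustration, to show how repeated column labels can be listed and what selecting one of them returns. Whether this is what actually happens with B is not established here.

```python
import pandas as pd

# Hypothetical frame with a deliberately repeated column label (illustration only).
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["B_x", "B_y", "B_x"])

# Labels that occur more than once in the column index.
duplicated_labels = df.columns[df.columns.duplicated()].tolist()

# Selecting a repeated label returns every matching column, i.e. a DataFrame;
# selecting a unique label returns a Series.
repeated_type = type(df["B_x"]).__name__
unique_type = type(df["B_y"]).__name__

print(duplicated_labels)
print(repeated_type, unique_type)
```

Running the equivalent of `B.columns[B.columns.duplicated()]` on the real frame would show directly whether any label is repeated after loading, independent of what the header file looks like on disk.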