Pandas: handle missing column

I am using the following code to read a CSV file in chunks with pandas.read_csv:

import pandas

headers = ["1","2","3","4","5"]
fields = ["1", "5"]

for chunk in pandas.read_csv(fileName, names=headers, header=0, usecols=fields, chunksize=chunkSize):
    ...  # process each chunk

Sometimes my CSV will not have a "5" column and I want to be able to handle that case and provide some default values. Is there a way to only read the headers of my CSV file without reading the entire file so I can handle it manually? Or could there be any other clever way to default the value for the missing column?

1 answer


If you pass nrows=0, read_csv reads only the header row. You can then call intersection on the resulting column index to find the columns that actually exist and avoid errors:

In[14]:
import io
import pandas as pd

t="""1,2,3,5,6
0,1,2,3,4"""
headers = ["1","2","3","4","5"]
fields = ["1", "5"]
# nrows=0 parses only the header row, not the data
cols = pd.read_csv(io.StringIO(t), nrows=0).columns
cols

Out[14]: Index(['1', '2', '3', '5', '6'], dtype='object')

      

Now that we have the actual column names, we can call intersection to find which of the expected columns are present in the file:

In[15]:
valid_cols = cols.intersection(headers)
valid_cols

Out[15]: Index(['1', '2', '3', '5'], dtype='object')
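The intersection tells you which expected columns exist; to supply defaults you also need the ones that do not. A minimal sketch of that step, reusing the sample data from above and assuming Index.difference is an acceptable way to compute the complement (the variable names `valid` and `missing` are only illustrative):

```python
import io
import pandas as pd

# Same sample as above: the file has no "4" column.
t = """1,2,3,5,6
0,1,2,3,4"""
headers = ["1", "2", "3", "4", "5"]

# Read only the header row.
cols = pd.read_csv(io.StringIO(t), nrows=0).columns

# Expected columns that are present in the file.
valid = cols.intersection(headers)

# Expected columns that are absent from the file.
missing = pd.Index(headers).difference(cols)

print(list(valid))    # ['1', '2', '3', '5']
print(list(missing))  # ['4']
```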

      



You can do the same with fields, and then pass the result into your current code to avoid any exceptions.
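Putting the pieces together for the chunked read might look like the sketch below. It is only one possible arrangement: the helper name `read_chunks_with_defaults`, the `defaults` mapping, and the temporary demo file are all hypothetical, and it assumes the file's own header row carries the column names (so it does not pass names=headers as the question's code does).

```python
import os
import tempfile
import pandas as pd


def read_chunks_with_defaults(path, fields, defaults, chunksize):
    """Yield chunks that contain every column in `fields`.

    Columns absent from the file are filled from `defaults` (hypothetical helper).
    """
    # Read only the header row to learn which columns the file has.
    cols = pd.read_csv(path, nrows=0).columns
    present = [c for c in fields if c in cols]
    missing = [c for c in fields if c not in cols]

    for chunk in pd.read_csv(path, usecols=present, chunksize=chunksize):
        # Add each missing column with its default value.
        for col in missing:
            chunk[col] = defaults[col]
        # Return the columns in the order the caller asked for.
        yield chunk[fields]


# Demo with a temporary file that has no "5" column.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("1,2,3,4\n0,1,2,3\n10,11,12,13\n")
    demo_path = f.name

try:
    chunks = list(
        read_chunks_with_defaults(demo_path, ["1", "5"], {"5": -1}, chunksize=1)
    )
finally:
    os.remove(demo_path)

for chunk in chunks:
    print(chunk)
```

Reading the header separately costs one extra open of the file, but it avoids parsing any data rows, so the overhead should stay small even for large files.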

To demonstrate that passing nrows=0 simply reads the header line:

In[16]:
pd.read_csv(io.StringIO(t), nrows=0)

Out[16]: 
Empty DataFrame
Columns: [1, 2, 3, 5, 6]
Index: []

      
