Retrieving custom header column names from a DataFrame, filling missing columns with NA
I imported a tab-delimited file using Pandas read_csv in a Jupyter notebook (Python 2) and selected one column of interest:
rawData = pd.read_csv(filename, delim_whitespace = True, header = 20)
columnOfInterest = rawData.iloc[:, 9]  # .ix was removed in later pandas versions; .iloc selects by position
The format of my column of interest is something like this:
header1=123;header2=123;header3=123
Not every row in this DataFrame has every header, and I don't know the full set of possible headers. The data values, shown here as "123", are all numbers.
After splitting the items in a column using ; as my delimiter, each row has a number of columns equal to the number of values in that row, which is not uniform across the dataset (ragged). I want to convert this to a matrix with missing values.
What I would like to do is take each row from my DataFrame, extract the header information, and if the header label is new (that is, it is not in any of the already processed rows), add it to my list of column names. I would also like the header names and equals signs removed from the rows, with every data value placed in the column that matches its header. So, I would like something similar to this:
# Original data frame, first 2 rows
['header1=123', 'header2=123', 'header3=123'] # <--- no header4
['header1=123', 'header3=123', 'header4=123'] # <--- no header2
# New data frame, first 2 rows plus column names
header1 header2 header3 header4
123 123 123 null # <--- header4 == null
123 null 123 123 # <--- header2 == null
This seems like a job for a regular expression! However, I am at a loss as to how to do this in Pandas. Missing data must be null.
If you have a dataframe like
df = pd.DataFrame([['header1=123', 'header2=123', 'header3=123'],['header1=123', 'header3=123', 'header4=123']])
Then you can split each item on = and build one dictionary per row; the pd.DataFrame constructor takes care of the rest, i.e.
new = [[j.split('=') for j in i] for i in df.values ]
di=[{k:j for k,j in i} for i in new]
new_df = pd.DataFrame(di)
Output:
Dict:
[{'header1': '123', 'header2': '123', 'header3': '123'}, {'header1': '123', 'header3': '123', 'header4': '123'}]
DataFrame:
  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123
Hope it helps
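One caveat worth noting: if the rows are ragged, pandas pads the shorter rows with missing values, and calling .split on those padding cells raises an error. A minimal sketch that skips the padding, assuming every real cell is a string of the form name=value (the two-item first row below is made up for illustration):

```python
import pandas as pd

# Ragged input: the first row has only two items, so pandas pads it.
df = pd.DataFrame([['header1=123', 'header2=123'],
                   ['header1=123', 'header3=123', 'header4=123']])

# Keep only real strings, then split each "name=value" item once.
records = [dict(item.split('=', 1) for item in row if isinstance(item, str))
           for row in df.values]

new_df = pd.DataFrame(records)
print(new_df)
```

The isinstance check is a deliberately simple filter; it ignores both None and NaN padding without needing to know which one pandas used.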
You can use a nested list comprehension to convert each row to a dict and then just call the DataFrame constructor:
print (df)
col
0 header1=123;header2=123;header3=123
1 header1=123;header3=123;header4=123
d = [dict([y.split('=') for y in x]) for x in df['col'].str.split(';').values.tolist()]
print (d)
[{'header1': '123', 'header3': '123', 'header2': '123'},
{'header1': '123', 'header4': '123', 'header3': '123'}]
df = pd.DataFrame(d)
print (df)
header1 header2 header3 header4
0 123 123 123 NaN
1 123 NaN 123 123
If the values are already split by ;, the solution is simpler:
print (df)
col
0 [header1=123, header2=123, header3=123]
1 [header1=123, header3=123, header4=123]
d = [dict([y.split('=') for y in x]) for x in df['col'].values.tolist()]
df = pd.DataFrame(d)
print (df)
header1 header2 header3 header4
0 123 123 123 NaN
1 123 NaN 123 123
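Note that the values in the result are still strings. Since the question says the data values are all numbers, a possible follow-up step is to convert them with pd.to_numeric. This is only a sketch; the names wide and numeric are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col': ['header1=123;header2=123;header3=123',
                           'header1=123;header3=123;header4=123']})

# Same dict-per-row approach as above.
d = [dict(y.split('=') for y in x.split(';')) for x in df['col']]
wide = pd.DataFrame(d)

# Convert every column from strings to numbers; missing values stay NaN.
numeric = wide.apply(pd.to_numeric)
print(numeric.dtypes)
```

If some values might not be valid numbers, pd.to_numeric also accepts errors='coerce', which turns anything unparseable into NaN instead of raising.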
Using apply
In [1178]: df.col.apply(lambda x: pd.Series(
dict([tuple(y.split('=')) for y in x.split(';')])))
Out[1178]:
header1 header2 header3 header4
0 123 123 123 NaN
1 123 NaN 123 123
Or
In [1532]: df.col.apply(lambda x: pd.Series(
dict(map(lambda y: tuple(y.split('=')), x.split(';')))))
Out[1532]:
header1 header2 header3 header4
0 123 123 123 NaN
1 123 NaN 123 123
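Since the question mentions regular expressions, here is an alternative sketch using Series.str.extractall followed by unstack. It assumes each header appears at most once per row (duplicates would make unstack raise an error); the group names key and value are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col': ['header1=123;header2=123;header3=123',
                           'header1=123;header3=123;header4=123']})

# One match per "name=value" pair; named groups become the column names.
pairs = df['col'].str.extractall(r'(?P<key>[^=;]+)=(?P<value>[^;]*)')

# Drop the per-row match counter, then pivot the header names into columns.
wide = (pairs.reset_index(level='match', drop=True)
             .set_index('key', append=True)['value']
             .unstack('key'))
print(wide)
```

This avoids calling a Python function per row, at the cost of being a little harder to read than the dict-based versions.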