Extracting column names from key=value strings in a DataFrame column and filling missing columns with NA

I imported a tab-delimited file using pandas read_csv in a Jupyter notebook (Python 2) and selected one column of interest:

import pandas as pd

rawData = pd.read_csv(filename, sep='\t', header=20)  # tab-delimited file
columnOfInterest = rawData.iloc[:, 9]  # .ix is gone from current pandas; .iloc selects by position

      

The format of my column of interest is something like this:

header1=123;header2=123;header3=123

      

Not every row in this DataFrame has every header, and I don't know the full set of possible headers. The data values (shown as "123" here) are all numbers.

After splitting the items in the column using ; as my delimiter, each row has as many columns as it has values, and that number is not uniform across the dataset (the result is ragged). I want to convert this to a matrix with missing values.

What I would like to do is take each row from my DataFrame, extract the header information, and, if a header label is new (that is, it has not appeared in any already processed row), add it to my list of column names. I would also like the header names and equals signs removed from the rows, with every data value placed in the column that matches the header attached to it. So, I would like something similar to this:

# Original data frame, first 2 rows
['header1=123', 'header2=123', 'header3=123'] # <--- no header4
['header1=123', 'header3=123', 'header4=123'] # <--- no header2

# New data frame, first 2 rows plus column names
header1    header2    header3    header4 
123        123        123        null    # <--- header4 == null
123        null       123        123     # <--- header2 == null

      

This seems like a job for a regular expression! However, I am at a loss as to how to do this in pandas. Missing data must be null.



3 answers


If you have a dataframe like

df = pd.DataFrame([['header1=123', 'header2=123', 'header3=123'],['header1=123', 'header3=123', 'header4=123']])

      

Then you can split each item on = to get key/value pairs, build one dictionary per row, and the pd.DataFrame constructor will take care of the rest, i.e.:

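# Split every 'key=value' item into a [key, value] pair
new = [[j.split('=') for j in i] for i in df.values]

# One dict per row: {header: value}
di = [{k: v for k, v in row} for row in new]

# Headers missing from a row become NaN
new_df = pd.DataFrame(di)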

      

Output:



Dict:

[{'header1': '123', 'header2': '123', 'header3': '123'},
 {'header1': '123', 'header3': '123', 'header4': '123'}]

DataFrame:

  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123
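The question mentions that the split rows are ragged. The following is a minimal sketch of the same dict-based idea for that case, assuming the shorter rows end up padded with None/NaN (which is what pandas does when building a frame from lists of unequal length). The sample data and the names `records` and `new_df` are made up for illustration.

```python
import pandas as pd

# Ragged input: the second row has only two items, so pandas pads it.
df = pd.DataFrame([
    ['header1=123', 'header2=123', 'header3=123'],
    ['header1=123', 'header4=123'],
])

# Build one dict per row, skipping padded (non-string) cells.
# split('=', 1) splits on the first '=' only, in case a value contains '='.
records = [
    dict(item.split('=', 1) for item in row if isinstance(item, str))
    for row in df.values
]

# The constructor aligns the dicts by key and fills the gaps with NaN.
new_df = pd.DataFrame(records)
print(new_df)
```

An item without any = in it would raise a ValueError here, so malformed cells would need to be filtered out first.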

Hope it helps



You can use a nested list comprehension to convert each row to a dict and then pass the list of dicts to the DataFrame constructor:

print (df)
                                   col
0  header1=123;header2=123;header3=123
1  header1=123;header3=123;header4=123

d = [dict([y.split('=') for y in x]) for x in df['col'].str.split(';').values.tolist()]
print (d)
[{'header1': '123', 'header3': '123', 'header2': '123'},
 {'header1': '123', 'header4': '123', 'header3': '123'}]

df = pd.DataFrame(d)
print (df)
  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123

      



If the values are already split by ; (so each cell holds a list), the solution is simpler:

print (df)
                                       col
0  [header1=123, header2=123, header3=123]
1  [header1=123, header3=123, header4=123]

d = [dict([y.split('=') for y in x]) for x in df['col'].values.tolist()]
df = pd.DataFrame(d)
print (df)
  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123
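Note that the values in the result are still strings. Since the question says the data values are all numbers, a possible follow-up step is converting them with pd.to_numeric. This is only a sketch on the sample data from above; the names `wide` and `numeric` are mine.

```python
import pandas as pd

df = pd.DataFrame({'col': ['header1=123;header2=123;header3=123',
                           'header1=123;header3=123;header4=123']})

# Same dict-per-row approach as above, starting from the raw strings.
d = [dict(y.split('=') for y in x.split(';')) for x in df['col']]
wide = pd.DataFrame(d)

# Convert every column from strings to numbers; missing entries stay missing.
# Columns that contain NaN usually end up as floats, since NaN is a float.
numeric = wide.apply(pd.to_numeric)
print(numeric)
print(numeric.dtypes)
```

If some values might not be numeric, pd.to_numeric also accepts errors='coerce', which turns unparseable entries into NaN instead of raising.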

      



Using apply

In [1178]: df.col.apply(lambda x: pd.Series(
                        dict([tuple(y.split('=')) for y in x.split(';')])))
Out[1178]:
  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123

      

Or

In [1532]: df.col.apply(lambda x: pd.Series(
                        dict(map(lambda y: tuple(y.split('=')), x.split(';')))))
Out[1532]:
  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123
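Since the question asks about a regular expression, here is a sketch of a regex-based alternative using Series.str.extractall followed by unstack. It avoids building a pd.Series per row. The pattern and the group names `key` and `value` are my own choices, and the sketch assumes no header is repeated within a single row (unstack would fail on duplicates).

```python
import pandas as pd

df = pd.DataFrame({'col': ['header1=123;header2=123;header3=123',
                           'header1=123;header3=123;header4=123']})

# One match per key=value pair; the named groups become the
# columns 'key' and 'value' of the result.
pairs = df['col'].str.extractall(r'(?P<key>[^=;]+)=(?P<value>[^;]*)')

# extractall indexes matches by (original row, match number). Drop the
# match number, then pivot so that each distinct key becomes a column.
wide = (pairs.reset_index(level='match', drop=True)
             .set_index('key', append=True)['value']
             .unstack('key'))
print(wide)
```

Rows with no key=value pair at all produce no matches and would be missing from the result; wide.reindex(df.index) would bring them back as all-NaN rows.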

      
