Creating a pandas DataFrame from a list of dictionaries containing lists of data
I have a list of dictionaries with this structure.
{
'data' : [[year1, value1], [year2, value2], ... m entries],
'description' : string,
'end' : string,
'f' : string,
'lastHistoricalperiod' : string,
'name' : string,
'series_id' : string,
'start' : int,
'units' : string,
'unitsshort' : string,
'updated' : string
}
I want to put this in a pandas DataFrame that looks like
year value updated (other dict keys ... )
0 2040 120.592468 2014-05-23T12:06:16-0400 other key-values
1 2039 120.189987 2014-05-23T12:06:16-0400 ...
2 other year-value pairs ...
...
n
where n = m * len (list with dictionaries) (where the length of each list in 'data' = m)
That is, each [year, value] pair in 'data' must get its own row. What I have done so far is the following:
import pandas as pd

x = [...]  # list of dictionaries as described above

# Create an empty DataFrame
output = pd.DataFrame()
# Loop through each dictionary in the list
for dictionary in x:
    # Create a new DataFrame from the 2-D list alone.
    data = dictionary['data']
    y = pd.DataFrame(data, columns=['year', 'value'])
    # Loop through all the other dictionary key-value pairs and fill in values
    for key in dictionary:
        if key != 'data':
            y[key] = dictionary[key]
    # Concatenate the output so far with the frame from this dictionary.
    output = pd.concat([output, y], ignore_index=True)
This seems to be very hacky and I was wondering if there is a more "pythonic" way of doing this, or at least if there are obvious speedups here.
If your data is in the form [{},{},...], you can do the following ...
The problem with your data is in the data key of your dictionaries.
df = pd.DataFrame(data)  # data is your list of dictionaries
fix = df.groupby(level=0)['data'].apply(lambda x: pd.DataFrame(x.iloc[0], columns=['Year', 'Value']))
fix = fix.reset_index(level=1, drop=True)
df = pd.merge(fix, df.drop(columns=['data']), how='inner', left_index=True, right_index=True)
The code does the following ...
- Creates a DataFrame from the list of dictionaries
- Creates a new DataFrame by stretching the data column into more rows
- The stretch line produces a MultiIndex with an unwanted inner level; reset_index removes it
- Finally, merges the result with the other columns on the original index to get the DataFrame you want
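For comparison, newer pandas releases have DataFrame.explode, which can do the stretch step without the groupby and merge. The following is only a minimal sketch, assuming pandas 1.1 or later (for ignore_index) and that no 'data' list is empty; the sample list and its values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample input shaped like the question's dictionaries.
data = [
    {'data': [[2040, 120.5], [2039, 120.1]], 'name': 'a', 'units': 'x'},
    {'data': [[2040, 3.0]], 'name': 'b', 'units': 'y'},
]

df = pd.DataFrame(data)
# One row per [year, value] pair; the other columns are repeated.
df = df.explode('data', ignore_index=True)
# Split each [year, value] pair into two separate columns.
pairs = pd.DataFrame(df['data'].tolist(), columns=['Year', 'Value'])
# Put the new columns next to the remaining ones (indexes already match).
df = pd.concat([pairs, df.drop(columns=['data'])], axis=1)
```

Because explode resets the index here, the two frames line up row by row and a plain concat along the columns is enough.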
Some data would be helpful in answering this question. However, from your data structure, some sample data might look like this:
dict_list = [{'data' : [['1999', 1], ['2000', 2], ['2001', 3]],
              'description' : 'foo_dictionary',
              'end' : 'foo1',
              'f' : 'foo2',},
             {'data' : [['2002', 4], ['2003', 5]],
              'description' : 'bar_dictionary',
              'end' : 'bar1',
              'f' : 'bar2',}
             ]
My suggestion is to transform this data into a new dictionary and then pass that dictionary to the DataFrame constructor. To pass a dictionary to pd.DataFrame, you can reshape the data into a new dict of equal-length lists like this:
data_dict = {'years' : [],
             'value' : [],
             'description' : [],
             'end' : [],
             'f' : [],}
for dictionary in dict_list:
    # One entry per [year, value] pair
    data_dict['years'].extend([elem[0] for elem in dictionary['data']])
    data_dict['value'].extend([elem[1] for elem in dictionary['data']])
    # Repeat the scalar fields once for each pair (use xrange on Python 2)
    data_dict['description'].extend(dictionary['description'] for x in range(len(dictionary['data'])))
    data_dict['end'].extend(dictionary['end'] for x in range(len(dictionary['data'])))
    data_dict['f'].extend(dictionary['f'] for x in range(len(dictionary['data'])))
and then just pipe this to pandas
import pandas as pd
pd.DataFrame(data_dict)
which gives me the following output (the column order may differ depending on your pandas version):
description end f value years
0 foo_dictionary foo1 foo2 1 1999
1 foo_dictionary foo1 foo2 2 2000
2 foo_dictionary foo1 foo2 3 2001
3 bar_dictionary bar1 bar2 4 2002
4 bar_dictionary bar1 bar2 5 2003
I would say that if this is the type of output you want, then this approach would be a worthwhile simplification.
In fact, you could simplify it even further by building one dictionary for the year and value lists and another for the remaining keys. Then you don't need to write out the new dictionary by hand, and you can fill it with a nested loop. It might look like this:
year_val_dict = {'years' : [],
                 'value' : []}
# One empty list for every key other than 'data'
other_val_dict = {_key : [] for _key in dict_list[0] if _key != 'data'}
for dictionary in dict_list:
    year_val_dict['years'].extend([elem[0] for elem in dictionary['data']])
    year_val_dict['value'].extend([elem[1] for elem in dictionary['data']])
    for _key in other_val_dict:
        # Repeat the scalar field once for each pair (use xrange on Python 2)
        other_val_dict[_key].extend(dictionary[_key] for x in range(len(dictionary['data'])))
# Combine both dictionaries before building the DataFrame
year_val_dict.update(other_val_dict)
pd.DataFrame(year_val_dict)
NB this of course assumes that all dicts in the dict_list have the same structure ....
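If the dictionaries might not all share the same keys, one alternative (a different technique from the dict-of-lists above) is to build one flat record per [year, value] pair and let pandas fill the gaps with NaN. This is a minimal sketch under that assumption; the sample is modelled on dict_list, and the 'units' key on the second entry is a made-up addition to show the missing-key case.

```python
import pandas as pd

# Sample data modelled on dict_list; 'units' appears only in the
# second entry (hypothetical) to illustrate differing structures.
dict_list = [
    {'data': [['1999', 1], ['2000', 2]], 'description': 'foo_dictionary'},
    {'data': [['2002', 4]], 'description': 'bar_dictionary', 'units': 'kg'},
]

records = []
for dictionary in dict_list:
    # Everything except 'data' is repeated on each row.
    shared = {k: v for k, v in dictionary.items() if k != 'data'}
    for year, value in dictionary['data']:
        records.append({'years': year, 'value': value, **shared})

# Keys missing from a record become NaN in the resulting frame.
df = pd.DataFrame(records)
```

The trade-off is one small dict per row instead of a handful of long lists, which is usually fine for moderately sized inputs.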