Parsing key-value pairs in DataFrame columns
I have key-value pairs nested in the rows of a pandas Series. What is the most efficient / optimized way to split them into separate columns? (I can unzip and do conversions by hand, but what's the best way?)
I don't know:
- The key names in advance
- The number of keys in each entry
- The order of the keys in each record
Each entry is a list of Unicode strings. Once extracted, the values will always be integers (bigint).
Input:
parsedSeries.head()
0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]
Expected Result:
record key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
Note that the input consists of lists of Unicode strings of the form u"X=Y", where X is assumed to be a valid Python attribute name and Y can always be interpreted as an integer. For example, the following reproduces the above data:
pandas.Series([[u"key1=774", u"key2=238"],
[u"key1=524", u"key2=101", u"key3=848"],
[u"key3=843"],
[u"key1=232", u"key3=298", u"key2=457"]])
The "best" solution is probably not to find yourself in this situation in the first place. Most of the time, when you have non-scalar values in a Series or DataFrame, you've already taken a step in the wrong direction, because you can't apply vectorized operations.
Anyway, starting with your series, you can do something like this:
>>> ds = [dict(w.split('=', 1) for w in x) for x in s]
>>> pd.DataFrame.from_records(ds)
key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
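One caveat: with this construction the cell values are still the original strings, not numbers. A small sketch of the extra conversion step, assuming `pd.to_numeric` applied column-wise (the variable names here just mirror the snippet above):

```python
import pandas as pd

s = pd.Series([[u"key1=774", u"key2=238"],
               [u"key1=524", u"key2=101", u"key3=848"],
               [u"key3=843"],
               [u"key1=232", u"key3=298", u"key2=457"]])

# Same construction as above: one dict per row, then from_records
ds = [dict(w.split('=', 1) for w in x) for x in s]
df = pd.DataFrame.from_records(ds)

# from_records leaves the values as strings; coerce each column to numbers
df = df.apply(pd.to_numeric)
```

The columns with missing entries come out as float64, since NaN forces the upcast.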
OK, the final answer for you may differ depending on how representative your example is. In particular, the regular expression can be adjusted to match your actual data. Let's do some imports and set up your data file:
import re
import pandas as pd
from io import StringIO
f = StringIO("""0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]""")
We are now ready to start. First, some regex magic to get a dict out of each of your lines:
# get the dicts
rows = [dict(re.findall(r'(key[0-9]+)=([0-9]+)', l)) for l in f]
# convert values to ints
rows = [dict((k,int(v)) for k,v in row.items()) for row in rows]
rows
Output:
[{'key1': 774, 'key2': 238},
{'key1': 524, 'key2': 101, 'key3': 848},
{'key3': 843},
{'key1': 232, 'key2': 457, 'key3': 298}]
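Note that this pattern hardcodes key names of the form keyN. Since the question says the key names aren't known in advance, a more general pattern might be the following (this assumes identifier-like names and decimal integer values, which is an assumption about your data):

```python
import re

# hypothetical row with arbitrary key names, for illustration only
line = "0 [alpha=774, beta_2=238]"

# \w+ matches any identifier-like key name; \d+ matches the integer value
pairs = dict((k, int(v)) for k, v in re.findall(r'(\w+)=(\d+)', line))
```

The leading record number is ignored because it isn't followed by an `=` sign.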
That was mostly just regex work, and you're nearly there:
pd.DataFrame(rows)
Output:
key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
Convert it to a one-liner if you like, but I'll leave it in two steps so you can tweak the regex to match your actual data file.
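For reference, a one-liner sketch of the combined steps (same regex, same data, just collapsed into a single expression):

```python
import re
import pandas as pd
from io import StringIO

f = StringIO("0 [key1=774, key2=238]\n"
             "1 [key1=524, key2=101, key3=848]\n"
             "2 [key3=843]\n"
             "3 [key1=232, key3=298, key2=457]")

# findall, int conversion, and DataFrame construction in one expression
df = pd.DataFrame(
    dict((k, int(v)) for k, v in re.findall(r'(key[0-9]+)=([0-9]+)', line))
    for line in f
)
```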
A very minor tweak to DSM's answer using from_records
, to treat the values as integers rather than strings.
def key_to_int(split_vals):
    return (split_vals[0], int(split_vals[1]))

def dictify(row):
    return dict(key_to_int(elem.split("=")) for elem in row)

pandas.DataFrame.from_records(parsedSeries.map(dictify))
gives
Out[518]:
key1 key2 key3
0 774 238 NaN
1 524 101 848
2 NaN NaN 843
3 232 457 298
[4 rows x 3 columns]
where the values are integers (the columns are still of type float
due to the NaN values
, since NumPy still lacks support for an integer NaN).
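On newer pandas (0.24+) the nullable "Int64" extension dtype can hold integers alongside missing values, so the float upcast can be avoided; a sketch under that assumption:

```python
import pandas as pd

parsedSeries = pd.Series([["key1=774", "key2=238"],
                          ["key1=524", "key2=101", "key3=848"],
                          ["key3=843"],
                          ["key1=232", "key3=298", "key2=457"]])

def dictify(row):
    # split each "X=Y" element and convert Y to int
    return dict((k, int(v)) for k, v in (elem.split("=", 1) for elem in row))

df = pd.DataFrame.from_records(parsedSeries.map(dictify))

# "Int64" (capital I) is pandas' nullable integer dtype: ints plus <NA>
df = df.astype("Int64")
```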