Parsing key-value pairs in DataFrame columns

I have key-value pairs nested inside the rows of a pandas Series. What is the most efficient / optimized way to split them into separate columns? (I can unzip and do conversions, but what's the best way?)

I don't know:

  • The keys in advance
  • Number of keys in each entry
  • The order of the keys in each record

Each entry is a list of Unicode strings. Once parsed, the values will always be bigint.

Input:

parsedSeries.head()

0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]


Expected Result:

record   key1   key2   key3
0        774    238    NaN
1        524    101    848
2        NaN    NaN    843
3        232    457    298


Note that the input consists of lists of Unicode strings of the form u"X=Y", where X is assumed to be usable as an attribute name in Python, and Y can always be interpreted as an integer. For example, the following reproduces the data above:

pandas.Series([[u"key1=774", u"key2=238"],
               [u"key1=524", u"key2=101", u"key3=848"],
               [u"key3=843"],
               [u"key1=232", u"key3=298", u"key2=457"]])


+3




3 answers


The "best" solution is probably not to end up in this situation in the first place. Most of the time, when you have non-scalar values in a Series or DataFrame, you have already taken a step in the wrong direction, because you can't apply vectorized operations.
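(As an aside: once each list is flattened to one "key=value" string per row, vectorized string operations do become applicable again. A sketch of that route, assuming a newer pandas with Series.explode available, and keeping the sample data from the question:

```python
import pandas as pd

s = pd.Series([[u"key1=774", u"key2=238"],
               [u"key1=524", u"key2=101", u"key3=848"],
               [u"key3=843"],
               [u"key1=232", u"key3=298", u"key2=457"]])

# One "key=value" string per row; the original row label is kept as the index.
exploded = s.explode()

# Vectorized split into a key column (0) and a value column (1).
kv = exploded.str.split('=', expand=True)

# Pivot the keys out into columns and convert the values to numbers.
df = kv.set_index(0, append=True)[1].unstack().apply(pd.to_numeric)
```

This stays inside pandas string methods throughout, at the cost of a couple of reshaping steps.)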

Anyway, starting with your series, you can do something like this:



>>> ds = [dict(w.split('=', 1) for w in x) for x in s]
>>> pd.DataFrame.from_records(ds)
  key1 key2 key3
0  774  238  NaN
1  524  101  848
2  NaN  NaN  843
3  232  457  298
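Note that this leaves every value as a string. A follow-up conversion (a sketch using pandas' to_numeric on the same sample data) could look like:

```python
import pandas as pd

s = pd.Series([[u"key1=774", u"key2=238"],
               [u"key1=524", u"key2=101", u"key3=848"],
               [u"key3=843"],
               [u"key1=232", u"key3=298", u"key2=457"]])

ds = [dict(w.split('=', 1) for w in x) for x in s]
df = pd.DataFrame.from_records(ds)

# The parsed values are still strings at this point; convert each column.
# NaN forces float columns, since NumPy integer arrays cannot hold NaN.
df = df.apply(pd.to_numeric)
```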


+2




OK, the final answer for you may differ depending on how representative your example is. In particular, the regular expression can be adjusted to fit your actual data.

Let's do some imports and set up your data file:

import re
import pandas as pd
from StringIO import StringIO

f = StringIO("""0 [key1=774, key2=238]
1 [key1=524, key2=101, key3=848]
2 [key3=843]
3 [key1=232, key3=298, key2=457]""")


We are now ready to start. First, some regex magic to turn each of your lines into a dict:

# get the dicts
rows = [dict(re.findall('(key[0-9]*)=([0-9]*)',l)) for l in f]
# convert values to ints
rows = [dict((k,int(v)) for k,v in row.items()) for row in rows]
rows


Output:



[{'key1': 774, 'key2': 238},
 {'key1': 524, 'key2': 101, 'key3': 848},
 {'key3': 843},
 {'key1': 232, 'key2': 457, 'key3': 298}]


That was just the regex work; from here you're almost there:

pd.DataFrame(rows)


Output:



  key1 key2 key3
0  774  238  NaN
1  524  101  848
2  NaN  NaN  843
3  232  457  298


Turn it into a one-liner if you like, but I leave it in two steps so you can tweak the regex to match your actual data file.
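As a sketch, the one-liner version might look like this (using a plain list of lines in place of the StringIO file, to keep it self-contained):

```python
import re
import pandas as pd

lines = ["0 [key1=774, key2=238]",
         "1 [key1=524, key2=101, key3=848]",
         "2 [key3=843]",
         "3 [key1=232, key3=298, key2=457]"]

# Parse, convert to int, and build the DataFrame in a single expression.
df = pd.DataFrame([{k: int(v) for k, v in re.findall(r'(key[0-9]*)=([0-9]*)', l)}
                   for l in lines])
```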

+1




A very minor tweak to DSM's answer, using from_records so that the values are treated as integers rather than strings.

def key_to_int(split_vals):
    return (split_vals[0], int(split_vals[1]))

def dictify(row):
    return dict(key_to_int(elem.split("=")) for elem in row)

pandas.DataFrame.from_records(parsedSeries.map(dictify))


gives

Out[518]: 
   key1  key2  key3
0   774   238   NaN
1   524   101   848
2   NaN   NaN   843
3   232   457   298

[4 rows x 3 columns]


where the values are integers (the columns still have dtype float because of the NaN values, and NumPy continues to lack support for an integer NaN).
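For what it's worth, newer pandas versions (0.24+) do offer a nullable integer extension dtype, "Int64", which can hold missing values alongside true integers. A sketch, starting from float columns like the ones above:

```python
import pandas as pd

df = pd.DataFrame({'key1': [774, 524, None, 232],
                   'key2': [238, 101, None, 457],
                   'key3': [None, 848, 843, 298]})

# The columns start out as float64 because of the missing values;
# casting to the nullable "Int64" extension dtype keeps them integral.
df = df.astype('Int64')
```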

+1








