Multiple values in one pandas DataFrame column

I have some data that I am processing from XML to a pandas DataFrame. The XML data looks something like this:

<tracks>
  <track name="trackname1" variants="1,2,3,4,5">
    <variant var="1,2,3">
      <leg time="21:23" route_id="5" stop_id="103" serial="1"/>
      <leg time="21:26" route_id="5" stop_id="17" serial="2"/>
      <leg time="21:30" route_id="5" stop_id="38" serial="3"/>
      <leg time="20:57" route_id="8" stop_id="101" serial="1"/>
      <leg time="21:01" route_id="8" stop_id="59" serial="2"/>
      ...
    </variant>
    <variant var="4,5">
      ... more leg elements
    </variant>
  </track>
  <track name="trackname2" variants="1,2,3,4,5,6,7">
    <variant var="1">
      ... more leg elements
    </variant>
    <variant var="2,3,4,5,7">
      ... more leg elements
    </variant>
  </track>
</tracks>

      

I am importing this into pandas because I need to join this data with other DataFrames and query things like: "get all variant 1 legs for route_id 5".

I am trying to figure out how to do this in a pandas DataFrame. Should I make a DataFrame that looks something like this:

track_name     variants  time     route_id  stop_id  serial
"trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"
"trackname1"   "1,2,3"   "21:26"  "5"       "17"     "2"
...
"trackname1"   "4,5"     "21:20"  "5"       "103"    "1"
...
"trackname2"   "1"       "20:59"  "3"       "45"     "1"
... you get the point

      

If this is a reasonable way, how would I (efficiently) extract, for example, "all rows for variant 3 on route_id 5"? Note that this should give me all rows that have 3 in the variants list, not just rows whose variants column is exactly "3".

Is there some other way to construct the DataFrame that would make this easier? Should I be using something other than pandas?

1 answer


Assuming you have enough memory, your task will be easier if your DataFrame holds one variant per row:

track_name     variants  time     route_id  stop_id  serial
"trackname1"   1         "21:23"         5      103       1
"trackname1"   2         "21:23"         5      103       1
"trackname1"   3         "21:23"         5      103       1
"trackname1"   1         "21:26"         5       17       2
"trackname1"   2         "21:26"         5       17       2
"trackname1"   3         "21:26"         5       17       2
...
"trackname1"   4         "21:20"         5      103       1
"trackname1"   5         "21:20"         5      103       1
...
"trackname2"   1         "20:59"         3       45       1

      

Then you can find "all rows for variant 3 on route_id 5" with:

df.loc[(df['variants']==3) & (df['route_id']==5)]

      

If instead you pack multiple variants into one row as a string, for example

"trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"

      

then you can find such rows using



df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]

      

assuming that the variant numbers are always single digits. If there are also two-digit variants such as "13" or "30", you will need to pass a more complex regex pattern to str.contains.
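For example, a word-boundary pattern avoids matching "3" inside "13" or "30" (a sketch, assuming the variants column holds comma-separated strings as above):

df.loc[df['variants'].str.contains(r'\b3\b') & (df['route_id'] == "5")]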

Alternatively, you can use apply to split each variants string on commas:

df['variants'].apply(lambda x: "3" in x.split(','))
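The resulting boolean mask can then be combined with the route filter as before (again assuming route_id is stored as a string here):

mask = df['variants'].apply(lambda x: "3" in x.split(','))
df.loc[mask & (df['route_id'] == "5")]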

      

but this is much less efficient, since you are now calling a Python function once per row and doing string splitting and a membership test instead of a single vectorized integer comparison.

So, to avoid a potentially complex regex or a relatively slow apply call, I find it best to build the DataFrame with one integer variant per row.
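For completeness, here is a minimal sketch of building that long-format frame directly from the XML using the standard library's xml.etree.ElementTree (assuming the full XML from the question is saved as "tracks.xml"; the file name and column names are just placeholders):

import xml.etree.ElementTree as ET
import pandas as pd

rows = []
root = ET.parse("tracks.xml").getroot()
for track in root.iter("track"):
    for variant in track.iter("variant"):
        for leg in variant.iter("leg"):
            # var="1,2,3" becomes three rows for this leg, one integer variant each
            for var in variant.get("var").split(","):
                rows.append({
                    "track_name": track.get("name"),
                    "variants": int(var),
                    "time": leg.get("time"),
                    "route_id": int(leg.get("route_id")),
                    "stop_id": int(leg.get("stop_id")),
                    "serial": int(leg.get("serial")),
                })

df = pd.DataFrame(rows)
print(df.loc[(df["variants"] == 3) & (df["route_id"] == 5)])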
