Python: reading 200K JSON files into a pandas DataFrame

I am very new to Python (< 2 weeks) and was asked to read 200k+ JSON files, provided as-is, into one database (using Python). The JSON files have flat, one-level attributes, and the number of attributes varies from 50 to 1,000 across files; each file's attributes are a subset of the full set of 1,000.

Here is a snippet from one of the JSON files:

{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution",
}


Our goal is to take this master database built from the JSON files, filter down to individual columns, run descriptive statistics on those columns, and create a final, cleaned database.
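
For context, the later per-column analysis would be along these lines (master_df and the chosen column are just illustrative; this reads back the CSV that the code below writes out):

import pandas as pd

# master_df stands in for the assembled table: one row per JSON file
master_df = pd.read_csv('/home/ubuntu/output/output_db.csv')

# descriptive statistics on individual columns, e.g. overall_status
print(master_df['overall_status'].value_counts())
print(master_df.describe(include='all'))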

I come from a SAS background, so my thought was to use pandas and build one (very) large DataFrame. I combed through Stack Overflow last week and put what I learned to use, but I feel there must be a way to make this far more efficient.

Below is the code I've written so far. It works, but it is very slow (I believe it would take days, if not weeks, to run, even after removing the unnecessary input attributes, i.e. the columns starting with "result").

Also, my awkward way of converting the dictionaries into the final table leaves a numeric index row above the column names, and I have not been able to figure out how to remove it.

import json, os
import pandas as pd    
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

#Gets list of files in directory with *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#Initialize list to collect one Series per file
df_list = []

#For every json file found
for js in list_files:

    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read())                         #Loads Json file into dictionary
        data_file.close()                                           #Close data file / remove from memory

        data_copy = deepcopy(data)                                  #Copies dictionary file
        for k in data_copy.keys():                                  #Iterate over copied dictionary file
            if k.startswith('result'):                              #If field starts with "result" then delete from dictionary
                del data[k]
        df = pd.Series(data)                                        #Convert Dictionary to Series
        df_list.append(df)                                          #Append Series to the list
        database = pd.concat(df_list, axis=1).reset_index()         #Concatenate series into database

output_db = database.transpose()                                    #Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode = 'w', index=False)

      

Any thoughts or advice are greatly appreciated. I am completely open to using another technique or approach entirely (in Python) if it is more efficient and still allows us to accomplish the tasks above.

Thanks!

2 answers


I have tried to replicate your approach in a more concise way, with fewer copies and appends. It works with the sample data you provided, but I don't know whether your dataset holds additional complications. Give it a try; I hope the comments help.



import json
import os
import pandas


path_to_json = "XXX"

list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#set up an empty dictionary
resultdict = {}

for fili in list_files:
    #the with avoids the extra step of closing the file
    with open(os.path.join(path_to_json, fili), "r") as inputjson:
        #the dictionary key is set to filename here, but you could also use e.g. a counter
        resultdict[fili] = json.load(inputjson)
        """
        you can exclude stuff here or later via dictionary comprehensions: 
        http://stackoverflow.com/questions/1747817/create-a-dictionary-with-list-comprehension-in-python
        e.g. as in your example code
        resultdict[fili] = {k:v for k,v in json.load(inputjson).items() if not k.startswith("result")}
        """

#put the whole thing into the DataFrame     
dataframe = pandas.DataFrame(resultdict)

#write out, transpose for desired format
with open("output.csv", "w") as csvout:
    dataframe.T.to_csv(csvout)
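
Because everything is first collected into a single dictionary keyed by filename, pandas.DataFrame() aligns the varying attribute sets across files in one step, so there is no repeated concatenation while the files are being read.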



The most important performance mistake is probably this:

database = pd.concat(df_list, axis=1).reset_index()


You do this inside the loop, appending one more element to df_list each time and then concatenating all over again. But the database variable is not used until the end, so you can perform this step just once, outside the loop.

With pandas, concat in a loop is a huge anti-pattern. Build your list in the loop, then concat once, as in the sketch below.
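
A minimal sketch of that restructuring, reusing the names, paths and the "result" filter from your question:

import json
import os
import pandas as pd

path_to_json = '/home/ubuntu/json_flat/'
list_files = [f for f in os.listdir(path_to_json) if f.endswith('.json')]

df_list = []
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.load(data_file)
    # drop the "result" attributes before building the Series
    data = {k: v for k, v in data.items() if not k.startswith('result')}
    df_list.append(pd.Series(data))

# one concat after the loop, instead of one per file
database = pd.concat(df_list, axis=1)
database.T.to_csv('/home/ubuntu/output/output_db.csv', index=False)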



Second, you should use pandas to read the JSON files: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

Keep it simple. Write a function that takes a path, calls pd.read_json(), removes the rows you don't want (series.str.startswith() helps here), etc.
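
For the flat, one-level files shown in the question, a minimal sketch of such a function might look like this (the name load_study is just illustrative; typ='series' makes read_json return a Series, and the filter runs on the index because the "result" pattern lives in the keys):

import pandas as pd

def load_study(path):
    # read one flat JSON file as a Series: keys become the index
    s = pd.read_json(path, typ='series')
    # drop attributes whose key starts with "result"
    return s[~s.index.str.startswith('result')]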

Once you have that working well, the next step is to check whether you are CPU-bound (CPU usage at 100%) or I/O-bound (CPU usage well below 100%).
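
One quick way to check this from inside Python itself (run_the_import() is just a placeholder for whatever function does the loading):

import time

t_wall = time.perf_counter()
t_cpu = time.process_time()

run_the_import()  # placeholder: the code that reads the 200k files

wall = time.perf_counter() - t_wall
cpu = time.process_time() - t_cpu
# CPU time close to wall time => CPU-bound; much lower => I/O-bound
print("wall %.1fs, cpu %.1fs (%.0f%% of wall)" % (wall, cpu, 100 * cpu / wall))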


