Python: reading 200K JSON files into a Pandas DataFrame
I am very new to Python (<2 weeks) and was asked to read 200k+ JSON files that I was provided (as-is) into one database (using Python). These JSON files have flat, one-level attributes that vary from 50 to 1000 across files, but those 50 are a subset of the 1000.
Here is a snippet of one of the JSON files:
{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution"
}
Our goal is to take this master database of the JSON files, clean up individual columns, run descriptive statistics on those columns, and create a final, cleaned database.
I am coming from a SAS background, so I thought to use pandas and build one (very) large DataFrame. I was combing through Stack Overflow last week and picked up some ideas, but I feel like there must be a way to make this much more efficient.
Below is the code I've written so far - it works, but is very slow (I believe it will take days, if not weeks, to run even after removing the unnecessary input attributes / columns starting with "result").
Also, the awkward way I convert the dictionaries into a summary table leaves behind an extra numeric index column alongside the column names, and I was unable to figure out how to delete it.
import json, os
import pandas as pd
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

#Gets list of files in directory with *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#Initialize series
df_list = []

#For every json file found
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read()) #Loads Json file into dictionary
        data_file.close() #Close data file / remove from memory
    data_copy = deepcopy(data) #Copies dictionary file
    for k in data_copy.keys(): #Iterate over copied dictionary file
        if k.startswith('result'): #If field starts with "result" then delete from dictionary
            del data[k]
    df = pd.Series(data) #Convert Dictionary to Series
    df_list.append(df) #Append to empty series
    database = pd.concat(df_list, axis=1).reset_index() #Concatenate series into database

output_db = database.transpose() #Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode='w', index=False)
Any thoughts or advice would be greatly appreciated. I am completely open to using another technique or approach entirely (in Python) if it is more efficient and still allows us to accomplish the tasks above.
Thanks!
I have tried to replicate your approach in a more concise manner, with fewer copies and appends. It works with the sample data you provided, but I don't know if there are additional complications in your dataset. Give it a try; I hope the comments help.
import json
import os
import pandas

path_to_json = "XXX"

list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#set up an empty dictionary
resultdict = {}

for fili in list_files:
    #the with avoids the extra step of closing the file
    with open(os.path.join(path_to_json, fili), "r") as inputjson:
        #the dictionary key is set to filename here, but you could also use e.g. a counter
        resultdict[fili] = json.load(inputjson)
        """
        you can exclude stuff here or later via dictionary comprehensions:
        http://stackoverflow.com/questions/1747817/create-a-dictionary-with-list-comprehension-in-python
        e.g. as in your example code
        resultdict[fili] = {k:v for k,v in json.load(inputjson).items() if not k.startswith("result")}
        """

#put the whole thing into the DataFrame
dataframe = pandas.DataFrame(resultdict)

#write out, transpose for desired format
with open("output.csv", "w") as csvout:
    dataframe.T.to_csv(csvout)
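For illustration, here is how the dictionary-comprehension filter mentioned in the comment behaves, with two made-up in-memory records standing in for `json.load()` results (the filenames and values are invented, but the keys mirror the sample in the question):

```python
import pandas

# Two fake "files" standing in for json.load() results.
records = {
    "study1.json": {"study_type": "Observational", "result.count": "5"},
    "study2.json": {"study_type": "Interventional", "start_date": "January 2014"},
}

resultdict = {}
for name, data in records.items():
    # Same dictionary comprehension as in the comment: drop "result*" keys.
    resultdict[name] = {k: v for k, v in data.items() if not k.startswith("result")}

# Outer keys become columns; transpose so each file becomes a row.
dataframe = pandas.DataFrame(resultdict)
print(dataframe.T)
```

Attributes missing from a given file simply come out as NaN in that file's row, which matches the "50 is a subset of 1000" situation in the question.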
The most important performance problem is probably this:
database = pd.concat(df_list, axis=1).reset_index()
You do this inside the loop, appending one more Series to df_list each time and then concatenating the whole thing again. But the database variable is not used until the end, so you can do this step just once, outside the loop.
With Pandas, concat in a loop is a huge anti-pattern. Build your list in the loop, concat once.
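A minimal sketch of that pattern, with made-up in-memory Series standing in for the per-file data:

```python
import pandas as pd

series_list = []
for i in range(3):
    # Build each piece inside the loop...
    series_list.append(pd.Series({"a": i, "b": i * 2}, name=f"file{i}.json"))

# ...but concatenate exactly once, after the loop.
database = pd.concat(series_list, axis=1)
print(database.T)
```

Concatenating once over the full list does a single allocation, instead of rebuilding an ever-growing frame on every iteration.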
Second, you should use Pandas to read the JSON files: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Keep it simple. Write a function that takes a path, calls pd.read_json(), removes the rows you don't want (e.g. with series.index.str.startswith()), etc.
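A sketch of such a function, assuming flat JSON objects like the sample in the question (the helper name load_study and the temp-file demonstration are just illustrations, not part of the original code):

```python
import json
import tempfile

import pandas as pd

def load_study(path):
    # Hypothetical helper: read one flat JSON object into a Series.
    series = pd.read_json(path, typ="series")
    # Drop every attribute whose name starts with "result".
    return series[~series.index.str.startswith("result")]

# Quick demonstration with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"study_type": "Observational", "result.count": "5"}, tmp)

series = load_study(tmp.name)
print(series)
```

The returned Series can then go straight into the list that gets concatenated once at the end.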
Once you have this working well, your next step is to check whether you are CPU-bound (CPU usage at 100%) or I/O-bound (CPU usage well below 100%).