Pandas get previous row of data by date
I am working with some data where I need to get the date of a previous occurrence. For example, let's say we are working with medical data. Each row represents a unique visit for a patient, though the same patient can have multiple rows. Each row also contains the type of visit, e.g. in-patient or emergency room.
I need to go through and, for each row, get the date of the patient's most recent emergency-room discharge prior to that visit. For example, I would like to add a previous_er_discharge_date column as shown below:
visit_id  patient_id  discharge_date  visit_type  previous_er_discharge_date
1         abc         2014-05-05      in-patient  2014-05-01
2         abc         2014-05-01      emergency   NaT
3         def         2014-04-18      in-patient  NaT
4         def         2014-03-12      in-patient  2014-02-12
5         def         2014-02-12      emergency   NaT
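For reproducibility, here is the sample data above (minus the column I want to add) as a DataFrame:

```python
import pandas as pd

# the sample visits from the table, without the target column
main_data = pd.DataFrame({
    'visit_id': [1, 2, 3, 4, 5],
    'patient_id': ['abc', 'abc', 'def', 'def', 'def'],
    'discharge_date': pd.to_datetime(
        ['2014-05-05', '2014-05-01', '2014-04-18', '2014-03-12', '2014-02-12']),
    'visit_type': ['in-patient', 'emergency', 'in-patient', 'in-patient', 'emergency'],
})
```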
So, I have something that works, but it is very slow. I basically create a separate dataframe of only the ER visits, then iterate through the main dataframe, check whether there are earlier ER dates for that patient, and if so take the first one. (The data is sorted by discharge date, most recent first.) The general shape of the code I have:
er_visits = main_data[main_data.visit_type == 'emergency']
prev_dates = []
for index, row in main_data.iterrows():
    # ER discharges for this patient strictly before the current visit;
    # rows are sorted newest-first, so the first hit is the most recent
    dates = er_visits.discharge_date[(er_visits.patient_id == row.patient_id) &
                                     (er_visits.discharge_date < row.discharge_date)].values
    if len(dates) > 0:
        prev_dates.append(dates[0])
    else:
        prev_dates.append(pd.NaT)
The above code works, but it is slow and I was hoping to get help finding faster ways to do this. The dataset I am working with contains several hundred thousand rows and this is something that needs to be done every day.
Thanks!
In pandas, you generally want to avoid loops, as they kill performance.
Here is a DataFrame similar to yours (I was lazy with the dates, so they are ints; the idea is the same).
df = pd.DataFrame({
    'id': ['abc', 'abc', 'def', 'def', 'def'],
    'date': [505, 501, 418, 312, 212]})
And here's a function that adds the previous date for each group:
def prev_dates(g):
    # sort newest-first and assign back (sort is not in place),
    # so that shift(-1) pulls each row's next-earlier date
    g = g.sort(columns='date', ascending=False)
    g['prev'] = g.date.shift(-1)
    return g
So all it takes is to put things together:
>>> df.groupby(df.id).apply(prev_dates)
   date   id  prev
0   505  abc   501
1   501  abc   NaN
2   418  def   312
3   312  def   212
4   212  def   NaN
Edit
As @julius noted below, sort(columns=...) has since been deprecated and should be replaced with sort_values(by=...).
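With that substitution, a version of the same per-group shift idea that runs on current pandas might look like this (group_keys=False is an assumption of mine, used only to keep the original row index):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['abc', 'abc', 'def', 'def', 'def'],
    'date': [505, 501, 418, 312, 212]})

def prev_dates(g):
    # newest-first so that shift(-1) pulls the next-earlier date
    g = g.sort_values(by='date', ascending=False)
    g['prev'] = g.date.shift(-1)
    return g

out = df.groupby('id', group_keys=False).apply(prev_dates)
print(out)
```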
What if you need to link each visit to the patient's next one?
df = df.sort_values(['Date', 'ID'])
df['nextpatient'] = df['ID'].shift(-1)
df['nextvisit'] = np.where(df['ID'] == df['nextpatient'], 1, 0)
df['nextdate'] = np.where(df['nextvisit'] == 1, df['Date'].shift(-1), pd.NaT)
That is my approach (typed on my phone, so it may not be exact). I sort, then shift the patient ID; if the shifted ID matches the row's own ID, I shift the date as well. From there I create a column measuring the time between visits, and another (with one more shift) identifying the reason for the next visit.
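A runnable version of that sketch might look like the following. Column names and sample data are assumptions; I sort by ID first so a patient's visits sit on adjacent rows, and use Series.where instead of np.where so the date column keeps its datetime dtype:

```python
import numpy as np
import pandas as pd

# hypothetical sample: two patients, visits in arbitrary order
df = pd.DataFrame({
    'ID': ['abc', 'def', 'abc'],
    'Date': pd.to_datetime(['2014-05-01', '2014-04-18', '2014-05-05']),
})

# ID first, so each patient's visits are adjacent and in date order
df = df.sort_values(['ID', 'Date']).reset_index(drop=True)
df['nextpatient'] = df['ID'].shift(-1)
df['nextvisit'] = np.where(df['ID'] == df['nextpatient'], 1, 0)
# only keep the shifted date when the next row is the same patient
df['nextdate'] = df['Date'].shift(-1).where(df['nextvisit'] == 1)
```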
Interestingly, this is also a good approach in terms of speed. I run it roughly weekly on a dataset of 5 million rows.
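In the same no-loops spirit, the original question's lookup (latest ER discharge strictly before each visit, per patient) can also be done in one shot with pd.merge_asof. This is a sketch of mine, not part of the answers above, using the question's sample data; it reproduces the strict "<" comparison from the question's loop:

```python
import pandas as pd

main_data = pd.DataFrame({
    'visit_id': [1, 2, 3, 4, 5],
    'patient_id': ['abc', 'abc', 'def', 'def', 'def'],
    'discharge_date': pd.to_datetime(
        ['2014-05-05', '2014-05-01', '2014-04-18', '2014-03-12', '2014-02-12']),
    'visit_type': ['in-patient', 'emergency', 'in-patient', 'in-patient', 'emergency'],
})

# the ER visits become the lookup table; merge_asof needs it sorted by the key
er = (main_data.loc[main_data.visit_type == 'emergency',
                    ['patient_id', 'discharge_date']]
      .rename(columns={'discharge_date': 'previous_er_discharge_date'})
      .sort_values('previous_er_discharge_date'))

# for each visit, take the latest ER discharge before it, matched per patient
result = pd.merge_asof(
    main_data.sort_values('discharge_date'),
    er,
    left_on='discharge_date',
    right_on='previous_er_discharge_date',
    by='patient_id',
    allow_exact_matches=False,   # strict "<", as in the question's loop
).sort_values('visit_id').reset_index(drop=True)
print(result)
```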