Pandas get previous row of data by date

I am working with some data where I need to get the date of a previous occurrence. For example, suppose we are working with medical data. Each row represents a unique visit for a patient, although the same patient can have multiple rows. Each row also contains information about the type of visit: whether it was a regular or an emergency-room visit.

I need to go through and, for each row, get the date the patient was last discharged from the emergency room prior to that visit. For example, I would like to add the previous_er_discharge_date column as shown below:

visit_id    patient_id    discharge_date    visit_type    previous_er_discharge_date
1           abc           2014-05-05        in-patient    2014-05-01
2           abc           2014-05-01        emergency     NaT
3           def           2014-04-18        in-patient    NaT
4           def           2014-03-12        in-patient    2014-02-12
5           def           2014-02-12        emergency     NaT


So, I have something that works, but it is very slow. I basically create a separate dataframe containing only the ER visits, then iterate through the main dataframe, checking for each row whether that patient has any earlier ER dates; if so, I take the first one. (The data is sorted by discharge_date.) Here is the general shape of the code I have:

import pandas as pd

# Keep only the emergency-room visits (main_data is sorted by discharge_date).
er_visits = main_data[main_data.visit_type == 'emergency']

prev_dates = []
for index, row in main_data.iterrows():
    # All ER discharge dates for this patient that precede the current visit.
    dates = er_visits.discharge_date[(er_visits.patient_id == row.patient_id) &
                                     (er_visits.discharge_date < row.discharge_date)].values

    if len(dates) > 0:
        prev_dates.append(dates[0])
    else:
        prev_dates.append(pd.NaT)

main_data['previous_er_discharge_date'] = prev_dates

The above code works, but it is slow and I was hoping to get help finding faster ways to do this. The dataset I am working with contains several hundred thousand rows and this is something that needs to be done every day.

Thanks!





2 answers


In pandas, you basically want to avoid loops as they kill performance.

Here is a DataFrame similar to yours (I was lazy about the dates, so they are ints; it's the same idea):

df = pd.DataFrame({
    'id': ['abc', 'abc', 'def', 'def', 'def'],
    'date': [505, 501, 418, 312, 212]})


And here's a function that adds the previous date for each group:

def prev_dates(g):
    # Most recent date first, so that shift(-1) pulls the previous date up.
    g = g.sort_values(by='date', ascending=False)
    g['prev'] = g.date.shift(-1)
    return g



So all it takes is to put things together:

>>> df.groupby(df.id).apply(prev_dates)
   date   id  prev
0   505  abc   501
1   501  abc   NaN
2   418  def   312
3   312  def   212
4   212  def   NaN

Edit

As noted below by @julius, `sort(columns=...)` (used in the original version of this answer) has since been deprecated and should be replaced with `sort_values(by=...)`.
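For the original question, where the previous date must come from emergency visits only, another vectorized route is `pd.merge_asof` (available since pandas 0.19). This is a sketch using the question's column names; it returns, for each visit, the most recent ER discharge date strictly before that visit for the same patient, which is what the loop in the question computes:

```python
import pandas as pd

main_data = pd.DataFrame({
    'visit_id': [1, 2, 3, 4, 5],
    'patient_id': ['abc', 'abc', 'def', 'def', 'def'],
    'discharge_date': pd.to_datetime(['2014-05-05', '2014-05-01',
                                      '2014-04-18', '2014-03-12', '2014-02-12']),
    'visit_type': ['in-patient', 'emergency', 'in-patient', 'in-patient', 'emergency'],
})

# merge_asof requires both frames to be sorted by the merge keys.
er_visits = (main_data[main_data.visit_type == 'emergency']
             [['patient_id', 'discharge_date']]
             .rename(columns={'discharge_date': 'previous_er_discharge_date'})
             .sort_values('previous_er_discharge_date'))

result = pd.merge_asof(
    main_data.sort_values('discharge_date'),
    er_visits,
    left_on='discharge_date',
    right_on='previous_er_discharge_date',
    by='patient_id',                # match within each patient
    allow_exact_matches=False,      # strictly earlier than the current visit
).sort_values('visit_id')
```

This avoids both the Python loop and the per-group `apply`, so it scales well to hundreds of thousands of rows.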





What if you need to find all visits for this patient?

df = df.sort_values(by=['ID', 'Date'])
df['next_patient'] = df['ID'].shift(-1)
df['next_visit'] = np.where(df['ID'] == df['next_patient'], 1, 0)
df['next_date'] = np.where(df['next_visit'] == 1, df['Date'].shift(-1), pd.NaT)



That is my approach (typed on my phone, so it may not be exact). I sort, then shift the patient ID down by one. If the shifted ID matches the row's ID, I shift the date as well. Then I create a column measuring the time between visits, and, with one more shift, another column identifying the reason for the next visit.

This also turns out to be a fast approach: I run it roughly weekly on a dataset of about 5 million rows.
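A runnable sketch of this shift-based recipe on the question's data (column names taken from the question; the `gap` column is the "time between visits" described above, and `prev_type` is the extra shifted column for the visit reason):

```python
import pandas as pd

main_data = pd.DataFrame({
    'patient_id': ['abc', 'abc', 'def', 'def', 'def'],
    'discharge_date': pd.to_datetime(['2014-05-05', '2014-05-01',
                                      '2014-04-18', '2014-03-12', '2014-02-12']),
    'visit_type': ['in-patient', 'emergency', 'in-patient', 'in-patient', 'emergency'],
})

# Most recent visit first within each patient, so shift(-1) reads the prior visit.
df = main_data.sort_values(by=['patient_id', 'discharge_date'],
                           ascending=[True, False])

same_patient = df['patient_id'] == df['patient_id'].shift(-1)

# Date and type of the immediately preceding visit, NaT/NaN at patient boundaries.
df['prev_date'] = df['discharge_date'].shift(-1).where(same_patient)
df['prev_type'] = df['visit_type'].shift(-1).where(same_patient)

# Time elapsed since that previous visit.
df['gap'] = df['discharge_date'] - df['prev_date']
```

Note this gives the immediately preceding visit of any type; to restrict it to ER visits you would still need an extra filtering step.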









