If condition on a grouped Pandas DataFrame

I have a dataset that contains columns for Year, Julian Day, Hour and Temperature. I have grouped the data by year and day, and now I want to perform an operation on the temperature data only if a day contains all 24 hours of data. Then I want to create a DataFrame with year, Julian day, maximum temperature and minimum temperature. However, I'm not sure about the syntax to make sure this condition is met. Any help would be greatly appreciated. My code is below:

df = pd.read_table(data,skiprows=1,sep='\t',usecols=(0,3,4,6),names=['year','jday','hour','temp'],na_values=-999.9)
g = df.groupby(['year','jday'])

if #the grouped year and day has 24 hours worth of data
    maxt = g.aggregate({'temp':np.max})
    mint = g.aggregate({'temp':np.min})
else:
    continue

      

And some sample data (goes from 1942-2015):

Year    Month   Day Julian  Hour    Wind    TempC   DewC    Pressure    RH
1942    9   24  267 9   2.1 18.5    15.2    1014.2  81.0
1942    9   24  267 10  2.1 23.5    14.6    1014.6  57.0
1942    9   24  267 11  3.6 25.2    12.4    1014.2  45.0
1942    9   24  267 12  3.6 26.8    11.9    1014.2  40.0
1942    9   24  267 13  2.6 27.4    11.9    1014.2  38.0
1942    9   24  267 14  2.1 28.0    11.3    1013.5  35.0
1942    9   24  267 15  4.1 29.1    9.1 1013.5  29.0
1942    9   24  267 16  4.1 29.1    10.7    1013.5  32.0
1942    9   24  267 17  4.6 29.1    13.0    1013.9  37.0
1942    9   24  267 18  3.6 25.7    12.4    1015.2  44.0
1942    9   24  267 19  0.0 23.0    16.3    1015.2  66.0
1942    9   24  267 20  2.6 22.4    15.7    1015.9  66.0
1942    9   24  267 21  2.1 20.2    16.3    1016.3  78.0
1942    9   24  267 22  3.1 20.2    14.6    1016.9  70.0
1942    9   24  267 23  2.6 19.6    15.2    1017.6  76.0
1942    9   25  268 0   3.1 18.5    13.5    1018.3  73.0
1942    9   25  268 1   2.6 16.9    13.0    1018.3  78.0
1942    9   25  268 2   4.1 15.7    5.2 1021.0  50.0
1942    9   25  268 3   4.1 15.2    4.1 1020.7  47.0
1942    9   25  268 4   3.1 14.1    5.8 1021.3  57.0
1942    9   25  268 5   3.1 13.0    5.8 1021.3  62.0
1942    9   25  268 6   2.1 13.0    5.2 1022.4  59.0
1942    9   25  268 7   2.1 12.4    1.9 1022.4  49.0
1942    9   25  268 8   3.6 13.5    5.8 1024.7  60.0
1942    9   25  268 9   4.6 15.7    3.5 1025.1  44.0
1942    9   25  268 10  4.1 17.4    1.3 1025.4  34.0
1942    9   25  268 11  2.6 18.5    3.0 1025.4  36.0
1942    9   25  268 12  2.1 19.1    0.8 1025.1  29.0
1942    9   25  268 13  2.6 19.6    2.4 1024.7  32.0
1942    9   25  268 14  4.1 20.7    4.6 1023.4  35.0
1942    9   25  268 15  3.6 21.3    4.1 1023.7  32.0
1942    9   25  268 16  1.5 21.3    4.6 1023.4  34.0
1942    9   25  268 17  5.1 20.7    7.4 1023.4  42.0
1942    9   25  268 18  5.1 19.1    8.5 1023.0  50.0
1942    9   25  268 19  3.6 18.0    9.6 1022.7  58.0
1942    9   25  268 20  3.1 16.3    9.6 1023.0  65.0
1942    9   25  268 21  1.5 15.2    11.3    1023.0  78.0
1942    9   25  268 22  1.5 14.6    11.3    1023.0  81.0
1942    9   25  268 23  2.1 14.1    10.7    1024.0  80.0

      



2 answers


I am assuming that no ['year', 'jday'] group contains non-integer hours, so we can just use the number of unique hours in the group as the condition.

import pandas as pd

def get_min_max_by_date(df_group):
    # Skip any day that does not have 24 distinct hours of data.
    if df_group['hour'].nunique() < 24:
        new_df = pd.DataFrame()
    else:
        # Every row in the group shares the same year and Julian day.
        year = df_group['year'].unique()[0]
        j_day = df_group['jday'].unique()[0]
        min_temp = df_group['temp'].min()
        max_temp = df_group['temp'].max()
        # One summary row per complete day.
        new_df = pd.DataFrame({'year': [year],
                               'julian_day': [j_day],
                               'min_temp': [min_temp],
                               'max_temp': [max_temp]}, index=[0])
    return new_df

df = pd.read_table(data,
                   skiprows=1,
                   sep='\t',
                   usecols=(0, 3, 4, 6),
                   names=['year', 'jday', 'hour', 'temp'],
                   na_values=-999.9)

final_df = df.groupby(['year', 'jday'],
                      as_index=False).apply(get_min_max_by_date)
# Drop the per-group index left over from apply instead of keeping it as columns.
final_df = final_df.reset_index(drop=True)

      



I don't have time to test this right now, but this should get you started.
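The same condition can also be written without a per-group function. The following is only a sketch under a few assumptions: the column names match the `names=` list from the question, a "complete" day means 24 distinct values in `hour`, and the helper name `daily_min_max` and the small `sample` frame are made up for illustration.

```python
import pandas as pd

def daily_min_max(df):
    """Min/max temperature per (year, jday), keeping only days with 24 distinct hours."""
    # Number of distinct hours in each (year, jday) group, broadcast back to every row.
    hours_per_day = df.groupby(['year', 'jday'])['hour'].transform('nunique')
    full_days = df[hours_per_day == 24]
    # Named aggregation gives one row per complete day.
    return (full_days.groupby(['year', 'jday'])['temp']
            .agg(min_temp='min', max_temp='max')
            .reset_index())

# Hypothetical data: day 267 has only 3 hours, day 268 has all 24.
sample = pd.DataFrame({
    'year': [1942] * 27,
    'jday': [267] * 3 + [268] * 24,
    'hour': [21, 22, 23] + list(range(24)),
    'temp': [20.2, 20.2, 19.6] + [float(h) for h in range(24)],
})
result = daily_min_max(sample)
print(result)
```

Here the incomplete day 267 is dropped and only day 268 appears in `result`. Note that missing temperatures (the `-999.9` values read as NaN) are ignored by `min` and `max`, so a day with 24 hour rows but some NaN temperatures still counts as complete in this sketch.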



I would start by grouping by day only and then iterate through the groups, checking the unique hours in each group. You can use set to find the unique hours for each day of measurements and compare it with the hours of a full day, {0, 1, 2, 3, ..., 23}.

a_full_day = set(range(24))
#data_out  = {}
gb         = df.groupby(['jday']) # only group by day

# gb.groups maps each day to the index labels of its rows
for day, inds in gb.groups.items():
    if set(df.loc[inds, 'hour']) == a_full_day:
        maxt = df.loc[inds, 'temp'].max()
        #data_out[day] = {}
        #data_out[day]['maxt'] = maxt
        # etc

      



I have added some commented-out lines suggesting how you could save the output.
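To show how the loop might be carried through to the DataFrame the question asks for, here is a rough sketch. It groups by both `year` and `jday`, since the data spans 1942-2015 and the same Julian day occurs in every year. The function name `full_day_extremes`, the output column names `maxt` and `mint`, and the `sample` frame are assumptions for illustration only.

```python
import pandas as pd

A_FULL_DAY = set(range(24))

def full_day_extremes(df):
    """Collect max/min temperature for every (year, jday) whose hours are exactly 0..23."""
    rows = []
    for (year, day), group in df.groupby(['year', 'jday']):
        # Compare the set of hours present with the hours of a full day.
        if set(group['hour']) == A_FULL_DAY:
            rows.append({'year': year, 'jday': day,
                         'maxt': group['temp'].max(),
                         'mint': group['temp'].min()})
    # Passing columns keeps the layout stable even if no day qualifies.
    return pd.DataFrame(rows, columns=['year', 'jday', 'maxt', 'mint'])

# Hypothetical data: the same Julian day in two years, only one of them complete.
sample = pd.DataFrame({
    'year': [1942] * 2 + [1943] * 24,
    'jday': [268] * 26,
    'hour': [0, 1] + list(range(24)),
    'temp': [18.5, 16.9] + [10.0 + h for h in range(24)],
})
out = full_day_extremes(sample)
print(out)
```

Grouping by `jday` alone would merge the two years in `sample` into one group that looks complete, which is why the year is part of the key here.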
