Randomly select rows from file based on time in columns
It's a little tricky and I really appreciate any help! I am trying to randomly sample lines from a CSV file. Basically, I want to get the resulting unique locations file (locations are specified by columns Easting
and Northing
data file, below). I want to randomly get 1st place in the 12-hour period for SessionDate
the file (12-hour period, divided into: between 0631
and 1829
hours and between 1830
and 0630
hours, taking into account both Start:
, and End:
in the data file below); BUT if any 2 locations are within 6 hours of each other (depending on their timeStart:
), in order for this location to be selected, and a new location to be randomly drawn, and for this sampling to continue until new locations are drawn (i.e., sampling without replacement). I am trying to do this with python, but my experience is very limited. I've tried putting each line into a dictionary first, and recently each line into a list:
import random
import csv
f = open('file.csv', "U")
list = []
for line in f:
list.append(line.split(','))
I'm not sure where to go from here - how to sample from these lists the way I need to and then write them to the output file with my "unique" locations.
Here are the first few lines of my data file:
SessionDate Start: End: Easting Northing
27-Apr-07 18:00 21:45 174739 9785206
28-Apr-07 18:00 21:30 171984 9784738
28-Apr-07 18:00 21:30 171984 9784738
28-Apr-07 18:00 21:30 171984 9784738
28-Apr-07 18:00 21:30 171984 9784738
It gets a little more complicated as some of the observations span midnight, so they might be on different dates, but might be within 6 hours of each other (which is why I have this criterion), for example:
SessionDate Start: End: Easting Northing
27-Apr-07 22:30 23:25 171984 9784738
28-Apr-07 0:25 1:30 174739 9785206
source to share
Here's my solution - I made a few changes to your data (location to make it easier to see the results). I am basically creating dict
dates pointing to other dict
locations that point to a list of selected rows.
data = """SessionDate Start: End: Easting Northing
27-Apr-07 18:00 21:45 A 1
27-Apr-07 18:00 21:30 G 2
28-Apr-07 18:00 21:30 B 2
28-Apr-07 18:00 21:30 B 2
28-Apr-07 18:00 21:30 B 2
29-Apr-07 8:00 11:30 C 3
29-Apr-07 20:00 21:30 C 3
29-Apr-07 20:00 21:30 C 3
30-Apr-07 8:00 10:30 D 4
30-Apr-07 16:00 17:30 E 5
30-Apr-07 14:00 21:30 F 6
30-Apr-07 18:00 21:30 F 6
"""
selected = {}
for line in data.split("\n"):
if "Session" in line:
continue
if not line:
continue
tmp = [x for x in line.split() if x]
raw_dt = " ".join([tmp[0], tmp[1]]).strip()
curr_dt = datetime.strptime(raw_dt, "%d-%b-%y %H:%M")
loc = (tmp[-2], tmp[-1])
found = False
for dt in selected:
diff = dt - curr_dt
if dt < curr_dt:
diff = curr_dt - dt
# print dt, curr_dt, diff, diff <= timedelta(hours=12), loc, loc in selected[dt]
if diff <= timedelta(hours=12):
if loc not in selected[dt]:
selected[dt].setdefault(loc, []).append(tmp)
found = True
else:
found = True
if not found:
if curr_dt not in selected:
selected[curr_dt] = {}
if loc not in selected[curr_dt]:
selected[curr_dt][loc] = [tmp,]
# if output needs to be sorted
rows = sorted(x for k in selected for l in selected[k] for x in selected[k][l])
for row in rows:
print " ".join(row)
source to share
This is not a complete answer, but something that points you in the right direction.
As I said in a comment, the handling of datetime objects in python is done using the datetime module . Here's a small example related to your problem:
from datetime import datetime
d1 = datetime.strptime("27-Apr-07 18:00", "%d-%b-%y %H:%M")
d2 = datetime.strptime("28-Apr-07 01:00", "%d-%b-%y %H:%M")
difference = d2 - d1
#Difference in hours
dH = difference.days*24 + difference.seconds/3600
Other than that, just scroll through the sorted file, after reading the entire 12H block, ramdomly sample, make sure your unique condition is met (if not repeated) and go.
source to share