Randomly select rows from file based on time in columns

Question

Randomly select rows from file based on time in columns

It's a little tricky and I really appreciate any help! I am trying to randomly sample lines from a CSV file. Basically, I want to get the resulting unique locations file (locations are specified by columns Easting

and Northing

data file, below). I want to randomly get 1st place in the 12-hour period for SessionDate

the file (12-hour period, divided into: between 0631

and 1829

hours and between 1830

and 0630

hours, taking into account both Start:

, and End:

in the data file below); BUT if any 2 locations are within 6 hours of each other (depending on their timeStart:

), in order for this location to be selected, and a new location to be randomly drawn, and for this sampling to continue until new locations are drawn (i.e., sampling without replacement). I am trying to do this with python, but my experience is very limited. I've tried putting each line into a dictionary first, and recently each line into a list:

import random
import csv

f = open('file.csv', "U")
list = []

for line in f:
    list.append(line.split(','))

I'm not sure where to go from here - how to sample from these lists the way I need to and then write them to the output file with my "unique" locations.

Here are the first few lines of my data file:

SessionDate Start:  End:    Easting Northing
27-Apr-07   18:00   21:45   174739  9785206
28-Apr-07   18:00   21:30   171984  9784738
28-Apr-07   18:00   21:30   171984  9784738
28-Apr-07   18:00   21:30   171984  9784738
28-Apr-07   18:00   21:30   171984  9784738

It gets a little more complicated as some of the observations span midnight, so they might be on different dates, but might be within 6 hours of each other (which is why I have this criterion), for example:

SessionDate Start:  End:    Easting Northing
27-Apr-07   22:30   23:25   171984  9784738
28-Apr-07   0:25    1:30    174739  9785206

+3

python random csv sampling

kjc May 11 '15 at 22:40

source to share

2 answers

vikramls · Answer 1 · 2015-05-12T17:22:31+0000

Here's my solution - I made a few changes to your data (location to make it easier to see the results). I am basically creating dict

dates pointing to other dict

locations that point to a list of selected rows.

data  = """SessionDate Start:  End:    Easting Northing
27-Apr-07   18:00   21:45   A  1
27-Apr-07   18:00   21:30   G  2
28-Apr-07   18:00   21:30   B  2
28-Apr-07   18:00   21:30   B  2
28-Apr-07   18:00   21:30   B  2
29-Apr-07   8:00    11:30   C 3
29-Apr-07   20:00   21:30   C  3
29-Apr-07   20:00   21:30   C  3
30-Apr-07   8:00   10:30   D  4
30-Apr-07   16:00   17:30   E  5
30-Apr-07   14:00   21:30   F  6
30-Apr-07   18:00   21:30   F  6
"""

selected = {}
for line in data.split("\n"):
    if "Session" in line:
        continue
    if not line:
        continue

    tmp = [x for x in line.split() if x]
    raw_dt = " ".join([tmp[0], tmp[1]]).strip()
    curr_dt = datetime.strptime(raw_dt, "%d-%b-%y %H:%M")
    loc = (tmp[-2], tmp[-1])

    found = False
    for dt in selected:
        diff = dt - curr_dt
        if dt < curr_dt:
            diff = curr_dt - dt
        # print dt, curr_dt, diff, diff <= timedelta(hours=12), loc, loc in selected[dt]            
        if diff <= timedelta(hours=12):
            if loc not in selected[dt]:
                selected[dt].setdefault(loc, []).append(tmp)
                found = True
            else:
                found = True
    if not found:
        if curr_dt not in selected:
            selected[curr_dt] = {}
        if loc not in selected[curr_dt]:
            selected[curr_dt][loc] = [tmp,]

# if output needs to be sorted
rows = sorted(x for k in selected for l in selected[k] for x in selected[k][l])
for row in rows:
    print " ".join(row)

jorgeh · Answer 2 · 2015-05-11T23:54:52+0000

This is not a complete answer, but something that points you in the right direction.

As I said in a comment, the handling of datetime objects in python is done using the datetime module . Here's a small example related to your problem:

from datetime import datetime

d1 = datetime.strptime("27-Apr-07 18:00", "%d-%b-%y %H:%M")
d2 = datetime.strptime("28-Apr-07 01:00", "%d-%b-%y %H:%M")

difference = d2 - d1

#Difference in hours
dH = difference.days*24 + difference.seconds/3600

Other than that, just scroll through the sorted file, after reading the entire 12H block, ramdomly sample, make sure your unique condition is met (if not repeated) and go.

Randomly select rows from file based on time in columns

More articles: