Python: minimum mean distance

I have a set of Users Latitude and Longitude and a set of latitude latitudes at an office location.

I need to find an office location that has a minimum average distance to all users.

What is the efficient way to do this in python? I have 3k users and 40k office spaces ...

For example:

Input: User 1 (x1, y1)
User 2 (x2, y2)
Office 1 (x3, y3)
Office 2 (x4, y4)

Then I need to figure out the location of the office that has a minimum distance from all users.

Office 1 is 200 m from user 1 and 400 m from user 2. Average distance from all users = 300 m

Office 2 is 100 m from user 1 and 200 m from user 2. Average distance from all users = 150 m

Office 2 is the place of choice.

+3


source to share


2 answers


Here's an example using geodjango parts of django. You can do the same using shapely with pyproj . (This can be a little painful to set up, but once you have all the settings, it is pretty trivial to work.)



from django.contrib.gis.geos import Point, MultiPoint

WGS84_SRID = 4326
office1 = Point(x1, y1, srid=WGS84_SRID )
office2 = Point(x2, y1, srid=WGS84_SRID )

# load user locations
user_locations = []
with open("user_locations.csv", "r") as in_f:
    # assuming wgs84 decimal degrees 
    # one location per line in format, 'lon, lat'
    for line in in_f:
        x, y = [float(i.strip()) for i in line.split(",")]
        user_locations.append(Point(x, y, srid=WGS84_SRID ))

# get points in a meters projection
GOOGLE_MAPS_SRID = 3857
office1_meters = office1.transform(GOOGLE_MAPS_SRID, clone=True)
office2_meters = office2.transform(GOOGLE_MAPS_SRID, clone=True)
user_locations_meters = [user_loc.transform(GOOGLE_MAPS_SRID, clone=True) for user_loc in user_locations]

# centroid method
mp = MultiPoint(user_locations, srid=4326)
centroid_distance_from_office1 = mp.centroid.distance(office1_meters)
centroid_distance_from_office2 = mp.centroid.distance(office1_meters)

print "Centroid Location: {}".format(mp.centroid.ewkt)
print("centroid_distance_from_office1: {}m".format(centroid_distance_from_office1)
print("centroid_distance_from_office2: {}m".format(centroid_distance_from_office2)

# average distance method
total_user_locations = float(len(user_locations))
office1_user_avg_distance = sum( user_loc.distance(office1_meters) for user_loc in user_locations_meters)/total_user_locations 
office2_user_avg_distance = sum( user_loc.distance(office2_meters) for user_loc in user_locations_meters)/total_user_locations 

print "avg user distance OFFICE-1: {}".format(office1_user_avg_distance)
print "avg user distance OFFICE-2: {}".format(office2_user_avg_distance)

      

+1


source


Basically the code implementing the algorithm at http://en.wikipedia.org/wiki/Geometric_median#Computation and gives you a usage example based on a set of random points.

NB: this is for points in a plane, because I cannot work out how to sum two spherical coordinates ... so you need to direct the spherical coordinates with a planar projection in advance, but this point has already been touched on the previous answer.

code

from math import sqrt
from random import seed, uniform
from operator import add
seed(100)

class Point():
    """Very basic point class, supporting "+", scalar "/" and distances."""
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return "("+repr(self.x)+","+repr(self.y)+")"
    def __add__(self, P):
        return Point(self.x+P.x, self.y+P.y)
    def __div__(self, scalar):
        return Point(self.x/float(scalar), self.y/float(scalar))
    def delta(self, P):
        dx = self.x - P.x
        dy = self.y - P.y
        return sqrt(dx*dx+dy*dy)

def iterate(GM,points):
    "Simple implementation of http://en.wikipedia.org/wiki/Geometric_median#Computation"
    # distances from the tentative GM
    distances = [GM.delta(p) for p in points]
    normalized_positions = [p/d for p,d in zip(points,distances)]
    normalization_factor = sum(1.0/d for d in distances)
    new_median = reduce(add, normalized_positions)/normalization_factor
    return new_median

# The "clients"
nclients = 10
points = [Point(uniform(-3,3),uniform(-3,3)) for i in range(nclients)]

# Centroid of clients and total of distances
centroid = reduce(add,points)/nclients
print "Position of centroid:",centroid
print "Sum of distances from centroid:",
print reduce(add,[centroid.delta(p) for p in points])


print
print "Compute the Geometric Median using random starting points:"
nstart = 10
for P0 in [Point(uniform(-5,5),uniform(-5,5)) for i in range(nstart)]:
    p0 = P0
    for i in range(10):
        P0 = iterate(P0, points)
    print p0,"-->",P0

print
print "Sum of distances from last Geometric Median:",
print reduce(add,[P0.delta(p) for p in points])

      

Output



Position of centroid: (-0.4647467432024398,0.08675910209912471)
Sum of distances from centroid: 22.846445119

Compute the Geometric Median using random starting points:
(1.2632163919279735,4.633157837008632) --> (-0.8739691868669638,-0.019827884361901298)
(-2.8916600791314986,4.561006461166512) --> (-0.8929310891388812,-0.025857080003665663)
(0.5539966580106901,4.011520429873922) --> (-0.8764828849474395,-0.020607834485528134)
(3.1801819335743033,-3.395781900250662) --> (-0.8550062003820846,-0.014134334529992666)
(1.48542908120573,-3.7590671941155627) --> (-0.8687797019011291,-0.018241177226221747)
(-4.943549141082007,-1.044838193982506) --> (-0.9066276248482427,-0.030440865315529194)
(2.73500702168781,0.6615770729288597) --> (-0.8231318436739281,-0.005320464433689587)
(-3.073593440129266,3.411747144619733) --> (-0.8952513352350909,-0.026600471220747438)
(4.137768422492282,-2.6277493707729596) --> (-0.8471586848200597,-0.011875801531868494)
(-0.5180751681772549,1.377998063140823) --> (-0.8849056106235963,-0.02326386487180884)

Sum of distances from last Geometric Median: 22.7019120091

      

my own comment

In this case, the locations (centroid vs GM) are completely different, but the results are similar. I expect significant differences in both locations and average distances when you have some sort of clustering around a point (city) or some function is talking about a line (road) etc.

In the end, you can speed up the work with help numpy

, I avoided doing it numpy

due to limited time resources :)

+1


source







All Articles