Comparing columns in two separate pandas frames
I have two data frames, both of which contain latitude and longitude columns. For every lat / lon entry in the first data frame, I want to evaluate every lat / lon pair in the second data frame to determine the distance.
For example:
df1:                    df2:
     lat     lon             lat     lon
0  38.32 -100.50        0  37.65  -97.87
1  42.51  -97.39        1  33.31  -96.40
2  33.45 -103.21        2  36.22 -100.01

distance between 38.32, -100.50 and 37.65, -97.87
distance between 38.32, -100.50 and 33.31, -96.40
distance between 38.32, -100.50 and 36.22, -100.01
distance between 42.51, -97.39 and 37.65, -97.87
distance between 42.51, -97.39 and 33.31, -96.40
... and so on ...
I'm not sure how to do this.
Thanks for the help!
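Euclidean distance is calculated as the square root of the sum of squared coordinate differences:
d = sqrt((lat1 - lat2)^2 + (lon1 - lon2)^2)
You can do this with your two data frames like so (note that this only compares rows with matching index labels, not every combination):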
((df1 - df2) ** 2).sum(1) ** .5
0 2.714001
1 9.253113
2 4.232363
dtype: float64
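Since the question asks for every row of df1 against every row of df2, here is a minimal sketch of the same Euclidean calculation extended to all pairs with NumPy broadcasting. The frames are rebuilt from the example data in the question, and the names a, b, diff and pairwise are only illustrative. Keep in mind that this treats degrees as flat coordinates, so the values are not real-world distances.

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

a = df1[["lat", "lon"]].to_numpy()   # shape (n, 2)
b = df2[["lat", "lon"]].to_numpy()   # shape (m, 2)

# Broadcast to shape (n, m, 2): every df1 row minus every df2 row.
diff = a[:, None, :] - b[None, :, :]

# Euclidean norm over the coordinate axis gives an (n, m) matrix.
pairwise = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=2)),
                        index=df1.index, columns=df2.index)
```

Rows of pairwise correspond to df1 and columns to df2, so the diagonal should reproduce the row-wise result shown above.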
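UPDATE: as @root noted, a Euclidean metric doesn't make sense for lat / lon coordinates, so use sklearn.neighbors.DistanceMetric with the haversine metric instead:
import numpy as np
import pandas as pd
# On scikit-learn 1.0+ DistanceMetric lives in sklearn.metrics
# (the sklearn.neighbors alias was removed in 1.3).
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')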
First we can build a DataFrame with all combinations (credit to @root):
x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)
Then a vectorized haversine distance calculation (the metric expects [lat, lon] in radians; 6367 is the Earth radius in km):
x['dist'] = np.ravel(dist.pairwise(np.radians(df1), np.radians(df2)) * 6367)
Result:
In [86]: x
Out[86]:
lat1 lon1 lat2 lon2 dist
0 38.32 -100.50 37.65 -97.87 242.073182
1 38.32 -100.50 33.31 -96.40 667.993048
2 38.32 -100.50 36.22 -100.01 237.350451
3 42.51 -97.39 37.65 -97.87 541.605087
4 42.51 -97.39 33.31 -96.40 1026.006744
5 42.51 -97.39 36.22 -100.01 734.219411
6 33.45 -103.21 37.65 -97.87 671.274044
7 33.45 -103.21 33.31 -96.40 632.004981
8 33.45 -103.21 36.22 -100.01 424.140594
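As a side note, the helper-column trick used above for the all-combinations frame can be replaced by a built-in cross join on pandas 1.2 or later. This is only a sketch using the example data from the question; the name pairs is illustrative.

```python
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# how="cross" (pandas >= 1.2) pairs every row of df1 with every row of df2,
# so no temporary key column is needed. The overlapping column names
# still receive the suffixes, giving lat1, lon1, lat2, lon2.
pairs = df1.merge(df2, how="cross", suffixes=("1", "2"))
```

The row order matches the merge on k=1: each df1 row is repeated once per df2 row.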
OLD answer:
IIUC, you can use scipy.spatial.distance.pdist to compute pairwise (Euclidean) distances:
In [32]: from scipy.spatial.distance import pdist
In [43]: from itertools import combinations
In [34]: X = pd.concat([df1, df2])
In [35]: X
Out[35]:
lat lon
0 38.32 -100.50
1 42.51 -97.39
2 33.45 -103.21
0 37.65 -97.87
1 33.31 -96.40
2 36.22 -100.01
As a pandas Series:
In [36]: s = pd.Series(pdist(X),
index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))
In [37]: s
Out[37]:
0 1 5.218065
2 5.573240
0 2.714001
1 6.473801
2 2.156409
1 2 10.768287
0 4.883646
1 9.253113
2 6.813846
2 0 6.793791
1 6.811439
2 4.232363
0 1 4.582194
2 2.573810
1 2 4.636831
dtype: float64
As a pandas DataFrame:
In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
df1 df2 dist
0 0 1 5.218065
1 0 2 5.573240
2 0 0 2.714001
3 0 1 6.473801
4 0 2 2.156409
5 1 2 10.768287
6 1 0 4.883646
7 1 1 9.253113
8 1 2 6.813846
9 2 0 6.793791
10 2 1 6.811439
11 2 2 4.232363
12 0 1 4.582194
13 0 2 2.573810
14 1 2 4.636831
You can do a cross join to get all lat / lon combinations and then calculate the distance using an appropriate measure. To do this, you can use the geopy package, which supplies geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances, with vincenty giving more accurate results but being slower to compute.
import pandas as pd
# Note: vincenty was removed in geopy 2.0; geopy.distance.geodesic
# is the replacement there.
from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])
    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)
I've used kilometers as units in this example, but others can be specified, for example:
vincenty(latlon1, latlon2).miles
Result:
lat1 lon1 lat2 lon2 distance
0 38.32 -100.50 37.65 -97.87 242.709065
1 38.32 -100.50 33.31 -96.40 667.878723
2 38.32 -100.50 36.22 -100.01 237.080141
3 42.51 -97.39 37.65 -97.87 541.184297
4 42.51 -97.39 33.31 -96.40 1024.839512
5 42.51 -97.39 36.22 -100.01 733.819732
6 33.45 -103.21 37.65 -97.87 671.766908
7 33.45 -103.21 33.31 -96.40 633.751134
8 33.45 -103.21 36.22 -100.01 424.335874
Edit
As @MaxU pointed out in the comments, you can use a numpy implementation of the Haversine formula in a similar way for added performance. This should be equivalent to the great_circle function in geopy.
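For reference, a NumPy version of the Haversine formula might look roughly like the sketch below. The function name haversine_km and the default radius of 6371 km (a commonly used mean Earth radius) are my own assumptions, not part of geopy or of @MaxU's comment, and the frames are rebuilt from the example data in the question.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; adjust if needed

def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between points given in degrees.

    Hypothetical helper: accepts scalars, arrays or pandas Series.
    """
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float))
        for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Example data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# Same cross-join trick as above, then one vectorized call
# instead of a row-by-row apply.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on="k",
                suffixes=("1", "2")).drop("k", axis=1)
dist["distance"] = haversine_km(dist["lat1"], dist["lon1"],
                                dist["lat2"], dist["lon2"])
```

Because the whole column is computed in one array operation, this avoids the per-row Python overhead of apply. The results will differ slightly from the vincenty values, since Haversine assumes a spherical Earth.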