Comparing columns in two separate pandas frames

Question

I have two data frames, both of which contain latitude and longitude columns. For every lat / lon entry in the first data frame, I want to evaluate every lat / lon pair in the second data frame to determine the distance.

For example:

df1:

     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21

df2:

     lat     lon
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

distance between 38.32, -100.50 and 37.65, -97.87
distance between 38.32, -100.50 and 33.31, -96.40
distance between 38.32, -100.50 and 36.22, -100.01
distance between 42.51, -97.39 and 37.65, -97.87
distance between 42.51, -97.39 and 33.31, -96.40
... and so on ...

I'm not sure how to do this.

Thanks for the help!

python pandas

user1985891 03 Apr 17 at 18:37

3 answers

piRSquared · Answer 1 · 2017-04-03T18:54:10+0000

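The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated as

$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

You can compute this row by row for your two data frames with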

((df1 - df2) ** 2).sum(1) ** .5

0    2.714001
1    9.253113
2    4.232363
dtype: float64
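As a quick check, here is a minimal sketch with the question's sample data typed in by hand. It suggests that the expression aligns rows by index, so it returns one distance per matching row rather than all nine combinations, and the value is in degrees, not in kilometers.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# Row i of df1 is compared with row i of df2 only.
rowwise = ((df1 - df2) ** 2).sum(axis=1) ** 0.5

# The same value for row 0, written out by hand.
row0 = np.sqrt((38.32 - 37.65) ** 2 + (-100.50 - (-97.87)) ** 2)

print(rowwise)
print(row0)
```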

MaxU · Answer 2 · 2017-04-03T19:05:24+0000

UPDATE: as @root noted, it doesn't make sense to use a Euclidean metric in this case, so use sklearn.neighbors.DistanceMetric

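import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric

# Haversine expects [lat, lon] in radians and returns distances on a unit sphere.
dist = DistanceMetric.get_metric('haversine')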

First, build a DataFrame with all combinations (credit: @root):

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)
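On pandas 1.2 or later, the same table of combinations can probably be built without the helper column `k`. The following is a sketch under that version assumption, using the `how="cross"` option of `merge` and the question's sample data typed in by hand.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# how="cross" pairs every row of df1 with every row of df2 (requires pandas >= 1.2).
# Overlapping column names receive the suffixes, giving lat1, lon1, lat2, lon2.
pairs = df1.merge(df2, how="cross", suffixes=("1", "2"))

print(pairs)
```

The row order matches the `k=1` merge: each row of df1 is repeated once for every row of df2.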

Then compute the haversine distances in a vectorized way (6367 is an approximate Earth radius in km):

x['dist'] = np.ravel(dist.pairwise(np.radians(df1),np.radians(df2)) * 6367)

Result:

In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594

OLD answer:

IIUC, you can use scipy.spatial.distance.pdist to compute pairwise distances:

In [32]: from scipy.spatial.distance import pdist

In [43]: from itertools import combinations

In [34]: X = pd.concat([df1, df2])

In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

As a pandas Series:

In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))

In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64

As a pandas DataFrame:

In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831
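Note that `pdist` on the concatenated frame also returns distances between rows of the same frame. If only the df1-to-df2 pairs are wanted, `scipy.spatial.distance.cdist` may be a closer fit. This is a sketch with the question's sample data typed in by hand; it is still Euclidean, so the values are in degrees rather than true distances.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# d[i, j] is the Euclidean distance between row i of df1 and row j of df2.
d = cdist(df1[["lat", "lon"]].to_numpy(), df2[["lat", "lon"]].to_numpy())

# Label rows with the df1 index and columns with the df2 index.
cross = pd.DataFrame(d, index=df1.index, columns=df2.index)

print(cross)
```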

root · Answer 3 · 2017-04-03T19:29:31+0000

You can perform a cross join to get all lat / lon combinations and then calculate the distance using an appropriate measure. To do this, you can use the geopy package, which supplies geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances, with vincenty giving more accurate results but being slower to compute.

from geopy.distance import vincenty

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])

    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)

I've used kilometers as units in this example, but others can be specified, for example:

vincenty(latlon1, latlon2).miles

Result:

    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874

Edit

As @MaxU pointed out in the comments, you can use a NumPy implementation of the Haversine formula in a similar way for added performance. This should be equivalent to the great_circle function in geopy.
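A NumPy version of the Haversine formula might look like the sketch below. The function name `haversine_matrix` and the constant `EARTH_RADIUS_KM` are hypothetical names chosen for illustration. The radius of 6367 km is the approximation used earlier in this thread, and the sample data is typed in by hand from the question.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6367.0  # approximate; same value as used earlier in the thread

def haversine_matrix(a, b, radius=EARTH_RADIUS_KM):
    """Great-circle distance between every row of a and every row of b.

    a and b are frames with 'lat' and 'lon' columns in degrees.
    Returns an array of shape (len(a), len(b)).
    """
    # Column vectors for a and row vectors for b, so broadcasting forms all pairs.
    lat1 = np.radians(a["lat"].to_numpy())[:, None]
    lon1 = np.radians(a["lon"].to_numpy())[:, None]
    lat2 = np.radians(b["lat"].to_numpy())[None, :]
    lon2 = np.radians(b["lon"].to_numpy())[None, :]

    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(h))

df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

dist_km = haversine_matrix(df1, df2)
print(dist_km)
```

Calling `dist_km.ravel()` gives the distances in the same order as the rows of the cross-joined frame.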
