Function vectorization in pandas

Question

Function vectorization in pandas

I have a dataframe that contains a list of lat / lon coordinates:

d = {'Provider ID': {0: '10001',
  1: '10005',
  2: '10006',
  3: '10007',
  4: '10008',
  5: '10011',
  6: '10012',
  7: '10016',
  8: '10018',
  9: '10019'},
 'latitude': {0: '31.215379379000467',
  1: '34.22133455500045',
  2: '34.795039606000444',
  3: '31.292159523000464',
  4: '31.69311635000048',
  5: '33.595265517000485',
  6: '34.44060759100046',
  7: '33.254429322000476',
  8: '33.50314015000049',
  9: '34.74643089500046'},
 'longitude': {0: ' -85.36146587999968',
  1: ' -86.15937514799964',
  2: ' -87.68507485299966',
  3: ' -86.25539902199966',
  4: ' -86.26549483099967',
  5: ' -86.66531866799966',
  6: ' -85.75726760699968',
  7: ' -86.81407933399964',
  8: ' -86.80242858299965',
  9: ' -87.69893502799965'}}
df = pd.DataFrame(d)

My goal is to use the haversine function to determine the distances between each element in KM:

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km

My goal is to get a dataframe that looks like result_df below, where values are the distance between each provider ID:

 result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])

I can do it in a loop, but it's terribly slow. I'm looking for some help converting this method to a vectorized method:

for first_hospital_coordinates in result_df.columns:
    for second_hospital_coordinates in result_df['Provider ID']:
        if first_hospital_coordinates == 'Provider ID':
            pass
        else:
            L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values
            O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values
            L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values
            O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values

            distance = haversine(O1, L1, O2, L2)

            crit = result_df['Provider ID'] == second_hospital_coordinates
            result_df.loc[crit, first_hospital_coordinates] = distance

+3

python pandas dataframe

DataSwede Dec 20. 14 at 0:39

source to share

3 answers

You don't need anything interesting, just a few mods for your function.

First, don't use a library math

. If you are doing real math or science, you will probably be better off with numpy.

Secondly, we will use the dataframe method apply

. What it does apply

is that it executes a function and runs each row (axis = 1) or column (axis = 0) through it and creates a new pandas object with all return values. So we need to set up the haversine

totake line in the dataframe and unpack the values. It will be:

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    import numpy as np

    # convert all of the row to radians
    row = np.radians(row)

    # unpack the values for convenience
    lat1 = row['lat1']
    lat2 = row['lat2']
    lon1 = row['lon1']
    lon2 = row['lon2']

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km

Ok, now we need to get the form data in the form. In your question, it is all string and this is bad for math. So using your variable d

I said:

df = pandas.DataFrame(d).set_index('Provider ID').astype(float)

Thus created a dataframe of rows, set the provider as the index and then convert all columns to floats as we are doing the math.

Now we need to create strings with two sets of coordinates. For this we will use the method shift

and attach the result to the original file frame. Doing it straight away looks like this:

df = df.join(df.shift(), lsuffix='1', rsuffix='2')
print(df.head())

                  lat1       lon1       lat2       lon2
Provider ID                                            
10001        31.215379 -85.361466        NaN        NaN
10005        34.221335 -86.159375  31.215379 -85.361466
10006        34.795040 -87.685075  34.221335 -86.159375
10007        31.292160 -86.255399  34.795040 -87.685075
10008        31.693116 -86.265495  31.292160 -86.255399

rsuffix

and lsuffix

is what adds "1" and "2" to the column names during the join operation.

Columns "2" are from df.shift()

, and you will notice that they are equal to columns "1" of the previous row. You will also see that the first row of columns "2" is NaN

because nothing precedes the first row.

So now we can have apply

the Haversine function:

distance = df.apply(haversine, axis=1)
print(distance)
Provider ID
10001                 NaN
10005          342.261590
10006          153.567591
10007          411.393751
10008           44.566642
10011          214.661170
10012          125.775583
10016          163.973219
10018           27.659157
10019          160.901128
dtype: float64

+2

Paul H Dec 20. '14 at 1:21

source to share

You should be able to work on all of this. I'm not very familiar with Pandas, so I'll just work with basic arrays numpy

. Use of your data d

:

df = pd.DataFrame(d)
df1 = df.astype(float)
a = np.radians(df1.values[:,1:])
# a.shape is 10,2, it contains the Lat/Lon only

# transpose and subtract
# add a new axes so they can be broadcast
diff = a[...,np.newaxis] - a.T
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]

b = np.square(np.sin(diff / 2))
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]

# make this term: cos(Lat1) * cos(Lat2)
cos_Lat = np.cos(a[:,0])
c = cos_Lat * cos_Lat[:, np.newaxis]    # shape 10x10

# sin^2(dLon/2) is b[:,1,:]
b[:,1,:] = b[:,1,:] * c
g = b.sum(axis = 1)
h = 6367000 * 2 * np.arcsin((np.sqrt(g)))   # meters

Come back to pandas.DataFrame

df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)

I have not tested performance tests. There are many array creation in between, and this can be expensive - using an optional out argument ufuncs

can make this easier.

Same with in-place operations:

df = pd.DataFrame(d)
df_A = df.astype(float)
z = df_A.values[:,1:]

# cos(Lat1) * cos(Lat2)
w = np.cos(z[:,0])
w = w * w[:, np.newaxis]    # w.shape is (10,10)

# sin^2(dLat/2) and sin^2(dLon/2)
np.radians(z, z)
z = z[...,np.newaxis] - z.T
np.divide(z, 2, z)
np.sin(z, z)
np.square(z,z)
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]

# cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
np.multiply(z[:,1,:], w, z[:,1,:])

# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
z = z.sum(axis = 1)

np.sqrt(z, z)
np.arcsin(z,z)
np.multiply(z, 6367000 * 2, z)   #meters

df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)

+1

wwii Dec 21 14 at 2:28 am

source to share

nitin · Accepted Answer · 2014-12-20T01:46:48+0000

In order to vectorize this code, you will need to work with a complete dataframe, not separate lat and length. I made a try. I need df result and new h2 function,

import numpy as np
def h2(df, p):
    inrad = df.applymap(radians)
    dlon = inrad.longitude-inrad.longitude[p]
    dlat = inrad.latitude-inrad.latitude[p]
    lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))])
    a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
    c = 2 * 1/np.sin(np.sqrt(a))
    km = 6367 * c
    return km

df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index = df.index, columns = df.index)
for c in df2.columns:
    df2[c] = h2(df, c)

print (df2)

This should result (I can't be sure I have the correct answer ... my goal was to vectorize the code)

Provider ID         10001         10005         10006         10007  \
Provider ID                                                           
10001                 inf  5.021936e+05  5.270062e+05  1.649088e+06   
10005        5.021936e+05           inf  9.294868e+05  4.985233e+05   
10006        5.270062e+05  9.294868e+05           inf  4.548412e+05   
10007        1.649088e+06  4.985233e+05  4.548412e+05           inf   
10008        1.460299e+06  5.777248e+05  5.246954e+05  3.638231e+06   
10011        6.723581e+05  2.004199e+06  1.027439e+06  6.394402e+05   
10012        4.559090e+05  3.265536e+06  7.573411e+05  4.694125e+05   
10016        7.680036e+05  1.429573e+06  9.105474e+05  7.517467e+05   
10018        7.096548e+05  1.733554e+06  1.020976e+06  6.701920e+05   
10019        5.436342e+05  9.278739e+05  2.891822e+07  4.638858e+05   

Provider ID         10008         10011         10012         10016  \
Provider ID                                                           
10001        1.460299e+06  6.723581e+05  4.559090e+05  7.680036e+05   
10005        5.777248e+05  2.004199e+06  3.265536e+06  1.429573e+06   
10006        5.246954e+05  1.027439e+06  7.573411e+05  9.105474e+05   
10007        3.638231e+06  6.394402e+05  4.694125e+05  7.517467e+05   
10008                 inf  7.766998e+05  5.401081e+05  9.496953e+05   
10011        7.766998e+05           inf  1.341775e+06  4.220911e+06   
10012        5.401081e+05  1.341775e+06           inf  1.119063e+06   
10016        9.496953e+05  4.220911e+06  1.119063e+06           inf   
10018        8.236437e+05  1.242451e+07  1.226941e+06  5.866259e+06   
10019        5.372119e+05  1.051748e+06  7.514774e+05  9.362341e+05   

Provider ID         10018         10019  
Provider ID                              
10001        7.096548e+05  5.436342e+05  
10005        1.733554e+06  9.278739e+05  
10006        1.020976e+06  2.891822e+07  
10007        6.701920e+05  4.638858e+05  
10008        8.236437e+05  5.372119e+05  
10011        1.242451e+07  1.051748e+06  
10012        1.226941e+06  7.514774e+05  
10016        5.866259e+06  9.362341e+05  
10018                 inf  1.048895e+06  
10019        1.048895e+06           inf  

[10 rows x 10 columns]

Function vectorization in pandas

More articles: