Loop through arrays to find the Euclidean distance in Python
This is what I have so far:
import numpy as np
Stats2003 = np.loadtxt('/DataFiles/2003.txt')
Stats2004 = np.loadtxt('/DataFiles/2004.txt')
Stats2005 = np.loadtxt('/DataFiles/2005.txt')
Stats2006 = np.loadtxt('/DataFiles/2006.txt')
Stats2007 = np.loadtxt('/DataFiles/2007.txt')
Stats2008 = np.loadtxt('/DataFiles/2008.txt')
Stats2009 = np.loadtxt('/DataFiles/2009.txt')
Stats2010 = np.loadtxt('/DataFiles/2010.txt')
Stats2011 = np.loadtxt('/DataFiles/2011.txt')
Stats2012 = np.loadtxt('/DataFiles/2012.txt')
Stats = Stats2003, Stats2004, Stats2005, Stats2006, Stats2007, Stats2008, Stats2009, Stats2010, Stats2011, Stats2012
I am trying to calculate the Euclidean distance between each of these arrays and every other array, but I am having a hard time doing it.
I can get the output I want by calculating each distance by hand, like this:
dist1 = np.linalg.norm(Stats2003-Stats2004)
dist2 = np.linalg.norm(Stats2003-Stats2005)
dist11 = np.linalg.norm(Stats2004-Stats2005)
and so on, but I would like to do these calculations with a loop.
I am displaying the results in a table using PrettyTable.
Can anyone point me in the right direction? I haven't found any previous solutions that worked.
Take a look at scipy.spatial.distance.cdist.
From the documentation:
Calculates the distance between each pair of two sets of inputs.
So, you can do something like the following:
import numpy as np
from scipy.spatial.distance import cdist
# start year to stop year
years = range(2003,2013)
# this will yield an n_years X n_features array (each file is flattened into one row)
features = np.array([np.loadtxt('/DataFiles/%s.txt' % year).ravel() for year in years])
# compute the euclidean distance from each year to every other year
distance_matrix = cdist(features,features,metric = 'euclidean')
If you know the starting year and you are not missing data for any years, then it is easy to determine which two years are compared at a coordinate (m, n) in the distance matrix: entry (m, n) is the distance between year 2003 + m and year 2003 + n.
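The index-to-year lookup can be sketched as follows. This is only an illustration under stated assumptions: the data is random stand-in data rather than the text files from the question, the helper year_distance is a hypothetical name, and the distance matrix is built with plain NumPy broadcasting, which should give the same values as cdist with the 'euclidean' metric.

```python
import numpy as np

# Hypothetical stand-in data: one row of 4 made-up features per year,
# used here instead of the text files from the question.
years = list(range(2003, 2013))
rng = np.random.default_rng(0)
features = rng.random((len(years), 4))

# Pairwise Euclidean distances via broadcasting; this should match
# cdist(features, features, metric='euclidean').
diff = features[:, None, :] - features[None, :, :]
distance_matrix = np.linalg.norm(diff, axis=-1)


def year_distance(y1, y2):
    """Look up the distance between two years by converting them to indices."""
    return distance_matrix[years.index(y1), years.index(y2)]


# Row m corresponds to years[m], i.e. 2003 + m (and likewise for column n).
m, n = 1, 4
print(years[m], years[n], distance_matrix[m, n])
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal are needed when filling a table.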
To loop, you need to keep the data out of your variable names. A simple solution is to use dictionaries instead. The loops are implicit in the dict comprehensions:
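import itertools as it
import numpy as np
years = range(2003, 2013)
# load each year's array, keyed by year
stats = {y: np.loadtxt('/DataFiles/{}.txt'.format(y)) for y in years}
# one distance for every unordered pair of years
dists = {(y1, y2): np.linalg.norm(stats[y1] - stats[y2]) for (y1, y2) in it.combinations(years, 2)}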
Now you can access the statistics for a specific year, e.g. 2007, with stats[2007]
and the distances with tuples, e.g. dists[(2007, 2011)]
...
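One detail worth noting: itertools.combinations only yields each pair once, with the earlier year first, so dists[(2011, 2007)] would raise a KeyError. A minimal sketch of an order-independent lookup and of turning the dict into table rows is shown below. It assumes small made-up vectors instead of the loaded files, and lookup is a hypothetical helper name. The rows are printed with plain string formatting here, but the same tuples could be passed to PrettyTable's add_row.

```python
import itertools as it
import numpy as np

# Hypothetical stand-in data: three made-up feature vectors instead of
# the text files loaded in the answer above.
stats = {
    2003: np.array([0.0, 0.0]),
    2004: np.array([3.0, 4.0]),
    2005: np.array([6.0, 8.0]),
}
years = sorted(stats)

# One distance per unordered pair; keys are always (earlier, later).
dists = {(y1, y2): np.linalg.norm(stats[y1] - stats[y2])
         for y1, y2 in it.combinations(years, 2)}


def lookup(y1, y2):
    """Return the distance for a pair of years given in either order."""
    if y1 == y2:
        return 0.0
    return float(dists[(min(y1, y2), max(y1, y2))])


# One row per pair, sorted by year, ready to be shown in a table.
rows = [(y1, y2, round(float(d), 3)) for (y1, y2), d in sorted(dists.items())]
for y1, y2, d in rows:
    print("{} {} {:8.3f}".format(y1, y2, d))
```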