You probably know that the traveling salesman problem is NP-hard. From Wikipedia -- https://en.wikipedia.org/wiki/Travelling_salesman_problem
The traveling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science.
Yet consider everyday examples such as garbage collection, USPS delivery, meals-on-wheels, or GPS navigation. Each needs a scalable solution that minimizes distance traveled and maximizes resource utilization. This routing need has many applications.
When exact solutions are computationally costly, heuristics offer faster (if perhaps suboptimal) answers. One such routing heuristic, by Dr. Bartholdi (http://connection.ebscohost.com/c/articles/6691618/minimal-technology-routing-system-meals-wheels), uses Hilbert space-filling curves to produce short routes for TSP instances.
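To get a sense of scale: a brute-force search over 20 cities must consider every possible tour, and the tour count grows factorially. A quick back-of-the-envelope check in Python:
import math
# Distinct tours through 20 cities that return to the start
# (fix the starting city; count each direction once)
print(math.factorial(19) // 2)  # 60822550204416000 -- about 6.1e16 tours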
Furthermore, suppose you are comparing records -- say, a hotel in a city against nearby businesses such as a Subway -- to make exchange referrals. Computing every business's Euclidean distance to every other business is quadratic and impractical in Hadoop/Spark. To limit the join space, we use an equality predicate such as zip code (only comparing businesses that share a zip code). Unfortunately, zip codes are not contiguous for contiguous areas, so we need a contiguous surrogate -- such as the Hilbert space-filling curve indexes discussed below -- to make the search perform better.
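As a rough sketch of that idea (the businesses and coordinates below are hypothetical, and we use the s2sphere library introduced later in this post), bucketing records by a coarse cell along a space-filling curve lets a join compare only records in the same contiguous region:
import pandas as pd
from s2sphere import CellId, LatLng
# Truncate a location's cell ID to a coarse, contiguous region
def coarse_cell(lat, lng, level=10):
    return CellId.from_lat_lng(LatLng.from_degrees(lat, lng)).parent(level).id()
# Hypothetical records: two Manhattan businesses and one in Los Angeles
businesses = pd.DataFrame({
    'name': ['Hotel A', 'Subway B', 'Diner C'],
    'lat': [40.7580, 40.7614, 34.0522],
    'lng': [-73.9855, -73.9776, -118.2437]})
# Nearby points usually share a coarse cell; production systems
# also check neighboring cells to handle boundary cases
businesses['bucket'] = [coarse_cell(a, b) for a, b in zip(businesses['lat'], businesses['lng'])]
# Self-join on the bucket: only nearby pairs are compared, not all n^2 pairs
pairs = businesses.merge(businesses, on='bucket', suffixes=('_l', '_r'))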
Let us suppose a gopher wants to travel from New York to the 20 most populous US cities.
The data for the top 20 cities comes from https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
# Read data from the web: the top 20 most populous cities
import pandas as pd
import numpy as np
# The fourth table on the page holds the city list
cities = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population', header=0)[3].head(20)
# Display a preview of the cities
pd_display(cities, "Top 20 most populous cities in the US from Wikipedia")
Let us keep only the columns we need.
# Keep only the city names and lat-lng locations; we are not interested in the other columns
interested_cities = cities[['City', 'Location']].copy()
# Strip any footnote references from the city name
interested_cities['City'] = interested_cities['City'].apply(lambda x: x.split('[')[0])
# Make the dataframe index on city
interested_cities.set_index('City', inplace=True, drop=True)
# Write logic to parse the lat-lng code
from LatLon import string2latlon
import re

def parse_lat_lon(code_str):
    # Strip the degree symbol, then split on the N/E/W/S hemisphere markers
    (lat, lng) = map(lambda x: x.replace('|', ' '),
                     re.sub(r'([NEWS])', r'|\1',
                            code_str.encode('ascii', errors='ignore').decode('ascii')).split())
    # Parse degrees-plus-hemisphere strings into a LatLon object
    return string2latlon(lat, lng, 'D% %H')
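# Hypothetical example resembling Wikipedia's decimal coordinate strings:
#   parse_lat_lon(u'40.7127°N 74.0059°W')
# should yield approximately (40.7127, -74.0059)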
# Collect only the lat-lng codes
interested_cities['Location'] = interested_cities['Location'].\
    apply(lambda x: parse_lat_lon(x.split('/')[1]))
# Collect latitude
interested_cities['Latitude'] = interested_cities['Location'].\
    apply(lambda x: float(str(x).split(',')[0].strip()))
# Collect longitude
interested_cities['Longitude'] = interested_cities['Location'].\
    apply(lambda x: float(str(x).split(',')[1].strip()))
# Recenter the coordinates so (0,0) falls at the mean of the map
interested_cities['RelLongitude'] = interested_cities['Longitude'] - interested_cities['Longitude'].mean()
interested_cities['RelLatitude'] = interested_cities['Latitude'] - interested_cities['Latitude'].mean()
# Display preview of lat-lng and city
pd_display(interested_cities, "Latitude and Longitude of the 20 cities")
Encode the latitude and longitude into a "space filling curve index" -- a monotonic, locality-preserving index that can be used to quickly look up relative distances in 2 (or even higher) dimensions. See https://arxiv.org/pdf/1109.2323v2.pdf and http://www.cslab.ece.ntua.gr/twiki/pub/Collab/MultidimensionalIndexing/Alternative_Algorithm_for_Hilbert_s_Space_Filling_Curve.pdf for details.
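For intuition, here is a minimal sketch of the classic grid-to-Hilbert-index routine (adapted from the algorithm described on the Wikipedia page for Hilbert curves); nearby grid points tend to receive nearby indices:
def xy2d(n, x, y):
    # Map grid point (x, y) on an n-by-n grid (n a power of two)
    # to its distance d along the Hilbert curve
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve lines up
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
# Adjacent grid points get consecutive indices
print(xy2d(8, 0, 0), xy2d(8, 0, 1), xy2d(8, 1, 1))  # 0 1 2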
Google's S2 geometry library provides a generalized cell ID that exhibits similar locality-preserving traits to the Hilbert space-filling curve. See http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/ for a detailed explanation.
# Encode an S2 cell ID for each city
from s2sphere import CellId, LatLng
# A cell ID is a single 64-bit integer that resolves a geographic location
# down to roughly a square centimeter on Earth
get_cellid = lambda lat, lng: CellId.from_lat_lng(LatLng.from_degrees(lat, lng)).id() if all([lat, lng]) else 0
# Put the cell ID in the dataframe
interested_cities['CellID'] = interested_cities.apply(
    lambda x: get_cellid(float(x['RelLatitude']), float(x['RelLongitude'])), axis=1)
# Make the S2 index relative to New York
interested_cities['RelCellID'] = np.abs(
    interested_cities['CellID'] - interested_cities.loc['New York']['CellID'])
# Display preview
display(interested_cities.sort_values('RelCellID'))
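As a quick sanity check of the locality property (the coordinates below are approximate, and the comparison is illustrative rather than a guarantee of the S2 encoding):
# Nearby cities should produce numerically closer cell IDs than distant ones
sf = get_cellid(37.7749, -122.4194)   # San Francisco (approx.)
sj = get_cellid(37.3382, -121.8863)   # San Jose (approx.)
ny = get_cellid(40.7128, -74.0060)    # New York (approx.)
print(abs(sf - sj) < abs(sf - ny))    # expected: True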
Merely eyeballing the RelCellID values suggests that the cities farthest from New York are Seattle and San Francisco; the cell IDs also suggest that San Francisco and San Jose are very close to each other on the map.
Let us use a very simple heuristic: always travel to the closest unvisited city next, i.e., the nearest-neighbor heuristic (https://en.wikipedia.org/wiki/Nearest_neighbour_algorithm), to plan the travel from New York. Since we begin our travel in New York, let us identify the next target.
# Compute routine to identify the next closest location from a city
def get_order(interested_cities, src_city):
    # Start the route at the source city
    visited = [src_city]

    # For each state, aka city, find the closest city not yet visited
    def find_nearest(city):
        # Recompute the S2 index relative to the current city
        interested_cities['RelCellID'] = np.abs(
            interested_cities['CellID'] - interested_cities.loc[city]['CellID'])
        # Sort by relative cell ID and drop cities already visited
        candidates = [c for c in interested_cities.sort_values('RelCellID').index.values
                      if c not in visited]
        if candidates:
            target_city = candidates[0]
            visited.append(target_city)
            # Recurse until all cities have been visited
            return find_nearest(target_city)
        else:
            # All cities visited; return to the source to close the tour
            visited.append(src_city)
            return visited

    return find_nearest(src_city)
# Compute plan
plan = pd.DataFrame(get_order(interested_cities, 'New York'), columns=["City"])
# Form an ordered series of records
plan.set_index('City', inplace=True, drop=True)
# Assign the order of the visit
plan['order'] = range(len(plan))
# Display the plan
pd_display(plan, "Order in which we may visit the cities starting from New York")
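For comparison, Bartholdi's space-filling-curve heuristic needs no repeated nearest-neighbor search at all: visiting points in the order of their curve index is itself a route. With the CellID column computed above, that is a one-line sort:
# Visit cities in the order they appear along the space-filling curve
sfc_plan = interested_cities.sort_values('CellID').index.tolist()
print(sfc_plan)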
Now that we have an ordered route, let us bring back the latitude and longitude so we may plot the route.
# Join with original records for lat and longitude information
city_coordinates = plan.join(interested_cities).sort_values('order')
Plot the route on a map.
# Plot the map
import folium
import seaborn as sns
# Initialize a US map centered on the mean of the coordinates
# (recent folium versions render inline in a notebook without initialization)
geomap = folium.Map(
    zoom_start=4,
    max_zoom=6,
    min_zoom=4,
    location=[
        city_coordinates['Latitude'].mean(),
        city_coordinates['Longitude'].mean()
    ])
# Create a polyline connecting the cities in visit order
lines = folium.PolyLine(
    locations=city_coordinates[['Latitude', 'Longitude']].values.tolist(),
    color='blue',
    weight=5,
    opacity=0.3)
# Create a marker for each city, shaded by visit order
palette = sns.color_palette("Blues", len(city_coordinates))
for name, city in city_coordinates.iterrows():
    fill = '#%02x%02x%02x' % tuple(int(c * 255) for c in palette[int(city['order'])])
    folium.Circle(
        radius=100000,  # meters, so each marker covers roughly 100 km
        fill=True,
        fill_color=fill,
        color='blue',
        location=[city.Latitude, city.Longitude],
        popup="{0} visited {1}".format(name, int(city['order']))).add_to(geomap)
# Plot map
geomap.add_child(lines)
geomap.fit_bounds(
[[
city_coordinates['Latitude'].min() - 2,
city_coordinates['Longitude'].min() - 2
], [
city_coordinates['Latitude'].max() + 2,
city_coordinates['Longitude'].max() + 2
]],
max_zoom=4)
# Render
display(geomap)
The visual plot reveals a path that seems not too circuitous. Is it optimal? We will let you make that determination. But the point of this exercise was --