Data Bloom

Efficient Routing

You may already know that the traveling salesman problem is NP-hard. From Wikipedia -- https://en.wikipedia.org/wiki/Travelling_salesman_problem

The traveling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science.

Yet consider everyday examples such as garbage collection, USPS delivery, meals-on-wheels, or GPS navigation: each needs a scalable way to minimize distance traveled and maximize resource utilization. Routing of this kind has many applications.

When exact solutions are computationally costly, heuristics offer faster (if only near-optimal) alternatives. One such routing heuristic, described by Dr. Bartholdi (http://connection.ebscohost.com/c/articles/6691618/minimal-technology-routing-system-meals-wheels), uses Hilbert space-filling curves to produce short routes for TSP-style problems.
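To make that idea concrete, here is a minimal sketch (not Dr. Bartholdi's production system): map each stop to its position along a Hilbert curve and visit stops in index order. The xy2d routine is the classic conversion from the Wikipedia article on Hilbert curves; the grid size and stop coordinates below are made up purely for illustration.

def xy2d(n, x, y):
    # Map (x, y) on an n-by-n grid (n a power of two) to its distance
    # along the Hilbert curve -- the classic routine from Wikipedia
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve stays contiguous
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Hypothetical delivery stops on a 256 x 256 grid; sorting by Hilbert
# index yields a short (not necessarily optimal) visiting order
stops = [(20, 30), (220, 210), (25, 35), (200, 40), (30, 200)]
route = sorted(stops, key=lambda p: xy2d(256, p[0], p[1]))

Because the curve is locality preserving, stops that are close on the plane tend to be close in index order, so a single sort produces a reasonable route in O(n log n).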

Furthermore, when you are comparing records -- say, a hotel in a city against nearby businesses such as a Subway restaurant, for making exchange referrals -- computing every business's Euclidean distance to every other business is quadratic and impractical in Hadoop/Spark. To limit the join space, we use an equality predicate such as zipcode (only comparing businesses that share a zipcode). Unfortunately, zip codes are not contiguous for contiguous areas, so we need a contiguous surrogate, like the Hilbert space-filling curves discussed below, to make the search perform better.
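As a sketch of that bucketing idea (using the s2sphere library introduced later in this notebook; the level-10 cell choice and the tiny dataframes are illustrative assumptions): give every record a coarse cell key and join on it, so only records in the same cell are ever compared.

import pandas as pd
from s2sphere import CellId, LatLng

# Coarse S2 cell (level 10, roughly 10 km across) as an equality-join key;
# the level is an assumption -- tune it to the comparison radius you need
def cell_key(lat, lng, level=10):
    return CellId.from_lat_lng(LatLng.from_degrees(lat, lng)).parent(level).to_token()

# Hypothetical records for illustration
hotels = pd.DataFrame({'name': ['Hotel A'], 'lat': [40.7580], 'lng': [-73.9855]})
shops = pd.DataFrame({'name': ['Sandwich Shop'], 'lat': [40.7614], 'lng': [-73.9776]})
hotels['cell'] = [cell_key(r.lat, r.lng) for r in hotels.itertuples()]
shops['cell'] = [cell_key(r.lat, r.lng) for r in shops.itertuples()]

# An equality join on the cell key replaces the quadratic cross-join
pairs = hotels.merge(shops, on='cell', suffixes=('_hotel', '_shop'))

In practice one would also join each record against the neighboring cells, so that pairs straddling a cell boundary are not missed.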

Top 20 Cities

Let us suppose a gopher wants to travel from New York to the 20 most populous US cities.

In [1]:
# Read data from the web; the top 20 cities
import pandas as pd
import numpy as np

cities = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population', header=0)[3].head(20)

# Display a preview of the cities (pd_display is a notebook helper that renders a captioned dataframe)
pd_display(cities, "Top 20 most populous cities in the US from Wikipedia")

Top 20 most populous cities in the US from Wikipedia

2015 rank City State[5] 2015 estimate ... Change 2014 land area 2010 population density Location
0 1 New York[6] New York 8550405 ... +4.59% 302.6 sq mi (783.8 km2) 27,012 per sq mi (10,430/km2) 40°39′51″N 73°56′19″W / 40.6643°N 73.9385°W...
1 2 Los Angeles California 3971883 ... +4.73% 468.7 sq mi (1,213.9 km2) 8,092 per sq mi (3,124/km2) 34°01′10″N 118°24′39″W / 34.0194°N 118.4108°...
2 3 Chicago Illinois 2720546 ... +0.93% 227.6 sq mi (589.6 km2) 11,842 per sq mi (4,572/km2) 41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W...
... ... ... ... ... ... ... ... ... ...
17 18 Seattle Washington 684451 ... +12.45% 83.9 sq mi (217.4 km2) 7,251 per sq mi (2,800/km2) 47°37′14″N 122°21′03″W / 47.6205°N 122.3509°...
18 19 Denver[12] Colorado 682545 ... +13.73% 153.0 sq mi (396.3 km2) 3,923 per sq mi (1,515/km2) 39°45′42″N 104°52′50″W / 39.7618°N 104.8806°...
19 20 El Paso Texas 681124 ... +4.93% 255.2 sq mi (661.1 km2) 2,543 per sq mi (982/km2) 31°50′54″N 106°25′37″W / 31.8484°N 106.4270°...

20 rows × 9 columns

Let us keep only the data we need.

In [2]:
# Keep only the city name and location; we are not interested in the other columns
interested_cities = cities[['City', 'Location']].copy()

# Strip any footnote references (e.g. "New York[6]") from the city name
interested_cities['City'] = interested_cities['City'].apply(lambda x: x.split('[')[0])

# Index the dataframe on city
interested_cities.set_index('City', inplace=True, drop=True)

# Logic to parse a decimal "40.6643°N 73.9385°W"-style string
from LatLon import string2latlon
import re
def parse_lat_lon(code_str):
    # Drop the degree symbols, then split into "<degrees> <hemisphere>" pairs
    (lat, lng) = map(lambda x: x.replace('|', ' '),
                     re.sub(r'([NEWS])', r'|\1',
                            code_str.encode('ascii', errors='ignore')).split())
    # Parse decimal degrees followed by a hemisphere letter
    return string2latlon(lat, lng, 'D% %H')

# Keep only the decimal lat-lng portion of the Location column
interested_cities['Location'] = interested_cities['Location'].\
    apply(lambda x: parse_lat_lon(x.split('/')[1]))

# Collect latitude
interested_cities['Latitude'] = interested_cities['Location'].\
    apply(lambda x: float(str(x).split(',')[0].strip()))

# Collect longitude
interested_cities['Longitude'] = interested_cities['Location'].\
    apply(lambda x: float(str(x).split(',')[1].strip()))

# Recenter so (0,0) lies at the centroid of the cities
interested_cities['RelLongitude'] = interested_cities['Longitude'] - interested_cities['Longitude'].mean()

interested_cities['RelLatitude'] = interested_cities['Latitude'] - interested_cities['Latitude'].mean()
    
# Display preview of lat-lng and city
pd_display(interested_cities, "Latitude and Longitude of the 20 cities")

Latitude and Longitude of the 20 cities

Location Latitude Longitude RelLongitude RelLatitude
City
New York 40.6643, -73.9385 40.66 -73.94 25.05 4.78
Los Angeles 34.0194, -118.4108 34.02 -118.41 -19.42 -1.86
Chicago 41.8376, -87.6818 41.84 -87.68 11.30 5.96
... ... ... ... ... ...
Seattle 47.6205, -122.3509 47.62 -122.35 -23.37 11.74
Denver 39.7618, -104.8806 39.76 -104.88 -5.89 3.88
El Paso 31.8484, -106.427 31.85 -106.43 -7.44 -4.03

20 rows × 5 columns

Encode

Encode the latitude and longitude into a "space filling curve index" -- a monotonic, locality-preserving index that can be used to quickly look up relative distances in 2 (or even higher) dimensions. See https://arxiv.org/pdf/1109.2323v2.pdf and http://www.cslab.ece.ntua.gr/twiki/pub/Collab/MultidimensionalIndexing/Alternative_Algorithm_for_Hilbert_s_Space_Filling_Curve.pdf for details.

Google's S2 geometry provides a generalized cell id that exhibits similar locality-preserving traits to the Hilbert space-filling curve. See http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/ for a detailed explanation.
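As a quick illustration of that locality (the two midtown-Manhattan coordinates are hypothetical probes, not data from this exercise): nearby points share coarse ancestor cells, so their 64-bit cell ids are numerically close.

from s2sphere import CellId, LatLng

a = CellId.from_lat_lng(LatLng.from_degrees(40.7580, -73.9855))  # near Times Square
b = CellId.from_lat_lng(LatLng.from_degrees(40.7527, -73.9772))  # near Grand Central

# Levels at which both points fall inside the same ancestor cell;
# nearby points typically agree at all coarse levels, until some
# finer level finally splits them into different cells
shared_levels = [level for level in range(31) if a.parent(level) == b.parent(level)]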

In [3]:
# Encode an S2 cell id for each city
from s2sphere import CellId, LatLng

# A cell id is a single 64-bit number that resolves a geographic location
# down to roughly a square centimeter on Earth. Note: all([lat, lng]) treats
# an exact-zero coordinate as missing, which is acceptable for these cities
get_cellid = lambda lat,lng: CellId.from_lat_lng(LatLng.from_degrees(lat,lng)).id() if all([lat,lng]) else 0

# Put the cell id (of the recentered coordinates) in the dataframe
interested_cities['CellID'] = interested_cities.apply(
    lambda x: get_cellid(float(x['RelLatitude']), float(x['RelLongitude'])), axis=1)

# Make the S2 index relative to New York
interested_cities['RelCellID'] = np.abs(interested_cities[
    'CellID'] - interested_cities.loc['New York', 'CellID'])

# Display preview
display(interested_cities.sort_values('RelCellID'))
Location Latitude Longitude RelLongitude RelLatitude CellID RelCellID
City
New York 40.6643, -73.9385 40.66 -73.94 25.05 4.78 1673033395779129293 0
Philadelphia 40.0094, -75.1333 40.01 -75.13 23.85 4.13 1678214086752040423 5180690972911130
Charlotte 35.2087, -80.8307 35.21 -80.83 18.16 -0.67 1876166626263424565 203133230484295272
... ... ... ... ... ... ... ...
San Diego 32.8153, -117.135 32.82 -117.14 -18.15 -3.07 427424719826299983 1245608675952829310
Phoenix 33.5722, -112.088 33.57 -112.09 -13.10 -2.31 397784215078795125 1275249180700334168
El Paso 31.8484, -106.427 31.85 -106.43 -7.44 -4.03 391226763282846791 1281806632496282502

20 rows × 7 columns

Merely eyeballing the RelCellID values suggests that the points farthest from New York are Seattle and San Francisco; the CellID values also suggest that San Francisco and San Jose are very close to each other on the map.
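Cell-id arithmetic is only a rough proxy for distance, so such observations can be sanity-checked with a true great-circle distance. A small haversine sketch, using Earth's mean radius (~6371 km) and approximate coordinates for San Francisco and San Jose:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere with Earth's mean radius (~6371 km)
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

# San Francisco vs San Jose: roughly 70 km apart, consistent with
# their nearly adjacent cell ids
sf_sj_km = haversine_km(37.7775, -122.4164, 37.2969, -121.8193)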

Order

Let us use a very simple heuristic: from the current location, always travel to the nearest unvisited city next. This greedy, shortest-distance-first choice is in the same spirit as Dijkstra's algorithm (https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). Since we are beginning our travel from New York, let us identify the next target.

In [4]:
# Routine to identify the next closest location from a city
def get_order(interested_cities, src_city):
    # Push the starting city as the initial state
    visited = [src_city]

    # For each state, aka city, find the closest city that has not been visited
    def find_nearest(city):
        # Make the S2 index relative to the current city
        interested_cities['RelCellID'] = np.abs(interested_cities['CellID'] - interested_cities.loc[city, 'CellID'])
        # Find the closest cities to the current one that have not already been visited
        unvisited = [c for c in interested_cities.sort_values('RelCellID').index.values if c not in visited]
        if unvisited:
            target_city = unvisited[0]
            visited.append(target_city)
            # Recurse until all cities have been visited
            return find_nearest(target_city)
        else:
            # Close the loop by returning to the source city
            visited.append(src_city)
            return visited
    return find_nearest(src_city)

# Compute the plan
plan = pd.DataFrame(get_order(interested_cities, 'New York'), columns=["City"])

# Form an ordered series of records
plan.set_index('City', inplace=True, drop=True)

# Assign the order of the visit
plan['order'] = range(len(plan))

# Display the plan
pd_display(plan, "Order in which we may visit the cities starting from New York")

Order in which we may visit the cities starting from New York

order
City
New York 0
Philadelphia 1
Charlotte 2
... ...
Phoenix 18
El Paso 19
New York 20

21 rows × 1 columns

Plotting Route

Now that we have an ordered route, let us bring back the latitude and longitude so we may plot the route.

In [5]:
# Join with original records for lat and longitude information
city_coordinates = plan.join(interested_cities).sort_values('order')

Plot.

In [6]:
# Plot the map
import folium
import seaborn as sns  # palette for coloring markers by visit order

# initialize_notebook() is the older folium API for inline rendering;
# newer folium versions render maps inline automatically
folium.initialize_notebook()

# Initialize a map
geomap = folium.Map(zoom_start=4,
                    max_zoom=6,
                    min_zoom=4,
                    location=[
                        city_coordinates['Latitude'].mean(),
                        city_coordinates['Longitude'].mean()
                    ])  # US map

# Create a polyline connecting the cities in visit order
lines = folium.PolyLine(
    locations=city_coordinates[['Latitude', 'Longitude']].values.tolist(),
    color='blue',
    weight=5,
    opacity=0.3)

# Create markers
for name, city in city_coordinates.iterrows():
    marker = folium.CircleMarker(
        radius=100000,
        fill_color='#%02x%02x%02x' % tuple(
            map(lambda x: int(x * 255),
                sns.color_palette("Blues", len(city_coordinates))[city[
                    'order']])),
        color='blue',
        location=[city.Latitude, city.Longitude],
        popup="{0} visited {1}".format(name, city['order'])).add_to(geomap)

# Plot map
geomap.add_child(lines)
geomap.fit_bounds(
    [[
        city_coordinates['Latitude'].min() - 2,
        city_coordinates['Longitude'].min() - 2
    ], [
        city_coordinates['Latitude'].max() + 2,
        city_coordinates['Longitude'].max() + 2
    ]],
    max_zoom=4)

# Render
display(geomap)

Optimized?

The visual plot reveals a path that seems not too circuitous. Is it optimal? We will let you make that determination (a brute-force check for small subsets is sketched after this list). But the point of this exercise was --

  1. That heuristics can provide near-optimal solutions when exact solutions may not be conducive to realtime performance.
  2. That locality-preserving space-filling indices, like the S2 index and the Hilbert space-filling curve, may be used to quickly estimate nearby landmarks by a mere lookup of index values instead of computing a true Euclidean distance.
  3. That such locality-preserving indices, with their R-tree-like hierarchical nature, may be used to turn complex theta joins into equality joins when comparing records in Hadoop, because Hadoop, like most relational engines, does not allow non-equality joins without a costly cross-join operation.
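If you do want to verify optimality yourself, exact brute force is feasible for a dozen or so cities; with all 20, there are 19! ≈ 1.2 × 10^17 orderings to try, which is exactly why the heuristics above exist. A sketch, assuming some distance function dist(a, b) such as the haversine above:

from itertools import permutations

def tour_length(order, dist):
    # Length of the closed tour that returns to the starting city
    return sum(dist(order[i], order[(i + 1) % len(order)])
               for i in range(len(order)))

def brute_force_tsp(cities, dist):
    # Fix the first city and try every ordering of the rest
    first, rest = cities[0], tuple(cities[1:])
    return min(((first,) + p for p in permutations(rest)),
               key=lambda order: tour_length(order, dist))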