Data Bloom

Efficient Routing

You may already know that the traveling salesman problem is NP-hard. From Wikipedia -- https://en.wikipedia.org/wiki/Travelling_salesman_problem

The traveling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science.

Yet consider everyday examples such as garbage collection, USPS delivery, meals-on-wheels, or GPS navigation: each needs a scalable way to minimize distance traveled and maximize resource utilization. Routing of this kind has many applications.

When exact solutions are computationally costly, heuristics offer faster (if only near-optimal) alternatives. One such routing heuristic, described by Dr. Bartholdi (http://connection.ebscohost.com/c/articles/6691618/minimal-technology-routing-system-meals-wheels), uses Hilbert space-filling curves to produce short routes for TSP-style problems.
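To make that idea concrete, here is a minimal sketch (not Dr. Bartholdi's production system): map each stop to its position along a Hilbert curve and visit stops in index order. The xy2d routine is the classic conversion from the Wikipedia article on Hilbert curves; the grid size and stop coordinates below are made up purely for illustration.

def xy2d(n, x, y):
    # Map (x, y) on an n-by-n grid (n a power of two) to its distance
    # along the Hilbert curve -- the classic routine from Wikipedia
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve stays contiguous
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Hypothetical delivery stops on a 256 x 256 grid; sorting by Hilbert
# index yields a short (not necessarily optimal) visiting order
stops = [(20, 30), (220, 210), (25, 35), (200, 40), (30, 200)]
route = sorted(stops, key=lambda p: xy2d(256, p[0], p[1]))

Because the curve is locality preserving, stops that are close on the plane tend to be close in index order, so a single sort produces a reasonable route in O(n log n).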

Furthermore, when you are comparing records -- say, a hotel in a city against nearby businesses such as a Subway restaurant, for making exchange referrals -- computing every business's Euclidean distance to every other business is quadratic and impractical in Hadoop/Spark. To limit the join space, we use an equality predicate such as zipcode (only comparing businesses that share a zipcode). Unfortunately, zip codes are not contiguous for contiguous areas, so we need a contiguous surrogate, like the Hilbert space-filling curves discussed below, to make the search perform better.
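As a sketch of that bucketing idea (using the s2sphere library introduced later in this notebook; the level-10 cell choice and the tiny dataframes are illustrative assumptions): give every record a coarse cell key and join on it, so only records in the same cell are ever compared.

import pandas as pd
from s2sphere import CellId, LatLng

# Coarse S2 cell (level 10, roughly 10 km across) as an equality-join key;
# the level is an assumption -- tune it to the comparison radius you need
def cell_key(lat, lng, level=10):
    return CellId.from_lat_lng(LatLng.from_degrees(lat, lng)).parent(level).to_token()

# Hypothetical records for illustration
hotels = pd.DataFrame({'name': ['Hotel A'], 'lat': [40.7580], 'lng': [-73.9855]})
shops = pd.DataFrame({'name': ['Sandwich Shop'], 'lat': [40.7614], 'lng': [-73.9776]})
hotels['cell'] = [cell_key(r.lat, r.lng) for r in hotels.itertuples()]
shops['cell'] = [cell_key(r.lat, r.lng) for r in shops.itertuples()]

# An equality join on the cell key replaces the quadratic cross-join
pairs = hotels.merge(shops, on='cell', suffixes=('_hotel', '_shop'))

In practice one would also join each record against the neighboring cells, so that pairs straddling a cell boundary are not missed.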

Top 20 Cities

Let us suppose a gopher wants to travel from New York to the 20 most populous US cities.

In [1]:
# Read data from the web; the top 20 cities
import pandas as pd
import numpy as np

cities = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population', header=0)[3].head(20)

# Display a preview of the cities (pd_display is a notebook helper that renders a captioned dataframe)
pd_display(cities, "Top 20 most populous cities in the US from Wikipedia")

Top 20 most populous cities in the US from Wikipedia

2015 rank City State[5] 2015 estimate ... Change 2014 land area 2010 population density Location
0 1 New York[6] New York 8550405 ... +4.59% 302.6 sq mi (783.8 km2) 27,012 per sq mi (10,430/km2) 40°39′51″N 73°56′19″W / 40.6643°N 73.9385°W...
1 2 Los Angeles California 3971883 ... +4.73% 468.7 sq mi (1,213.9 km2) 8,092 per sq mi (3,124/km2) 34°01′10″N 118°24′39″W / 34.0194°N 118.4108°...
2 3 Chicago Illinois 2720546 ... +0.93% 227.6 sq mi (589.6 km2) 11,842 per sq mi (4,572/km2) 41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W...
... ... ... ... ... ... ... ... ... ...
17 18 Seattle Washington 684451 ... +12.45% 83.9 sq mi (217.4 km2) 7,251 per sq mi (2,800/km2) 47°37′14″N 122°21′03″W / 47.6205°N 122.3509°...
18 19 Denver[12] Colorado 682545 ... +13.73% 153.0 sq mi (396.3 km2) 3,923 per sq mi (1,515/km2) 39°45′42″N 104°52′50″W / 39.7618°N 104.8806°...
19 20 El Paso Texas 681124 ... +4.93% 255.2 sq mi (661.1 km2) 2,543 per sq mi (982/km2) 31°50′54″N 106°25′37″W / 31.8484°N 106.4270°...

20 rows × 9 columns

Let us keep only the data we need.

In [2]:
# Keep only the city name and location; we are not interested in the other columns
interested_cities = cities[['City', 'Location']].copy()

# Strip any footnote references (e.g. "New York[6]") from the city name
interested_cities['City'] = interested_cities['City'].apply(lambda x: x.split('[')[0])

# Index the dataframe on city
interested_cities.set_index('City', inplace=True, drop=True)

# Logic to parse a decimal "40.6643°N 73.9385°W"-style string
from LatLon import string2latlon
import re
def parse_lat_lon(code_str):
    # Drop the degree symbols, then split into "<degrees> <hemisphere>" pairs
    (lat, lng) = map(lambda x: x.replace('|', ' '),
                     re.sub(r'([NEWS])', r'|\1',
                            code_str.encode('ascii', errors='ignore')).split())
    # Parse decimal degrees followed by a hemisphere letter
    return string2latlon(lat, lng, 'D% %H')

# Keep only the decimal lat-lng portion of the Location column
interested_cities['Location'] = interested_cities['Location'].\
    apply(lambda x: parse_lat_lon(x.split('/')[1]))

# Collect latitude
interested_cities['Latitude'] = interested_cities['Location'].\
    apply(lambda x: float(str(x).split(',')[0].strip()))

# Collect longitude
interested_cities['Longitude'] = interested_cities['Location'].\
    apply(lambda x: float(str(x).split(',')[1].strip()))

# Recenter so (0,0) lies at the centroid of the cities
interested_cities['RelLongitude'] = interested_cities['Longitude'] - interested_cities['Longitude'].mean()

interested_cities['RelLatitude'] = interested_cities['Latitude'] - interested_cities['Latitude'].mean()
    
# Display preview of lat-lng and city
pd_display(interested_cities, "Latitude and Longitude of the 20 cities")

Latitude and Longitude of the 20 cities

Location Latitude Longitude RelLongitude RelLatitude
City
New York 40.6643, -73.9385 40.66 -73.94 25.05 4.78
Los Angeles 34.0194, -118.4108 34.02 -118.41 -19.42 -1.86
Chicago 41.8376, -87.6818 41.84 -87.68 11.30 5.96
... ... ... ... ... ...
Seattle 47.6205, -122.3509 47.62 -122.35 -23.37 11.74
Denver 39.7618, -104.8806 39.76 -104.88 -5.89 3.88
El Paso 31.8484, -106.427 31.85 -106.43 -7.44 -4.03

20 rows × 5 columns

Encode

Encode the latitude and longitude into a "space filling curve index" -- a monotonic, locality-preserving index that can be used to quickly look up relative distances in 2 (or even higher) dimensions. See https://arxiv.org/pdf/1109.2323v2.pdf and http://www.cslab.ece.ntua.gr/twiki/pub/Collab/MultidimensionalIndexing/Alternative_Algorithm_for_Hilbert_s_Space_Filling_Curve.pdf for details.

Google's S2 geometry provides a generalized cell id that exhibits similar locality-preserving traits to the Hilbert space-filling curve. See http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/ for a detailed explanation.
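As a quick illustration of that locality (the two midtown-Manhattan coordinates are hypothetical probes, not data from this exercise): nearby points share coarse ancestor cells, so their 64-bit cell ids are numerically close.

from s2sphere import CellId, LatLng

a = CellId.from_lat_lng(LatLng.from_degrees(40.7580, -73.9855))  # near Times Square
b = CellId.from_lat_lng(LatLng.from_degrees(40.7527, -73.9772))  # near Grand Central

# Levels at which both points fall inside the same ancestor cell;
# nearby points typically agree at all coarse levels, until some
# finer level finally splits them into different cells
shared_levels = [level for level in range(31) if a.parent(level) == b.parent(level)]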

In [3]:
# Encode an S2 cell id for each city
from s2sphere import CellId, LatLng

# A cell id is a single 64-bit number that resolves a geographic location
# down to roughly a square centimeter on Earth. Note: all([lat, lng]) treats
# an exact-zero coordinate as missing, which is acceptable for these cities
get_cellid = lambda lat,lng: CellId.from_lat_lng(LatLng.from_degrees(lat,lng)).id() if all([lat,lng]) else 0

# Put the cell id (of the recentered coordinates) in the dataframe
interested_cities['CellID'] = interested_cities.apply(
    lambda x: get_cellid(float(x['RelLatitude']), float(x['RelLongitude'])), axis=1)

# Make the S2 index relative to New York
interested_cities['RelCellID'] = np.abs(interested_cities[
    'CellID'] - interested_cities.loc['New York', 'CellID'])

# Display preview
display(interested_cities.sort_values('RelCellID'))
Location Latitude Longitude RelLongitude RelLatitude CellID RelCellID
City
New York 40.6643, -73.9385 40.66 -73.94 25.05 4.78 1673033395779129293 0
Philadelphia 40.0094, -75.1333 40.01 -75.13 23.85 4.13 1678214086752040423 5180690972911130
Charlotte 35.2087, -80.8307 35.21 -80.83 18.16 -0.67 1876166626263424565 203133230484295272
... ... ... ... ... ... ... ...
San Diego 32.8153, -117.135 32.82 -117.14 -18.15 -3.07 427424719826299983 1245608675952829310
Phoenix 33.5722, -112.088 33.57 -112.09 -13.10 -2.31 397784215078795125 1275249180700334168
El Paso 31.8484, -106.427 31.85 -106.43 -7.44 -4.03 391226763282846791 1281806632496282502

20 rows × 7 columns

Merely eyeballing the RelCellID values suggests that the points farthest from New York are Seattle and San Francisco; the CellID values also suggest that San Francisco and San Jose are very close to each other on the map.
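Cell-id arithmetic is only a rough proxy for distance, so such observations can be sanity-checked with a true great-circle distance. A small haversine sketch, using Earth's mean radius (~6371 km) and approximate coordinates for San Francisco and San Jose:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere with Earth's mean radius (~6371 km)
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

# San Francisco vs San Jose: roughly 70 km apart, consistent with
# their nearly adjacent cell ids
sf_sj_km = haversine_km(37.7775, -122.4164, 37.2969, -121.8193)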

Order

Let us use a very simple heuristic: from the current location, always travel to the nearest unvisited city next. This greedy, shortest-distance-first choice is in the same spirit as Dijkstra's algorithm (https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). Since we are beginning our travel from New York, let us identify the next target.

In [4]:
# Routine to identify the next closest location from a city
def get_order(interested_cities, src_city):
    # Push the starting city as the initial state
    visited = [src_city]

    # For each state, aka city, find the closest city that has not been visited
    def find_nearest(city):
        # Make the S2 index relative to the current city
        interested_cities['RelCellID'] = np.abs(interested_cities['CellID'] - interested_cities.loc[city, 'CellID'])
        # Find the closest cities to the current one that have not already been visited
        unvisited = [c for c in interested_cities.sort_values('RelCellID').index.values if c not in visited]
        if unvisited:
            target_city = unvisited[0]
            visited.append(target_city)
            # Recurse until all cities have been visited
            return find_nearest(target_city)
        else:
            # Close the loop by returning to the source city
            visited.append(src_city)
            return visited
    return find_nearest(src_city)

# Compute the plan
plan = pd.DataFrame(get_order(interested_cities, 'New York'), columns=["City"])

# Form an ordered series of records
plan.set_index('City', inplace=True, drop=True)

# Assign the order of the visit
plan['order'] = range(len(plan))

# Display the plan
pd_display(plan, "Order in which we may visit the cities starting from New York")

Order in which we may visit the cities starting from New York

order
City
New York 0
Philadelphia 1
Charlotte 2
... ...
Phoenix 18
El Paso 19
New York 20

21 rows × 1 columns

Plotting Route

Now that we have an ordered route, let us bring back the latitude and longitude so we may plot the route.

In [5]:
# Join with original records for lat and longitude information
city_coordinates = plan.join(interested_cities).sort_values('order')

Plot.

In [6]:
# Plot the map
import folium
import seaborn as sns  # palette for coloring markers by visit order

# initialize_notebook() is the older folium API for inline rendering;
# newer folium versions render maps inline automatically
folium.initialize_notebook()

# Initialize a map
geomap = folium.Map(zoom_start=4,
                    max_zoom=6,
                    min_zoom=4,
                    location=[
                        city_coordinates['Latitude'].mean(),
                        city_coordinates['Longitude'].mean()
                    ])  # US map

# Create a polyline connecting the cities in visit order
lines = folium.PolyLine(
    locations=city_coordinates[['Latitude', 'Longitude']].values.tolist(),
    color='blue',
    weight=5,
    opacity=0.3)

# Create markers
for name, city in city_coordinates.iterrows():
    marker = folium.CircleMarker(
        radius=100000,
        fill_color='#%02x%02x%02x' % tuple(
            map(lambda x: int(x * 255),
                sns.color_palette("Blues", len(city_coordinates))[city[
                    'order']])),
        color='blue',
        location=[city.Latitude, city.Longitude],
        popup="{0} visited {1}".format(name, city['order'])).add_to(geomap)

# Plot map
geomap.add_child(lines)
geomap.fit_bounds(
    [[
        city_coordinates['Latitude'].min() - 2,
        city_coordinates['Longitude'].min() - 2
    ], [
        city_coordinates['Latitude'].max() + 2,
        city_coordinates['Longitude'].max() + 2
    ]],
    max_zoom=4)

# Render
display(geomap)

Optimized?

The visual plot reveals a path that seems not too circuitous. Is it optimal? We will let you make that determination (a brute-force check for small subsets is sketched after this list). But the point of this exercise was --

  1. That heuristics can provide near-optimal solutions when exact solutions may not be conducive to realtime performance.
  2. That locality-preserving space-filling indices, like the S2 index and the Hilbert space-filling curve, may be used to quickly estimate nearby landmarks by a mere lookup of index values instead of computing a true Euclidean distance.
  3. That such locality-preserving indices, with their R-tree-like hierarchical nature, may be used to turn complex theta joins into equality joins when comparing records in Hadoop, because Hadoop, like most relational engines, does not allow non-equality joins without a costly cross-join operation.
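If you do want to verify optimality yourself, exact brute force is feasible for a dozen or so cities; with all 20, there are 19! ≈ 1.2 × 10^17 orderings to try, which is exactly why the heuristics above exist. A sketch, assuming some distance function dist(a, b) such as the haversine above:

from itertools import permutations

def tour_length(order, dist):
    # Length of the closed tour that returns to the starting city
    return sum(dist(order[i], order[(i + 1) % len(order)])
               for i in range(len(order)))

def brute_force_tsp(cities, dist):
    # Fix the first city and try every ordering of the rest
    first, rest = cities[0], tuple(cities[1:])
    return min(((first,) + p for p in permutations(rest)),
               key=lambda order: tour_length(order, dist))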