Reverse Geocoding Custom Data Sources

According to Wikipedia,

Reverse geocoding is the process of back (reverse) coding of a point location (latitude, longitude) to a readable address or place name. This permits the identification of nearby street addresses, places, and/or areal subdivisions such as neighbourhoods, county, state, or country. 

Reverse geocoding is crucial to the work that I do at OpenSignal where I've to churn through terabytes of crowdsourced data to compare operator performance (in terms of coverage and data speeds) and user numbers at various geographical areas across the globe. To be able to do these analyses easily, I built a geocoder library in Python that is offline and fast, and that improves on an existing one built by Richard Penman. By making this offline, you do not have to deal with slow web APIs (such as Nominatim and Google) and query limits. The library was built with speed in mind and it can geocode 10 million GPS coordinates in under 30 seconds on a machine with 8 cores.

Since its release over a year ago, the library has been pretty well received by the community. It's been downloaded over 10,000 times and has received 1230 stars on Github. Being featured as the #1 post on Hacker News certainly helped drive a lot of traffic to it. I'm really grateful to the community who have helped report and squash quite a few bugs along the way.

Under the Hood

Under the hood, the library comes packaged with a database of places with a population greater than 1000, which was obtained from GeoNames. This entire database is loaded into a k-d tree and the nearest neighbour algorithm is used to find the city/town closest to the input point location. There's a nice explanation of k-d trees in Data Skeptic, one of my favourite podcasts. The scipy implementation of k-d trees is, unfortunately, single-threaded and does not exploit the multiple CPUs available on your machine. Thus, to improve performance, I implemented a parallelised k-d tree that comes into its own for really large inputs (in the order of millions) as seen in the graph below.

Running time (in seconds) for various input sizes

For the nearest neighbour algorithm, I use the Euclidean distance formula to compare distances between any two GPS coordinates. This isn't the most accurate, especially at latitudes further away from the equator where projection distortions are much more noticeable. I had to make a tradeoff between accuracy and speed, and I chose to focus on speed. The primary use case for this library, as I envisioned it, was to be able to geocode to larger administrative regions (such as cities and counties) and at this scale, inaccuracies due to latitude distortions are not so apparent. If you're interested in smaller regions, you could implement the Haversine formula as detailed in this post.

Usage at OpenSignal

Within OpenSignal, the library is used in various ways. We use it for ad-hoc analysis of our coverage and speed data and also in our coverage maps where you can compare the performance of operators in a country/region.

OpenSignal Coverage Map for Singapore (App available on the App Store and Google Play)

If you pan across the map above, the reverse geocoder library is used to determine the main country of interest. We, however, noticed a problem when you zoom into a border region between two or more countries.

The Problem

Taking Singapore as an example, the library fails when you query for locations near the border. 

An example of the library failing at border regions

The locations circled in blue are the ones in the database obtained from GeoNames. The locations circled in black are the ones given as input to the geocoder library. The lines show the geocoded locations returned by the library for those inputs. As you can see, the lines in red show the wrong country being returned. For instance, if you search for a location in the north of Singapore such as Woodlands or Yishun, the library returns Johor Bahru in Malaysia. Singapore, being a city-state, has only one administrative region in the GeoNames database and the location is unfortunately set to the downtown area in the south.

The Solution

I've now modified the library so that it accepts custom data sources. To fix the Singapore problem, you can customise the GeoNames database by adding the following three locations:

lat lon name admin1 admin2 cc
1.38511 103.98685 Singapore East SG
1.39367 103.66711 Singapore West SG
1.45106 103.80779 Singapore North SG

You can load this custom data source and pass it to the library as follows:

import io
import reverse_geocoder as rg

geo = rg.RGeocoder(mode=2, verbose=True, stream=io.StringIO(open('custom_source.csv', encoding='utf-8').read()))
coordinates = (51.5214588,-0.1729636),(9.936033, 76.259952),(37.38605,-122.08385)
results = geo.query(coordinates)

The results of this are shown in the image below.

Geocoding fixed with custom data source

The three custom locations are circled in magenta and you can see that the library now returns the right location (as shown by the green lines) within the country of interest.

Special thanks to Grégoire Charvet for reviewing my code for this enhancement.