Reverse Geocoding Custom Data Sources

According to Wikipedia,

Reverse geocoding is the process of back (reverse) coding of a point location (latitude, longitude) to a readable address or place name. This permits the identification of nearby street addresses, places, and/or areal subdivisions such as neighbourhoods, county, state, or country. 

Reverse geocoding is crucial to the work I do at OpenSignal, where I have to churn through terabytes of crowdsourced data to compare operator performance (in terms of coverage and data speeds) and user numbers across various geographical areas around the globe. To make these analyses easier, I built a geocoder library in Python that is offline and fast, and that improves on an existing one built by Richard Penman. Because it works offline, you don't have to deal with slow web APIs (such as Nominatim and Google) or their query limits. The library was built with speed in mind, and it can geocode 10 million GPS coordinates in under 30 seconds on a machine with 8 cores.

Since its release over a year ago, the library has been pretty well received by the community. It's been downloaded over 10,000 times and has received 1230 stars on GitHub. Being featured as the #1 post on Hacker News certainly helped drive a lot of traffic to it. I'm really grateful to the community, who have helped report and squash quite a few bugs along the way.

Under the Hood

Under the hood, the library comes packaged with a database of places with a population greater than 1000, obtained from GeoNames. This entire database is loaded into a k-d tree, and the nearest-neighbour algorithm is used to find the city/town closest to the input point location. There's a nice explanation of k-d trees on Data Skeptic, one of my favourite podcasts. The scipy implementation of k-d trees is, unfortunately, single-threaded and does not exploit the multiple CPUs available on your machine. To improve performance, I implemented a parallelised k-d tree that comes into its own for really large inputs (on the order of millions), as seen in the graph below.
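The core idea can be sketched in a few lines of pure Python. This is a toy 2-d tree with a standard nearest-neighbour walk, not the library's parallelised implementation, and the coordinates below are illustrative:

```python
from collections import namedtuple

# A minimal 2-d tree: split alternately on latitude and longitude,
# then walk the tree to find the stored point nearest to a query.
Node = namedtuple("Node", ["point", "left", "right", "axis"])

def dist2(a, b):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def build(points, depth=0):
    """Recursively build a k-d tree over (lat, lon) points."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid],
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1),
                axis)

def nearest(node, query, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    if best is None or dist2(query, node.point) < dist2(query, best):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only descend the far side if the splitting plane is closer than the best hit.
    if diff * diff < dist2(query, best):
        best = nearest(far, query, best)
    return best
```

Building the tree once up front makes each lookup logarithmic rather than linear in the number of places; the parallelised version in the library additionally splits the query batch across CPU cores.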

Running time (in seconds) for various input sizes

For the nearest neighbour algorithm, I use the Euclidean distance formula to compare distances between any two GPS coordinates. This isn't the most accurate, especially at latitudes further away from the equator where projection distortions are much more noticeable. I had to make a tradeoff between accuracy and speed, and I chose to focus on speed. The primary use case for this library, as I envisioned it, was to be able to geocode to larger administrative regions (such as cities and counties) and at this scale, inaccuracies due to latitude distortions are not so apparent. If you're interested in smaller regions, you could implement the Haversine formula as detailed in this post.
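For reference, the Haversine formula mentioned above can be sketched as follows, assuming a spherical Earth of radius 6371 km:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points,
    assuming a spherical Earth of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))
```

For London (51.5074, -0.1278) to Paris (48.8566, 2.3522) this gives roughly 343 km, whereas a naive Euclidean distance in degree space increasingly overstates east-west distances as you move away from the equator.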

Usage at OpenSignal

Within OpenSignal, the library is used in various ways. We use it for ad-hoc analysis of our coverage and speed data and also in our coverage maps where you can compare the performance of operators in a country/region.

OpenSignal Coverage Map for Singapore (App available on the App Store and Google Play)

As you pan across the map above, the reverse geocoder library is used to determine the main country of interest. However, we noticed a problem when you zoom into a border region between two or more countries.

The Problem

Taking Singapore as an example, the library fails when you query for locations near the border. 

An example of the library failing at border regions

The locations circled in blue are the ones in the database obtained from GeoNames. The locations circled in black are the ones given as input to the geocoder library. The lines show the geocoded locations returned by the library for those inputs. As you can see, the lines in red show the wrong country being returned. For instance, if you search for a location in the north of Singapore such as Woodlands or Yishun, the library returns Johor Bahru in Malaysia. Singapore, being a city-state, has only one administrative region in the GeoNames database and the location is unfortunately set to the downtown area in the south.
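To see why this happens, compare the distances in degree space (the metric the nearest-neighbour search uses). The coordinates below are approximate and only for illustration:

```python
# For a query in Woodlands (northern Singapore), the GeoNames point for
# Johor Bahru (MY) is much closer than the single GeoNames point for
# Singapore, which sits downtown in the south. Coordinates are approximate.

def dist2(a, b):
    """Squared Euclidean distance between two (lat, lon) pairs, in degrees."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

woodlands = (1.4382, 103.7890)            # query point in northern Singapore
singapore_downtown = (1.2897, 103.8501)   # GeoNames entry for Singapore (SG)
johor_bahru = (1.4655, 103.7578)          # GeoNames entry for Johor Bahru (MY)

# Johor Bahru wins the nearest-neighbour lookup, so Malaysia is returned.
```

With only one Singapore entry in the database, any query in the north of the island falls on the wrong side of this comparison.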

The Solution

I've now modified the library so that it accepts custom data sources. To fix the Singapore problem, you can customise the GeoNames database by adding the following three locations:

lat,lon,name,admin1,admin2,cc
1.38511,103.98685,Singapore,East,,SG
1.39367,103.66711,Singapore,West,,SG
1.45106,103.80779,Singapore,North,,SG

You can load this custom data source and pass it to the library as follows:

import io
import reverse_geocoder as rg

# Read the custom CSV and hand it to the geocoder as a stream
with open('custom_source.csv', encoding='utf-8') as f:
    stream = io.StringIO(f.read())

geo = rg.RGeocoder(mode=2, verbose=True, stream=stream)
coordinates = [(51.5214588, -0.1729636), (9.936033, 76.259952), (37.38605, -122.08385)]
results = geo.query(coordinates)

The results of this are shown in the image below.

Geocoding fixed with custom data source

The three custom locations are circled in magenta and you can see that the library now returns the right location (as shown by the green lines) within the country of interest.

Special thanks to Grégoire Charvet for reviewing my code for this enhancement.

Roundup of PyData London Meetup on 05-July-2016

My introduction to PyData London was through my boss Brendan, who recommended the conference as a venue to showcase a library I'd built at the time. I got the wonderful opportunity to present there last year and was blown away by the community. Since then, I've tried to attend a few meetups, although I always seem to be on the waitlist. The community has grown massively, from 1000+ members this time last year to over 3500 now... so much so that they have to host an overflow meetup at a nearby pub! The main meetup is hosted at AHL's office in Monument, and they generously provide beer and pizza (and now gluten-free sandwiches!) to all attendees. Here's a roundup of last night's meetup #24.

It started off with community-related announcements by the organisers Emlyn Clay and Ian Ozsvald. The headlines: NumFOCUS has received a generous grant from the Sloan Foundation to bring more awesomeness to the community. PyData EuroPython is taking place in Spain later this month and registration is still open. Ian, who has been advocating a switch to Python 3 for months, shared his survey results, and it looks like a third of the data scientists in London are now on Python 3.4+. We at OpenSignal have migrated most of our projects over to Python 3 as well. The package of the month is PyTables, which provides a Pythonic layer over HDF5, a data model that works extremely well for time-series data.

The first talk, on writing clean code in Jupyter notebooks, was co-presented by Vlad Kazantsev and Katya Nerush. I use notebooks a lot, but mostly for prototyping and presenting results, so it was interesting to see that they are used in production at Product Madness. The talk focused on coding best practices, referencing "Uncle Bob" Martin's work on the clean coder. A cool package they recommended was pytest-ipynb, which can automatically detect and run tests within notebooks. You can find the slides of their talk here.

The second talk was by Dimitrios Kouzis-Loukas, who took us through his journey of writing a book, Learning Scrapy. This was quite eye-opening for me, and I was inspired to see that there is a handful of people within the community who have written books. I've been dabbling with Scrapy and Scrapinghub lately, and it became apparent during the talk that there's more for me to learn there... time to get the book. Effective product marketing, I say!

UPDATE [16-July-2016]: The slides of Dimitrios's presentation can be found here and here's the video recording on YouTube:

There were two lightning talks at the end: one by Steve Holden on Python Equivalences, and the other by Lev Konstantinovskiy, who organises the London Data Science Journal Club meetup, on topic modelling in gensim. The evening ended with a hilarious take by Emlyn on Brexit from a data perspective. Here's an interesting stat... only 9% of the Leavers read the Guardian!

On the whole, it was a lovely evening with some great talks. Looking forward to the next.

Installing Scientific Python on Mac OS X using Homebrew and Pip

I encountered some difficulty while installing the SciPy stack on my Mac. I tried Anaconda and Enthought Canopy, as these bundle all the essential packages in one installer. Both failed to install scipy, though, possibly due to an issue with the Apple LLVM compiler that comes with the Xcode 5.1.1 update. After some Googling, I found this wonderful blog post that uses Homebrew and pip, and it almost worked for me. Here's how I fixed some additional issues not detailed in that post.

Read More