I recently got one of my side-projects to a state where it's semi-usable, so thought I'd jot down some thoughts on the process. It's a London-centric transport routing engine, which will provide alternative routes to the typical "fastest, without service issues" from other apps. Some ideas:

  • Least-crowded route (where possible),
  • Routes with air-con (now that Summer is approaching, the shorts are out and so is the Central line),
  • Routes with step-free access (ideally better than "does this station have a blue wheelchair symbol").

The first stage of this was to have a database of stations, all possible links between them, and a way to discover multiple routes from a starting station to a goal station. The obvious technical choice here is to use some sort of graph - where the nodes are stations and the edges are the possible routes between them.

I started with data collection - TfL provide a Unified API which is an admirable effort at providing a single API to cover the many modes of transport that they offer (from tube, to buses to less obvious ones like river buses, Woolwich ferry and the cable car). Through usage, you'll discover it's not perfect - there are a few oddities, some bugs, and you need to learn a little about the NaPTAN corpus. Thankfully, the API is extensive enough that you can usually work around issues with alternative endpoints. I certainly don't fault the API team as wrapping such a system around many others (likely older or even spreadsheet-based data) must be very challenging.

Then I started linking these together in a data structure. Having never really used graphs before, my first thought was that I needed to jump into a high-level DMS such as neo4j. While this was a good introduction to graph problems (attributes, deduping, directional edges, etc.), and an excellent way of visualising the data I entered, it did mask some of the real problems I was going to run into. In particular, it wasn't obvious that when I requested all paths from West Ruislip to Bank, through any number of edges, that realistically the query was never going to finish. It might even be an infinite number of paths. I've now realised that neo4j likely uses DFS rather than BFS, and as such is susceptible to loops (transport networks are full of them, since lines are bidirectional), and the web interface at least will wait until all routes have been found. The best option is probably to request a fixed number of paths (i.e. 100) and then de-duplicate and score them.

I also ran into issues where neo4j would suggest paths like A-B-C-B-C-D. I suspect this was due to cases where stations could be connected by the Circle and District lines, but really this is the same track. I consolidated these with TfL's lineGroups data, but wasn't able to completely eradicate them since I didn't have control over how nodes or edges were treated as "visited".

Eventually I decided to explore the idea of building my own graph engine. Edd Mann's post on pratical DFS and BFS in Python was a great introduction to this and helped me realise that I could probably get away with an in-memory graph, given the relatively small and static size. I was quickly getting results, some of them reasonable!

I played around with a few graph-traversal algorithms at this stage - I was excited by the possibility of A*, since I had the latitude and longitude of stations and using a straight-line, great-circle distance as a heuristic approximation seemed pretty logical for deciding a route. Unfortunately, the implementations I've used all seem to be designed to return a single shortest route (which makes sense given that it was designed for real-time robot planning) and I'm not sure if it can be adapted to give multiple paths. I'm finding traversal algorithms interesting though and I'll revisit this part later - I'm just using BFS for now.

This week I switched out my homebrew Graph system for the lightweight igraph library, to reduce the amount of code I had to work with. Reassuringly it had a very similar API, so I knew I was on the right track.

I've added some small scoring rules - travel time, number of changes, A/C availablility (average temperature data would be amazing TfL!) - which are giving me pretty good results. My next steps include building some sort of UI (probably with React Native), incorporating TfL's crowding data and integrating actual train times. I've had a couple of goes at working timetables into the graph (i.e. an edge for each vehicle journey) which is not looking like a good idea - so I'll probably have to add it as some sort of post-sorting function.