Dijkstra's algorithm on huge graphs

Question

I am very familiar with Dijkstra and I have a specific question about the algorithm. If I have a huge graph, for example 3.5 billion nodes (all OpenStreetMap data) then I clearly wouldn't be able to have the graph in memory, so the graph is stored on disk in a database.

There are libraries available to compute shortest paths on such graphs. How do they do this? More specifically, how do they load the required part of the graph to run Dijkstra's algorithm?

Fetching the adjacency list of each vertex visited would require about 1,500 database queries per 10,000 nodes according to my statistical data, so that clearly is not how they do it. That would be way too slow.

How do they do it? I am trying to implement it myself.

score 6 · Accepted Answer · answered Mar 29 '16 at 14:34

There are libraries available to compute shortest paths on such graphs. How do they do this? More specifically, how do they load the required part of the graph to run Dijkstra's algorithm?

You can use a DB, a custom file format read from disk, or an in-memory setting.

But in my experience, using a DB is roughly 5 to 10 times slower and a lot more memory-intensive than writing your own file format based on a 'simple' linked list format.

The good thing is that several open-source software frameworks use OSM, so you can look right into the code, e.g. see here. In the GraphHopper open-source routing engine it is very easy to switch from a memory-mapped (disk-based) setting to the in-memory setting, both using the same format. The "mmap" setting even allows usage on memory-restricted mobile devices, while the in-memory setting performs a lot faster if you have the necessary RAM, e.g. on a server. For a worldwide graph (>100 million nodes) you then need around 8-10 GB of RAM, plus a lot more RAM if you want to speed everything up further, e.g. with Contraction Hierarchies: roughly 5-8 GB more for every vehicle you want.

The format is very simplistic and basically stores only the data you need with a few tricks to make it compact. Read more about it here. Disclaimer: I'm the author of GraphHopper.
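To make the idea of one file format serving both the disk-based and the in-memory setting more concrete, here is a minimal Python sketch. It is not GraphHopper's actual format: the layout (a small header, an offsets array, then flat target/weight pairs) and the names `write_csr` and `MappedGraph` are assumptions made up for illustration. The point is only that neighbor lookups become offset arithmetic on a memory-mapped file, with the OS paging in the parts of the graph that the search touches.

```python
import mmap
import struct


def write_csr(path, num_nodes, edges):
    """Write a directed graph to a flat binary file (hypothetical layout).

    All values are little-endian uint32: node count, edge count,
    num_nodes + 1 offsets, then one (target, weight) pair per edge.
    """
    adj = [[] for _ in range(num_nodes)]
    for u, v, w in edges:
        adj[u].append((v, w))
    offsets = [0]
    flat = []
    for nbrs in adj:
        for v, w in nbrs:
            flat.extend((v, w))
        offsets.append(len(flat) // 2)
    with open(path, "wb") as f:
        f.write(struct.pack("<II", num_nodes, len(flat) // 2))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        f.write(struct.pack(f"<{len(flat)}I", *flat))


class MappedGraph:
    """Read-only view of the file above; nothing is loaded eagerly."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self.num_nodes, self.num_edges = struct.unpack_from("<II", self._mm, 0)
        self._offsets_base = 8
        self._edges_base = 8 + 4 * (self.num_nodes + 1)

    def neighbors(self, u):
        """Yield (target, weight) pairs of u, read straight from the mapping."""
        start, end = struct.unpack_from(
            "<II", self._mm, self._offsets_base + 4 * u
        )
        for i in range(start, end):
            yield struct.unpack_from("<II", self._mm, self._edges_base + 8 * i)

    def close(self):
        self._mm.close()
        self._file.close()
```

With a layout like this, an in-memory variant could read the same bytes into a buffer and reuse the same offset arithmetic, which is one plausible reason a single format can serve both settings.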

Regarding the other answers:

Dijkstras algorithm while applicable is regarded as not optimal for this problem

The 'normal' Dijkstra can perform very reasonably (<1s for country-wide queries like your 3 million nodes example) and is optimal in the theoretical sense, but it needs a bit of tuning to get fast in production scenarios. Techniques like Contraction Hierarchies use a bidirectional modification of it and perform very well.
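For readers unfamiliar with the bidirectional modification, the following is a minimal sketch of plain bidirectional Dijkstra in Python. It is not Contraction Hierarchies (which additionally preprocesses the graph with shortcuts) and it is not taken from any library; the function names and the dict-of-lists graph representation are assumptions for illustration. Edge weights are assumed to be non-negative.

```python
import heapq


def reverse_graph(adj):
    """Build the reverse adjacency needed by the backward search."""
    radj = {}
    for u, edges in adj.items():
        for v, w in edges:
            radj.setdefault(v, []).append((u, w))
    return radj


def bidirectional_dijkstra(adj, radj, source, target):
    """Return the shortest distance from source to target, or inf."""
    if source == target:
        return 0
    inf = float("inf")
    dist = [{source: 0}, {target: 0}]      # index 0: forward, 1: backward
    heaps = [[(0, source)], [(0, target)]]
    graphs = [adj, radj]
    settled = [set(), set()]
    best = inf                              # best meeting-point distance so far
    while heaps[0] and heaps[1]:
        # No undiscovered path can be shorter than the sum of both queue minima.
        if heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        side = 0 if heaps[0][0][0] <= heaps[1][0][0] else 1
        d, u = heapq.heappop(heaps[side])
        if u in settled[side]:
            continue                        # stale queue entry
        settled[side].add(u)
        for v, w in graphs[side].get(u, ()):
            nd = d + w
            if nd < dist[side].get(v, inf):
                dist[side][v] = nd
                heapq.heappush(heaps[side], (nd, v))
            if v in dist[1 - side]:
                # v has been reached from both ends: candidate full path.
                best = min(best, dist[side][v] + dist[1 - side][v])
    return best
```

Each search settles far fewer vertices than a single search that has to grow all the way to the target, which is where the practical speed-up on road networks comes from.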

road networks are hierarchical and planar.

Road networks are hierarchical for cars only, and they are not planar (bridges, tunnels, ...).

score 2 · Answer 2 · answered Apr 05 '16 at 20:53

You do not need to put every adjacent edge into the priority queue. "Lie" to Dijkstra's algorithm: when a vertex w is pulled off the queue, give the algorithm only the closest vertex v adjacent to w. Then, when v is pulled off the queue, say "oops, I made a mistake" and also hand over the next-closest vertex adjacent to w. It is easy to see that this still yields a correct solution, and the queue size is dramatically reduced, since it holds one adjacent vertex per processed vertex instead of all of them. You do, however, need to keep track of each vertex's neighbors in order of distance so that you can always supply the next-closest one when required.

One of the comments claimed that road networks are planar; that is incorrect. In fact, a study has shown they are highly non-planar. Think of all the motorways crossing via bridges through a city, inducing many non-planarities.
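A minimal Python sketch of this "lazy" queue idea follows. The names `sort_adjacency` and `lazy_dijkstra` and the dict-based graph are assumptions for illustration, and the sketch assumes non-negative edge weights with each adjacency list sorted by weight. Each queue entry remembers which vertex offered it and its position in that vertex's list, so that the next-closest neighbor can be offered once the entry is popped.

```python
import heapq


def sort_adjacency(adj):
    """Sort every adjacency list by edge weight (closest neighbor first)."""
    return {u: sorted(edges, key=lambda e: e[1]) for u, edges in adj.items()}


def lazy_dijkstra(sorted_adj, source):
    """Shortest distances from source, queueing one edge per settled vertex."""
    dist = {source: 0}   # holds settled vertices only
    heap = []            # entries: (distance, target, origin, edge index)

    def offer(u, i):
        # Queue the i-th closest neighbor of settled vertex u, if it exists.
        edges = sorted_adj.get(u, ())
        if i < len(edges):
            v, w = edges[i]
            heapq.heappush(heap, (dist[u] + w, v, u, i))

    offer(source, 0)
    while heap:
        d, v, u, i = heapq.heappop(heap)
        offer(u, i + 1)          # the "oops": hand over u's next neighbor
        if v not in dist:
            dist[v] = d          # first time v is popped: d is final
            offer(v, 0)
    return dist
```

Because the lists are sorted, the next neighbor of a vertex never has a smaller key than the one just popped, so deferring it does not change the order in which vertices are settled. The queue holds at most one entry per settled vertex.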

score 0 · Answer 3 · answered Mar 28 '16 at 22:27

Dijkstra's algorithm, while applicable, is regarded as not optimal for this problem, although more efficient variants could be considered "similar". there are various simplifications. road networks are hierarchical and planar. here are the basic approaches. the area is generally known as "route planning in road networks".

a graph structure can be "compiled" from the adjacency list data. this is the approach in the library you cite, SpatiaLite. these graph structures are stored in a compressed binary format where graph locations are represented by binary-encoded integers etc., so the graph representation and manipulation takes much less space than storing all the road names etc.; it appears the SpatiaLite algorithm is not "online" and runs entirely in memory.
there are parallel/distributed algorithms. see eg Scalable GPU Graph Traversal / Merrill, Garland, Grimshaw.
the question uses client-server terminology ie "queries". the algorithms do not run by "querying" the database in the client-server sense. higher level query languages such as SQL are an interface to the database and may be used to transmit the request to compute the minimal routes but are not used by the algorithm internally. generally the algorithm runs "inside the database" ie entirely "server side". hence writing a shortest path algorithm in database queries is feasible for small networks but not medium/large scale ones.
there is another approach where estimations within small percentages may be acceptable. the basic idea is to keep an index of distances between nodes. see eg Fast and Accurate Estimation of Shortest Paths in Large Graphs / Gubichev, Bedathur, Seufert, Weikum
this (235p!) PhD thesis is especially applicable. Route Planning in Road Networks / Schultes
some algorithms use many of these ideas and others, are highly tuned and proprietary, and verge on competitive trade secrets, eg Google's. there may be some misleading media on this subject, eg The Simple, Elegant Algorithm That Makes Google Maps Possible, which claims/implies Google uses Dijkstra's algorithm without any citation.
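as a rough illustration of the "index of distances" idea mentioned above, here is a minimal Python sketch based on landmarks. this is one common way to build such an index and is not claimed to be the exact method of the cited paper; the function names, the choice of landmarks, and the assumption of an undirected graph with non-negative weights are all illustrative. distances from a few landmark nodes are precomputed once, and the triangle inequality then gives a lower and an upper bound on any distance without searching the graph.

```python
import heapq


def dijkstra(adj, source):
    """Plain Dijkstra; adj maps node -> list of (neighbor, weight)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def build_index(adj, landmarks):
    """Precompute distances from every landmark (the 'index')."""
    return {lm: dijkstra(adj, lm) for lm in landmarks}


def estimate(index, s, t):
    """Return (lower, upper) bounds on d(s, t) for an undirected graph."""
    inf = float("inf")
    # Upper bound: go from s to t via the best landmark.
    upper = min(
        (d.get(s, inf) + d.get(t, inf) for d in index.values()), default=inf
    )
    # Lower bound: reverse triangle inequality |d(L, s) - d(L, t)|.
    lower = max(
        (abs(d[s] - d[t]) for d in index.values() if s in d and t in d),
        default=0,
    )
    return lower, upper
```

the true distance always lies between the two bounds; how tight they are depends heavily on where the landmarks sit relative to the query.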
