16

In answering this question, I was looking for references (textbooks, papers, or implementations) which represent a graph using a set (e.g. hashtable) for the adjacent vertices, rather than a list. That is, the graph is a map from vertex labels to sets of adjacent vertices:

graph: Map<V, Set<V>>

In fact, I thought that this representation was completely standard and commonly used, since it allows O(1) querying for edge existence, O(1) edge deletion, and iteration over the adjacency set in O(1) time per element. I have always represented graphs this way, both in my own implementations and in teaching.

To my surprise, most algorithms textbooks do not cover this directly, and instead represent it using a list of labels:

graph: Map<V, List<V>>

As far as I understand, adjacency lists seem strictly worse: both representations support O(1) vertex additions and iteration over adjacent edges, but adjacency lists require O(m) for edge removal or edge existence (in the worst case).
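To make the comparison concrete, here is a minimal Python sketch of the two representations (the vertex labels and edges are made up for illustration):

```python
# Adjacency sets: expected O(1) edge lookup and deletion.
adj_set = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
}

# Adjacency lists: edge lookup and deletion scan the whole list.
adj_list = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}

def has_edge_set(g, u, v):
    return v in g[u]          # expected O(1) hash lookup

def has_edge_list(g, u, v):
    return v in g[u]          # O(deg(u)) linear scan

print(has_edge_set(adj_set, "a", "c"))    # True
print(has_edge_list(adj_list, "b", "c"))  # False
```

Both support the same operations; only the cost of membership testing and deletion differs.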

Yet I am baffled that, for example Cormen Leiserson Rivest Stein: Introduction to Algorithms, Morin: Open Data Structures, and Wikipedia all suggest using adjacency lists. They mainly contrast adjacency lists with adjacency matrices, but the idea of storing adjacent elements as a set is only mentioned briefly in an off-hand comment as an alternative to the list representation, if at all. (For example, Morin mentions this on page 255, "What type of collection should be used to store each element of adj?") I must be missing something basic.

Q: What is the advantage of using a list instead of a set for adjacent vertices?

  • Is this a pedagogical choice, an aversion to hashmaps/hashsets, a historical accident, or something else?

  • This question is closely related, but asks about the representation graph: Set<(V, V)>. The top answer suggests using my representation. Looking for a bit more context on this.

  • The second answer suggests hash collisions are a problem. But if hash sets are not preferred, another representation of maps and sets can be used, and we still get great performance for edge removal with a possible additional logarithmic factor in cost.

  • Bottom line: I don't understand why anyone would implement the edges as a list, unless all vertex degrees are expected to be small.

Caleb Stanford

4 Answers

29

In many algorithms we never need to check whether two vertices are adjacent: DFS, BFS, Dijkstra's algorithm, and many other algorithms only ever enumerate neighborhoods.

In the cases where we only need to enumerate the neighborhoods, a list/vector/array far outperforms typical set structures. Python's set, for example, uses a hashtable underneath, which is much slower to iterate over and uses much more memory.
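The memory overhead is easy to observe directly in CPython (exact sizes vary by version and platform; this is just an illustration):

```python
import sys

neighbors = list(range(100))    # adjacency list: packed array of references
neighbor_set = set(neighbors)   # adjacency set: open-addressing hash table

# The hash table keeps spare capacity to limit collisions, so it
# typically occupies several times the memory of the packed list.
print(sys.getsizeof(neighbors))
print(sys.getsizeof(neighbor_set))
```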

If you want really efficient algorithms (and who doesn't), you take this into account.

If you need $O(1)$ lookup of adjacencies and don't intend to do much neighborhood enumeration (and can afford the space), you use an adjacency matrix. If expected $O(1)$ is good enough, you can use hashtables, sets, trees, or other datastructures with the performance you need.


I suspect, however, that the reason you don't hear about this so often is that in algorithms classes it makes the analysis much simpler to use lists: we don't need to talk about expected running time and hash functions.


Editing in two comments from leftaroundabout and jwezorek. Many real-world graphs are very sparse, and you often see $O(1)$-sized degrees for most vertices. This means that even if you want to do lookups, looping through a list is not necessarily much slower, and can in many cases be faster.

As a "proof", I add some statistics from the graphs from Stanford Network Analysis Platform. Out of approximately 100 large graphs, the average degrees are

Avg. degree    Number of graphs
< 10           35
< 20           43
< 30           10
< 40            4
< 50            2
< 70            3
< 140           1
< 350           1
Ainsley H.
2

In general you are correct; however, the list version is sort-of legitimate in the odd case where there can be more than one edge between two nodes (a multigraph). Of course, it would still be more conceptually correct to represent the neighbors of each node as a mapping from neighbor to the number of parallel edges.
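That mapping-based multigraph representation might look like this in Python (the vertex names are illustrative):

```python
from collections import Counter, defaultdict

# Multigraph: for each node, map neighbor -> number of parallel edges.
graph = defaultdict(Counter)

def add_edge(g, u, v):
    g[u][v] += 1
    g[v][u] += 1

add_edge(graph, "a", "b")
add_edge(graph, "a", "b")   # a parallel edge
add_edge(graph, "a", "c")

print(graph["a"]["b"])  # 2: two parallel edges between a and b
print(graph["a"]["c"])  # 1
```

A missing neighbor simply has count 0, so edge-existence queries remain expected O(1).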

When I say "conceptually correct" I mean that, since the neighbors of a node are not ordered, using an ordered structure to represent that collection is improper. The abstract mathematical definition of a graph represents the collection of neighbors as a set, because that is what it is, and programming close to the mathematical representation helps with clarity. If tomorrow I stumble upon a graph implementation where the collection of neighbors is an array or a list, I will immediately suspect that neighbor order plays a role somewhere, which is an enormous addition to mental load.

Depending on the expected use of the graph, it can be faster to use linked lists, arrays, tree-based sets or hash-based sets to hold the neighbors of each node. In my experience (with high level languages where these collections are easy to use), sets are usually the best choice because testing for adjacency between two arbitrary nodes is what I typically happen to do a lot.

As to why they are represented like that in those books, I think you gave the answer already: the authors didn't think the distinction was worth much discussion. Graphs are an ancient topic in programming courses, and it may well be that these passages date back to a time when lists and arrays were abstractions provided by the language itself but sets were not. It was much easier and less confusing back then to talk to students about the collections they could most easily get access to.

Kafein
1

In one of the comments, Yuval mentions that the memory footprint is far lower. I'd like to expand on that point a little.

The adjacency data structure is conceptually a subset of $V$. Suppose there are $k$ edges from some vertex. Assuming the edges are independent, there are ${ \left|V\right| \choose k }$ possible subsets. So any representation of this adjacency collection must require at least this many bits to store:

$$\log { \left|V\right| \choose k }$$

(Note that since we're doing information theory, all logarithms are base 2.)

Let $p = \frac{k}{\left|V\right|}$ be the probability of an edge being present. Then by Stirling's approximation:

$$\begin{eqnarray*}\log { \left|V\right| \choose k } & \approx & \left|V\right| \log \left|V\right| - k \log k - (\left|V\right|-k) \log (\left|V\right|-k) \\ & = & \left|V\right| H_b(p)\end{eqnarray*}$$

where $H_b$ is the binary entropy function:

$$H_b(p) = -p \log p - (1-p) \log (1-p)$$

When $p \approx \frac{1}{2}$, $H_b(p) \approx 1$, so you need at least $\left|V\right|$ bits of information to represent this collection. A bit vector is optimal.

When $p \ll \frac{1}{2}$, $\log { \left|V\right| \choose k } \approx k \log \left|V\right|$, so an array of $k$ elements, each containing an integer between $1$ and $\left|V\right|$, is optimal.

(Note that when $p > \frac{1}{2}$, it may be more space efficient to store the vertices that are not present.)

So in the sparse case, an adjacency list is not only smaller, it is an optimal representation.
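One can sanity-check the sparse-case bound numerically in Python (the particular $\left|V\right|$ and $k$ here are arbitrary):

```python
from math import comb, log2

n, k = 10**6, 10            # |V| = 1,000,000 vertices, degree k = 10

exact = log2(comb(n, k))    # information-theoretic minimum, log C(|V|, k)
approx = k * log2(n)        # bits used by an array of k vertex ids

# The array costs only slightly more than the minimum (the gap is
# roughly log2(k!) bits), so for sparse graphs it is near-optimal.
print(exact, approx)
```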

Pseudonym
1

Here is a different way to think about it. Let's try to represent a graph with adjacency hash tables instead of adjacency lists.

What hash function will we choose, and how large must the table be?

If there are $n$ vertices and they are already numbered $0...n-1$, then I think the most sensible choice of hash function is the identity function! And the size of each hash table will be exactly $n$. Finally, each hash table is simply a boolean vector of length $n$.

So instead of an array of hash tables, we end up with a boolean matrix, known as an adjacency matrix. Adjacency matrices are well studied, and they are better than adjacency lists for some purposes but worse for others.

  • "Iterate over the neighbours of a given vertex" is a $\Theta(d)$ operation with adjacency lists but a $\Theta(n)$ operation with an adjacency matrix, where $d$ is the degree of that vertex and $n$ the total number of vertices, so lists are better for that purpose;
  • "Are two given vertices neighbours?" is an $O(1)$ operation with a matrix but an $O(d)$ operation with adjacency lists;
  • You can do linear algebra with a matrix, but not with lists; in particular, if $M$ is the adjacency matrix of a graph, then $M^k$ gives the number of walks of length $k$ between each pair of vertices; using fast matrix exponentiation, this can allow for much faster algorithms than adjacency lists for some problems.
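The linear-algebra point can be illustrated in a few lines of pure Python (the triangle graph here is just an example):

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n))
             for j in range(n)] for i in range(n)]

# Adjacency matrix M of the triangle graph on vertices 0, 1, 2.
M = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]

M2 = matmul(M, M)   # entry (i, j) counts walks of length 2 from i to j
M3 = matmul(M2, M)  # entry (i, j) counts walks of length 3 from i to j

print(M2[0][0])  # 2: the walks 0->1->0 and 0->2->0
print(M3[0][0])  # 2: the walks 0->1->2->0 and 0->2->1->0
```

With repeated squaring, $M^k$ takes $O(\log k)$ matrix multiplications, which is what enables the faster algorithms mentioned above.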

Adjacency hash tables appear to be slower than adjacency lists on the first point, and slower than matrices on the last two points.

Stef