
I've heard this interview question asked a lot, and I was hoping to get some opinions on what good answers might be: you have a large file (10+ GB) and you want to find out which element occurs the most. What is a good way to do this?

Iterating over the file and keeping track of counts in a map is probably not a good idea, since that uses a lot of memory, and keeping track as entries come in isn't really an option either, since when this question is posed the file usually already exists.

Another thought I had was to split the file up, iterate over and process the pieces with multiple threads, and then combine those results, but the memory issue for the maps is still there.

Pat

3 Answers

When you have a really large file with many elements in it, but the most common element is very common -- it occurs in more than a $1/k$ fraction of the positions -- you can find it in linear time using $O(k)$ words of space (the constant in the $O()$ notation is very small, basically 2, if you don't count storage for auxiliary things like hashing). Moreover, this works great with external storage, as the file is processed in sequence one element at a time, and the algorithm never "looks back". One way to do this is via a classical algorithm by Misra and Gries; see these lecture notes. The problem is now known as the heavy hitters problem (the frequent elements being the heavy hitters).

The assumption that the most frequent element appears in more than a $1/k$ fraction of the positions for a small $k$ may seem strong, but it is in a sense necessary! That is, if you only have sequential access to your file (and when the file is huge, random access will be too expensive), any algorithm that always finds the most frequent element in a constant number of passes must use space linear in the number of elements. So if you don't assume something about the input, you cannot beat a hash table. Assuming that the most frequent element is very frequent is maybe the most natural way to get around this negative result.

Here is a sketch for $k = 2$, i.e. when there is a single element that occurs more than half the time. This special case is handled by the majority vote algorithm, due to Boyer and Moore. We keep a single element and a single count. Initialize the count to 1 and store the first element of the file, then process the rest of the file in sequence:

  • if the current element of the file is the same as the stored element, increase the count by one
  • if the current element of the file is different from the stored element, decrease the count by one
  • if the updated count is 0, "kick out" the stored element, store the current element of the file, and set the count to 1
  • proceed to the next element of the file

A little bit of thinking about this procedure will convince you that if there exists a "majority" element, i.e. one that occurs more than half the time, then that element will be the stored element after the whole file is processed.
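Here is a minimal sketch of this majority-vote pass in Python (assuming one element per line of the file; the function name is just illustrative):

    def majority_candidate(path):
        """Boyer-Moore majority vote: one pass, O(1) extra space."""
        candidate, count = None, 0
        with open(path) as f:
            for line in f:
                element = line.rstrip("\n")
                if count == 0:
                    # The slot is empty: store the current element.
                    candidate, count = element, 1
                elif element == candidate:
                    count += 1
                else:
                    count -= 1
        return candidate

If a majority element is not guaranteed to exist, the returned candidate still has to be verified with a second counting pass.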

For general $k$, you keep $k-1$ elements and $k-1$ counts, and you initialize the elements to the first $k-1$ distinct elements of the file and the counts to the number of times each of these elements appears before you see the $k$-th distinct element. Then you run essentially the same procedure: a stored element's count is increased each time that element is encountered; when an element that is not stored is encountered, it kicks out a stored element whose count is zero (taking over its slot with a count of 1) if there is one, and otherwise all stored counts are decreased by one. This is the Misra-Gries algorithm.

You can of course use a hash table to index the $k-1$ stored elements. At termination, this algorithm is guaranteed to return every element that occurs in more than a $1/k$ fraction of the positions. This is essentially the best you can do with an algorithm that makes a constant number of passes over the file and stores only $O(k)$ words.
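A sketch of the general procedure in Python, again assuming one element per line and using a plain dict as the hash table of counters (function name illustrative):

    def misra_gries(path, k):
        """One pass keeping at most k-1 counters at any time."""
        counters = {}  # stored element -> count
        with open(path) as f:
            for line in f:
                element = line.rstrip("\n")
                if element in counters:
                    counters[element] += 1
                elif len(counters) < k - 1:
                    # A zero-count slot is free: the new element takes it.
                    counters[element] = 1
                else:
                    # No free slot: decrease every stored count by one and
                    # drop counts that reach zero, freeing their slots.
                    for key in list(counters):
                        counters[key] -= 1
                        if counters[key] == 0:
                            del counters[key]
        return counters

Every element occurring in more than a $1/k$ fraction of the positions is guaranteed to survive in the returned table, but the surviving counts are only lower bounds on the true frequencies and may include false positives.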

One final thing: after you have found the (at most $k-1$) candidate "heavy hitters" (i.e. frequent elements), you can make one more pass over the file to count the exact frequency of each candidate. This way you can rank the candidates against each other and verify whether all of them occur in more than a $1/k$ fraction of the positions (if there are fewer than $k-1$ such elements, some of the elements returned by the algorithm may be false positives).
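Using the candidates from the pass above, a possible verification pass might look like this (again one element per line; names illustrative):

    def verify_candidates(path, candidates, k):
        """Second pass: exact counts for the candidate heavy hitters."""
        exact = {c: 0 for c in candidates}
        total = 0
        with open(path) as f:
            for line in f:
                total += 1
                element = line.rstrip("\n")
                if element in exact:
                    exact[element] += 1
        # Keep only true heavy hitters and rank them by frequency.
        heavy = {c: n for c, n in exact.items() if n > total / k}
        return sorted(heavy.items(), key=lambda item: item[1], reverse=True)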

Sasho Nikolov

The obvious answer is of course to keep a hash map and store a counter of the occurrences of elements as you move through the file, as Nejc already suggested. In terms of time complexity, this is the optimal solution.
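For completeness, a minimal version of that in Python, streaming the file line by line so that only the counter table (one entry per distinct element) is held in memory (assuming one element per line):

    from collections import Counter

    def most_common_element(path):
        """One pass with a hash map: O(n) time, O(#distinct elements) space."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[line.rstrip("\n")] += 1
        element, count = counts.most_common(1)[0]
        return element, count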

If, however, your space requirements are tight, you can perform an external sort on the file and then find the longest consecutive run of equal elements. This has a constant memory footprint and can be done in $\Theta(n\log{n})$ time.
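Assuming the file has already been sorted externally (e.g. with the Unix sort utility), the second step is a single scan that only remembers the current run and the best run so far; a rough sketch:

    def most_common_in_sorted(path):
        """Scan a sorted file and return the element with the longest run."""
        best_element, best_count = None, 0
        current_element, current_count = None, 0
        with open(path) as f:
            for line in f:
                element = line.rstrip("\n")
                if element == current_element:
                    current_count += 1
                else:
                    current_element, current_count = element, 1
                if current_count > best_count:
                    best_element, best_count = current_element, current_count
        return best_element, best_count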

Jernej

If the most common element is more common than the next most common element by a substantial margin, and the number of distinct elements is small compared to the file size, you can randomly sample some elements and return the most common element in your sample.
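One way to draw such a sample in a single sequential pass with bounded memory is reservoir sampling; a rough sketch (the sample size is an arbitrary illustrative choice, and the answer is only correct with some probability):

    import random
    from collections import Counter

    def sampled_mode(path, sample_size=10_000):
        """Reservoir-sample lines from the file and return the sample's mode."""
        sample = []
        with open(path) as f:
            for i, line in enumerate(f):
                element = line.rstrip("\n")
                if len(sample) < sample_size:
                    sample.append(element)
                else:
                    # Keep the i-th element with probability sample_size/(i+1).
                    j = random.randint(0, i)
                    if j < sample_size:
                        sample[j] = element
        return Counter(sample).most_common(1)[0][0]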

adrianN