
Find the 10 most frequently occurring strings in a huge array of strings.

Since the array is huge, it cannot be loaded into memory all at once. My idea is to read the strings one by one and store them in a hash table, with the string as the key and its occurrence count as the value. But with many distinct strings, this would still take too much memory.
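For concreteness, the hash-table idea could look roughly like the following minimal Python sketch. The function name `top_k_exact` and the parameter `k` are illustrative only. It reads the strings as a stream, so the input itself never has to fit in memory, but the table still grows with the number of distinct strings.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_exact(strings: Iterable[str], k: int = 10) -> List[Tuple[str, int]]:
    """Exact top-k by counting every distinct string (hypothetical helper)."""
    counts: Counter = Counter()
    for s in strings:          # one pass; `strings` may be a lazy iterator
        counts[s] += 1         # memory grows with the number of distinct keys
    return counts.most_common(k)
```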

Is there a more memory-efficient solution, given that we only care about the top 10 keys?

Ran G.
learner

1 Answer


One optimized approach is the streaming algorithm by Charikar, Chen and Farach-Colton, known as CountSketch.

It uses space logarithmic in the size of your input and returns an approximation of the $k$ most frequent elements in that input. Here $k$ is an input parameter, and it also affects the amount of memory used.
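To give a feel for the idea, here is a simplified Python sketch in the spirit of CountSketch, not the exact construction from the paper. The names `CountSketch`, `approx_top_k`, `depth` and `width` are my own, and salted BLAKE2 hashes stand in for the pairwise-independent hash functions the analysis assumes. Each row hashes an item to one bucket and a random sign; the frequency estimate is the median over the rows, and a small candidate set tracks the $k$ items with the largest estimates seen so far.

```python
import hashlib
from statistics import median
from typing import Dict, Iterable, List, Tuple


class CountSketch:
    """Simplified sketch: `depth` rows of `width` signed counters (assumed names)."""

    def __init__(self, depth: int = 5, width: int = 1024) -> None:
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item: str, row: int) -> Tuple[int, int]:
        # Salting with the row index gives one hash function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1          # lowest bit chooses the sign
        return (value >> 1) % self.width, sign  # remaining bits choose the bucket

    def add(self, item: str) -> None:
        for row in range(self.depth):
            col, sign = self._bucket_and_sign(item, row)
            self.table[row][col] += sign

    def estimate(self, item: str) -> float:
        # Median of the per-row signed counters approximates the true count.
        per_row = []
        for row in range(self.depth):
            col, sign = self._bucket_and_sign(item, row)
            per_row.append(sign * self.table[row][col])
        return median(per_row)


def approx_top_k(
    strings: Iterable[str], k: int = 10, depth: int = 5, width: int = 1024
) -> List[Tuple[str, float]]:
    """One pass over the stream, keeping only k candidates plus the sketch."""
    sketch = CountSketch(depth, width)
    candidates: Dict[str, float] = {}
    for s in strings:
        sketch.add(s)
        est = sketch.estimate(s)
        if s in candidates or len(candidates) < k:
            candidates[s] = est
        else:
            weakest = min(candidates, key=candidates.get)
            if est > candidates[weakest]:
                del candidates[weakest]
                candidates[s] = est
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
```

Memory here is `depth * width` counters plus `k` candidate strings, independent of the number of distinct strings. The result is approximate: hash collisions can inflate or deflate estimates, so larger `width` and `depth` trade memory for accuracy.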

There are also extensions to sliding-window settings, and to other ways of defining which elements count as "frequent".

Ran G.