7

In one of the applications I work on, it is necessary to have a function like this:

bool IsInList(int iTest)
{
   // Return true if iTest appears in a set of numbers.
}

The number list is known at application start-up (though it is not always the same between two instances of the application) and will not change or be added to for the lifetime of the program. The integers themselves may be large and span a large range, so a vector<bool> is not efficient. Performance is an issue because the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.

P.S. Ideally the solution wouldn't be a third-party library, because I can't use one here. Something simple enough to be understood and implemented by hand would be great, if possible.

nakiya
  • I'm also trying to read about perfect hashing of strings to numbers, but can find only very little information :( – Johannes Schaub - litb Nov 19 '10 at 13:35
  • Are you sure O(log n) lookup (e.g. binary search on a sorted array) isn't fast enough? Remember hashing is O(k), where k isn't related to n (and the hash is thus O(1) in terms of n) but can still be larger than O(log n) for non-huge values of n (i.e. less than a million). – Fred Nurk Nov 19 '10 at 13:36
  • [Here](http://sux4j.dsi.unimi.it/) is a Java implementation of a minimal perfect hash finder. – Pointy Nov 19 '10 at 13:37
  • @Pointy: Interesting how that calls OO features of C++ bizarre, then two of the four reasons why it uses Java are OO. – Fred Nurk Nov 19 '10 at 13:42
  • @Fred Nurk yes, I didn't pay too much attention to the mini-rants; I just looked at the code, which seemed at least basically competent, plus he has references to some papers that look interesting – Pointy Nov 19 '10 at 13:43

9 Answers

3

I would suggest using Bloom Filters in conjunction with a simple std::map.

Unfortunately the Bloom filter is not part of the standard library, so you'll have to implement it yourself. However, it turns out to be quite a simple structure!

A Bloom filter is a data structure specialized for one question: is this element part of the set? It answers with an incredibly tight memory requirement, and quite fast too.

The slight catch is that the answer is... special. Is this element part of the set?

  • No
  • Maybe (with a given probability depending on the properties of the Bloom Filter)

This looks strange until you look at the implementation, and it may require some tuning (there are several parameters) to lower the probability, but...

What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.

As such, a Bloom filter is ideal as a doorman for a binary tree or a hash map. Carefully tuned, it will let only very few false positives pass. For example, gcc uses one.

Matthieu M.
  • Generally, a sorted `std::vector` has better performance than a `std::map` when the set will be built once and only referenced afterwards. The `std::vector` has terrific cache locality and `std::lower_bound` is just as fast as `std::map::find`. – deft_code Nov 27 '11 at 05:04
  • @deft_code: I fully agree (and actually generally use sorted arrays/vectors for constants myself). It's not only the cache locality too, there is also cache prefetching at work because the memory is contiguous whereas accesses in trees are much less predictable (for the CPU). – Matthieu M. Nov 27 '11 at 12:55
2

What comes to my mind is gperf. However, it works on strings, not on numbers. Still, part of the calculation can be tweaked to use numbers as input for the hash generator.

Diego Sevilla
2

integers, strings, doesn't matter

http://videolectures.net/mit6046jf05_leiserson_lec08/

After the intro, at 49:38, you'll learn how to do this. The dot-product hash function is demonstrated because it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time there; find something that is FAST for your datatype and that offers an adjustable SEED for hashing. A good combination there is better than the alternative of growing the hash table.

At 54:30 the professor draws a picture of a standard way of doing perfect hashing. Minimal perfect hashing is beyond this lecture. (Good luck!)

It really all depends on what you mod by.

Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.

With std::map you get very good performance in 99.9% of scenarios. If your hot spot sees the same iTest value(s) multiple times, combine the map lookup with a temporary hash cache.

Int is one of the datatypes where it is possible to just do:

bool hash[UINT_MAX]; // stackoverflow ;)

And fill it up. If you don't care about negative numbers, then it's twice as easy.

ticktock
1

A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.

The most obvious and easiest-to-implement solution for testing existence would be a sorted list or a balanced binary tree. Then you could decide existence in O(log N) time. I doubt it'll get much better than that.

Donnie
  • What you write is true, however it remains the case that a *minimal* perfect hash function would be useful, and if the cost of finding the function isn't too great then an *O*(1) solution would result. – Pointy Nov 19 '10 at 13:34
  • If you use the source integers as hash values, you have to use a very large array with possibly very many unused items, because he says that the range is large. – Johannes Schaub - litb Nov 19 '10 at 13:36
  • @Johannes Schaub that is why one would look for a *minimal* perfect hash, which is one that produces values in a range equal to the *number* of distinct set members, independent of the original range. – Pointy Nov 19 '10 at 13:44
  • @Pointy yes but that doesn't use source integers as hash values. My comment was to @Donnie – Johannes Schaub - litb Nov 19 '10 at 13:47
  • @Pointy - Assuming no domain knowledge (i.e., the integers are random), then the minimal perfect hash for a set of integers is the integers themselves. Given some set of conditions on the ints it may be possible to find a better one, but that wasn't specified. EDIT: Read your link from above, so I may be wrong. Going to leave this anyway until I see a proof that I'm wrong and that it's computationally tractable. – Donnie Nov 19 '10 at 13:52
  • @Donnie that's not actually true - the *minimal* hash maps the source set of integers into a minimal range of values. If there are *n* distinct values in the original set (spread over an arbitrary range), then a *minimal perfect hash* produces integers in the range [ 0 ... *n* ) – Pointy Nov 19 '10 at 13:55
  • Whether it's computationally sane to do this, I don't know. If we're talking about a long-running server process, it might be worth it. – Pointy Nov 19 '10 at 13:56
0

For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.

Wikipedia has example implementations that should be simple enough to translate to C++.

zildjohn01
0

It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets (i.e. a minimal perfect hash); the important thing is to identify an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary a fast-to-calculate hash algorithm (e.g. changing the prime numbers used) to see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out; instead you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% improvement, until you can't succeed at that ratio or you reach the no-point-being-better ratio.

Just so everyone's clear: this works for inputs known only at run time, with no useful patterns known beforehand, which is why different hash functions have to be trialled or actively derived at run time. It is not acceptable to simply say "the integer inputs form a hash", because there will be collisions when they are %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.

Tony Delroy
0

Original Question

After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in unique, i.e. perfect, hashing.

Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1. Generally it was pretty big.

I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.

In your case you are using integers. Technically, they are already hashed, but the range is large.

It may be that you can get what you want, simply by applying the modulus operator. It may be that you need to put a hash function in front of it first.

It may also be that you can't find a perfect hash for it, in which case your container class should have a fallback position... a binary search, or a map, or something like that, so that you can guarantee the container will work in all cases.

EvilTeach
0

A trie, or perhaps a van Emde Boas tree, might be a better bet for creating a space-efficient set of integers, with lookup time independent of the number of objects in the data structure, assuming that even a std::bitset would be too large.

0
  1. (Once, at startup:) Ensure the list of integers is sorted (and contiguous).
  2. (As needed:) Test membership using a binary search.

I don't think you'll do much better than this if the ranges of integers in the list are truly large and arbitrary, as you stated.

Glenn Slayden