
I'm using a random number generator to produce a huge string of random hexadecimal characters, which I then cache and pull from to generate base-10 integers within a requested range. The original (flawed) steps looked like this (a code sketch follows the list):

  1. Request a random base-10 integer within certain bounds
    e.g.: getRandomInt(200, 250), which has a range of 250 - 200 = 50
  2. Determine the minimum number of hex characters required to satisfy that range
    e.g.: for a range of 50, we need 2 hex chars (256 possible values, covering any range up to 256)
  3. Pull that many hex characters from the hexadecimal cache
    e.g.: "3A"
  4. Convert those hexadecimal characters to a base-10 integer
    e.g.: $\text{3A}_{16} = 58_{10}$
  5. Use a modulus function to ensure the resulting integer is within the desired range
    e.g.: 58 % 50 = 8
  6. Add this to the lower bound for the final result
    e.g.: 200 + 8 = 208
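
In code, the steps look roughly like this (the cache plumbing below is an illustrative sketch; only getRandomInt and the step logic come from the list above):

#include <cstdint>
#include <string>

// Illustrative stand-in for the cached string of random hex characters.
std::string hexCache = "3A7C09F2B14ED986";
size_t cachePos = 0;

int getRandomInt(int lower, int upper)
{
    int range = upper - lower;                 // Step 1: e.g. 250 - 200 = 50

    // Step 2: minimum number of hex chars n such that 16^n >= range.
    int nChars = 1;
    for (long long states = 16; states < range; states *= 16)
        ++nChars;

    // Steps 3-4: pull n hex chars from the cache, convert to base 10.
    int r = std::stoi(hexCache.substr(cachePos, nChars), nullptr, 16);
    cachePos += nChars;

    // Steps 5-6: modulus into the range, then shift by the lower bound.
    // This is the biased part whenever 16^n isn't a multiple of range.
    return lower + (r % range);
}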

I recently realized that this biases the results towards lower numbers for any range size that doesn't evenly divide the number of possible hex values. e.g.: if you request a number in the range [0,11], then $0_{16}$ becomes $0_{10}$, $\text{A}_{16}$ becomes $10_{10}$, and $\text{B}_{16}$ becomes $11_{10}$, but $\text{C}_{16}$ also becomes $0_{10}$ again, giving you a 2/16 chance of generating a 0 but only a 1/16 chance of generating a 10.
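
To quantify that example: one hex digit gives $16$ equally likely inputs mapped onto $12$ outputs, and since $16 = 1 \cdot 12 + 4$, the outputs $0$ through $3$ each occur with probability $2/16$ while $4$ through $11$ each occur with probability $1/16$.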

Potential Solution?

After chatting with ChatGPT and facing how little math I math, I've modified the above Step #3 to pull 1 extra hexadecimal character. So if you request a random base-10 integer between [200,250] (a range of 50, which is satisfied by 2 hex chars), you'll no longer pull 2 hex characters but 3.

This seems to solve the issue and produce evenly distributed results for all ranges I've tested, but I can't say for certain if or why it works. I can kinda make sense of it by relating it to a random floating-point number: my intuition tells me that using hex characters to generate a random number such as 0.123456780, multiplying it by 10, and then removing the decimal portion would produce a random integer in the range [0,9] free of any bias. I assume the same principle is at play here, the principle being something along the lines of: we can add some amount of excess to the end of our randomly generated number and trim it off to remove biases. But I don't know:

  1. if the implemented solution is actually removing biases
  2. if any of the above conjecture is true
  3. if it is true, whether it applies to the above 6 steps
  4. if it is true and applicable, how to determine the amount of excess which should be added to / trimmed from the end to ensure no biases are produced for a given range
  5. if there is a better solution completely different from my approach

1 Answer

How many random digits do we need?

  2. Determine the minimum number of hex characters required to satisfy that range
    e.g.: for a range of 50, we need 2 hex chars (256 possible values, covering any range up to 256)

The problem with this approach, as you have noted, is that even if your random hex digit generator gives perfectly independent and uniformly distributed digits, the output will be biased. With 256 hex sequences mapped to 50 different outputs, you'll get

  • 44 of 50 values with a probability of 5/256 (0.01953125).
  • 6 of 50 values with a probability of 6/256 (0.0234375).
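
In general, if $m$ equally likely hex sequences are mapped onto $n$ outputs with $m = qn + s$ (where $0 < s < n$), then $s$ outputs occur with probability $(q+1)/m$ and the remaining $n - s$ with probability $q/m$. Here $256 = 5 \cdot 50 + 6$, and indeed $44 \cdot 5 + 6 \cdot 6 = 256$.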

A simple workaround for this bias is to request more hex digits than you need. For example, I'll propose the rule that we determine the minimum number of hex digits and then add 2. So for a range of 50, we get 4 hex digits, with 65536 possible combinations. This would give us:

  • 14 of 50 values with a probability of 1310/65536 (0.019989013671875)
  • 36 of 50 values with a probability of 1311/65536 (0.0200042724609375)

This is still not perfect, but for most purposes the bias is now small enough not to be noticeable.

Rejecting the range bias

If you really want an exactly uniform distribution (each of the 50 outputs with probability exactly 0.02), you can make a slight modification to the algorithm (a code sketch follows the list):

  1. Let $r$ = the next 4 hex digits from your cache (interpreted as an integer between 0 and 65535).
  2. If $0 \le r < 65500$ (the cutoff being the highest exact multiple of 50 within range), then return $\lfloor r/1310 \rfloor$ as your random number.
  3. Otherwise ($r \ge 65500$), go back to Step 1 and try again.
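
Here's a minimal sketch of that loop. The next16Bits() helper is hypothetical; it stands in for pulling 4 hex digits from your cache (a seeded std::mt19937 substitutes for the real hex source just to keep the example self-contained):

#include <cstdint>
#include <random>

// Hypothetical stand-in for "the next 4 hex digits from your cache".
uint32_t next16Bits()
{
    static std::mt19937 gen(12345);
    return gen() & 0xFFFF;               // a value in [0, 65536)
}

// Uniform integer in [0, 50) via rejection sampling.
int randomBelow50()
{
    const uint32_t limit = 65500;        // highest multiple of 50 within range
    for (;;) {
        uint32_t r = next16Bits();
        if (r < limit)
            return static_cast<int>(r / 1310);   // 1310 = 65500 / 50
        // r >= limit: reject, and go back for 4 fresh hex digits.
    }
}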

Clarification (per discussion in comments): Yes, you could try this test-and-reject approach using the minimum 2 hex digits, without having to obtain extra digits. However, doing so would increase the rejection probability.

For example, if you retrieve 2 hex digits (with 256 possible states) and accept values in $[0, 250)$, you reject $\frac{6}{256} = 0.0234375$ of the random numbers. But with 4 hex digits, accepting values in $[0, 65500)$, you reject only $\frac{36}{65536} = 0.00054931640625$.
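
Put another way, the expected number of draws per output is $\frac{1}{1 - p_{\text{reject}}}$: about $256/250 \approx 1.024$ draws on average with 2 hex digits, versus $65536/65500 \approx 1.00055$ with 4.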

Determining the output

  5. Use a modulus function to ensure the resulting integer is within the desired range
    e.g.: 58 % 50 = 8

This rand() % n approach is popular among C programmers, but if you want a random number between 0 (inclusive) and $n$ (exclusive), it's better to use $\lfloor \frac{rn}{m} \rfloor$ (where $r \in [0, m)$ is the raw random value and $m$ is its number of possible states). That way, if your random number generator is biased, the bias will be spread evenly throughout the output range, instead of systematically favoring small numbers.
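
To illustrate (the function names are mine, not from the post): with $m = 256$ raw states and $n = 50$ outputs, both mappings carry the same total bias, but the modulus piles all six over-represented outputs onto 0-5, while the scaled mapping scatters them across the whole range:

#include <cstdint>
#include <iostream>

// Map a raw value r in [0, 256) onto [0, 50) two ways.
int viaModulus(uint32_t r) { return r % 50; }        // extras land on 0..5
int viaScaling(uint32_t r) { return r * 50 / 256; }  // extras spread out

int main()
{
    int modCount[50] = {0}, scaleCount[50] = {0};
    for (uint32_t r = 0; r < 256; ++r) {
        ++modCount[viaModulus(r)];
        ++scaleCount[viaScaling(r)];
    }
    // Print the outputs that received an extra (6th) preimage under each mapping.
    for (int k = 0; k < 50; ++k)
        if (modCount[k] == 6 || scaleCount[k] == 6)
            std::cout << k << ": mod=" << modCount[k]
                      << " scale=" << scaleCount[k] << "\n";
}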

Stretching the randomness

According to a comment from the asker, the hex digits are produced with a quantum random number generator (QRNG) that produces approximately 6800 hex digits per second. This might be adequate for your purposes, but if you need a lot of random numbers real fast, it might not be enough.

If this is an issue, then you can use a hybrid approach: instead of using the QRNG directly for all random numbers, use it to seed a faster PRNG (one that has been tested to meet your specific requirements for uniformity or cryptographic security) and use that for your output. Then periodically reseed it, to break any unwanted patterns that show up in the PRNG output. Something like:

int get_random(int range)
{
    static int counter = 0;

    // Reset the counter after MAX_PRNG_ITERATIONS calls.
    if (counter == MAX_PRNG_ITERATIONS)
    {
        counter = 0;
    }

    // (Re)seed the PRNG as needed, using the QRNG.
    if (counter == 0)
    {
        PRNG.seed(QRNG.get_digits(PRNG_STATE_DIGITS));
    }

    ++counter;

    return PRNG.get_random(range);
}

Dan
  • This is what I was afraid of: it sounds like you're saying that increasing the number of hex characters pulled doesn't necessarily remove the bias, it simply lowers it. If I'm not mistaken, your modification essentially boils down to: if you get a number outside the requested range, then try again. In which case we don't need to pull excess hex digits; we just need to run a loop that pulls the minimum number of hex digits until it finds a value in the requested range, no? – Shaun Mitchell Sep 12 '23 at 21:06
  • @ShaunMitchell: Sure. And in that case, you don't have to have the extra digits; you could just select a number from range(250) instead of range(65500). – Dan Sep 12 '23 at 21:09
  • Awesome! And to minimize waste, I think I'd go with something like a sliding window. e.g.: given the hex string "A14ED9...", if we need 2 chars and "A1" is too big, then it tries "14". I don't know if removing the "14" from the cache but leaving the "A" would cause any issues? It'd end up leaving higher numbers in the cache, which feels like potential for bias in later requests, so I'd lean towards just dropping the entire "A14" even though I'm only using the "14". – Shaun Mitchell Sep 12 '23 at 21:18
  • Sliding window will introduce correlations. Not recommended. – user619894 Sep 12 '23 at 21:23
  • @ShaunMitchell: It may very well be more efficient to pull excess hex digits anyway, because then the "rejection" probability is exponentially lowered with every extra digit. The exact optimum depends on the work of generating the digits versus the expected added work because of rejection. – Ron Kaminsky Sep 12 '23 at 21:27
  • @RonKaminsky: Indeed. If you want a number in range(50), then using 2 hex digits gives you a rejection probability of $6/256 = 0.0234375$, but 4 hex digits reduces that to $36/65536 = 0.00054931640625$. – Dan Sep 12 '23 at 21:29
  • @RonKaminsky: The hex digits are pulled from a QRNG API. It takes ~3 seconds to produce 20,408 hexadecimal characters, so it's not terribly expensive for normal use-cases, but it's a personal project with no deadline, so I'm taking my time aiming for all the optimums :) – Shaun Mitchell Sep 13 '23 at 01:57
  • @Dan: Ahhh, okay, I'm slightly embarrassed to admit that I just understood how that math works out, but awesome. I'll do some maths to figure out what approach leads to the least waste. Might end up calculating it on the fly depending on the range if some ranges end up statistically resulting in less waste with +1 or +2 hexadecimal characters (and assuming the CPU cycles to perform that calculation on the fly each time is worth saving a few hexadecimal characters each function call). Thank you so much for all your help! – Shaun Mitchell Sep 13 '23 at 01:59
  • @ShaunMitchell: Well, if you need a lot of random numbers real fast, one option is that instead of using the QRNG directly, you use it to seed a high-quality (CS)PRNG, and get your random integers from that. Then reseed after every few thousand calls to break any unwanted patterns in the PRNG. Of course, the output won't be truly random, but whether that's a problem depends on your specific use case. – Dan Sep 13 '23 at 15:27
  • For random enemy spawn points in an arcade game, it doesn't matter that much. But for a random private key for a master signing certificate for a credit card processing network, use as much true randomness as possible. Consider where your program lies on the speed-accuracy tradeoff. – Dan Sep 13 '23 at 15:31