initial setup
We train on a year's worth of sales data,
evaluate against test data spanning that year,
and deploy a model into production.
Then another month's worth of sales data becomes available,
which we want to incorporate into an improved model.
Suppose we want to train on 13/16ths of the dataset, i.e. 81.25%,
which roughly matches an 80% / 20% split.
To start, we compute the SHA3 hash of something that doesn't change,
like a unique row ID or perhaps simply the contents of the whole row.
That produces a big hash value, but we need only examine the first hex nybble.
If it is in {0, 1, 2} we assign the row to "test",
else we assign it to the "train" dataset.
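Here is a minimal sketch of that rule in Python, assuming the stable thing we hash is a row-ID string and using hashlib's sha3_256 as our "SHA3":

    import hashlib

    def assign_split(row_id: str) -> str:
        # First hex nybble of the SHA3 digest: the 3 values {0, 1, 2} out of 16
        # send a row to "test", giving a 13/16 train vs 3/16 test split.
        first_nybble = int(hashlib.sha3_256(row_id.encode("utf-8")).hexdigest()[0], 16)
        return "test" if first_nybble in {0, 1, 2} else "train"

    print(assign_split("12345"))  # same answer every single time we call it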
Since SHA3 is a cryptographically secure hash,
it resembles a random number generator.
Flip even one bit of its input, and roughly half
of its output bits will flip.
(An XOR of the two hashes will have roughly half its bits set.)
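Here's a quick way to see that avalanche effect; the two byte strings below are made-up inputs that differ in exactly one bit:

    import hashlib

    def flipped_bits(a: bytes, b: bytes) -> int:
        # XOR the two 256-bit digests and count how many bits differ.
        ha = int.from_bytes(hashlib.sha3_256(a).digest(), "big")
        hb = int.from_bytes(hashlib.sha3_256(b).digest(), "big")
        return bin(ha ^ hb).count("1")

    # ASCII '0' (0x30) and '1' (0x31) differ in a single bit, yet roughly
    # half of the 256 output bits flip.
    print(flipped_bits(b"row-12340", b"row-12341"))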
Re-hashing the same ID will always produce the identical hash value,
so this is very different from an approach to "random"
that involves rolling dice or flipping coins.
updates
A month goes by, and we have new sales data to incorporate, appended to existing data.
Simply repeat the process.
We don't need to remember how we assigned ID 12345 or ID 12346;
we just apply the hashing rule again.
We don't need to know whether a given row is new or whether we saw it a month ago.
The hash output for that row is stable, no matter when we compute the hash,
so we will deterministically send previously seen ("old") rows
to the very same destination, every time.
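A tiny demonstration of that stability, reusing the SHA3 rule sketched earlier (the IDs are made up):

    import hashlib

    def assign_split(row_id: str) -> str:
        # Same rule as before: first hex nybble in {0, 1, 2} means "test".
        nybble = int(hashlib.sha3_256(row_id.encode("utf-8")).hexdigest()[0], 16)
        return "test" if nybble in {0, 1, 2} else "train"

    old_rows = ["id-0001", "id-0002", "id-0003"]   # last year's sales
    new_rows = ["id-0004", "id-0005"]              # this month's additions

    before = {r: assign_split(r) for r in old_rows}
    after = {r: assign_split(r) for r in old_rows + new_rows}

    # Previously seen rows land in exactly the same partition as before.
    assert all(after[r] == before[r] for r in old_rows)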
The goal here is to prevent data leakage.
Evaluating a model against unseen "test" rows
only produces a meaningful result if those rows
didn't participate in the training process.
If a model peeks at the right answer, we call
that "cheating". That holds even if (in our example)
it takes a month for the answer to leak into the process:
re-flipping coins each time we make a "train" / "test" split
would let rows that helped train last month's model land in
this month's test set, so we would be evaluating on rows
the training process has already seen.
weaker hashes
You asked how "crc32(np.int64(identifier))" being less than "test_ratio * 2**32" determines whether an instance lands in the test set.
Let's move along the continuum of hash functions, from strong to weak.
We already considered a single nybble from SHA3, and declaring "test!"
if the nybble is in {0, 1, 2}. All the nybbles are as random
as one another, so it doesn't matter if we choose the first or the last.
With % 10 we are picking the last digit, and it makes more sense to say
the final digit of ID 12346 is "arbitrary" rather than "random",
since IDs will often be assigned sequentially.
In this setting a result in {0, 1}, e.g. 12000 % 10 == 0, would go into "test",
and anything else, e.g. 12346 % 10 == 6, would go into the "train" dataset.
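As a throwaway sketch (the function name is mine):

    def assign_split_mod10(row_id: int) -> str:
        # Final decimal digit in {0, 1} -> "test" (2 of 10 digits, i.e. 20%);
        # anything else -> "train" (80%).
        return "test" if row_id % 10 in {0, 1} else "train"

    print(assign_split_mod10(12000))  # 'test'  (12000 % 10 == 0)
    print(assign_split_mod10(12346))  # 'train' (12346 % 10 == 6)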
OK, let's return to that troublesome middle one, crc32.
Crc32 does its best to scramble the bits of its input.
As its name implies it emits a 32-bit result, so its range spans ~ four billion values.
It's not a strong enough hash to use in an adversarial setting,
but it's still useful, and it is very fast to compute.
When given "enough" input, it strives to map it to an arbitrary 32-bit output.
That is, if we call crc32() many times on many distinct inputs,
the resulting distribution will resemble a uniform random distribution
over binary bitstrings of length 32.
If you extract different sentences of dialog from your favorite novel
and compute crc32() of each sentence, you likely won't see a collision.
With fewer than ten thousand distinct sentences, fourteen bits would suffice
to give them unique serial numbers, and ten thousand items in a 32-bit hash space
stays comfortably under the birthday-paradox bound of roughly 2**16 ≈ 65,000 items.
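You can check that claim cheaply; the stand-in "sentences" below are synthetic, but the collision odds are the point:

    from zlib import crc32

    # Ten thousand distinct inputs in a 32-bit hash space: the birthday-paradox
    # odds of seeing even one collision are only about 1%.
    sentences = [f"line of dialog number {i}" for i in range(10_000)]
    hashes = {crc32(s.encode("utf-8")) for s in sentences}
    print(len(sentences) - len(hashes))  # almost always 0 collisions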
On the other hand, if you fed a trillion such strings into crc32(),
the pigeonhole principle says that collisions are a sure thing.
We would expect to see roughly comparable counts on each possible output value.
Not many of them with zero hits, and not many with double the average hits.
The result is similar to what you would get from a TRNG (true random number generator),
though we're producing it in a completely deterministic way.
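A small sanity check of that uniformity, bucketing a million synthetic row keys by the top hex nybble of their crc32 value:

    from collections import Counter
    from zlib import crc32

    # With a uniform spread, each of the 16 buckets should land near
    # 1_000_000 / 16 = 62,500 hits.
    counts = Counter(crc32(f"row-{i}".encode("utf-8")) >> 28 for i in range(1_000_000))
    print(sorted(counts.values()))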
We wish to create an 80% / 20% partition of that output space.
Values range from zero to about four billion;
20% of that is a threshold of about 860 million (0.20 × 2**32 ≈ 858,993,459).
So compute the crc32 hash of each observed example row.
If the hash is below the threshold, call it "test"; if above, call it "train".
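Putting that together, here is a minimal sketch of the threshold rule, close in spirit to the code you quoted; the helper names and the "sale_id" column are mine, and data is assumed to be a pandas DataFrame with a stable integer ID column:

    from zlib import crc32

    import numpy as np
    import pandas as pd

    def is_in_test_set(identifier, test_ratio):
        # crc32 spreads identifiers roughly uniformly over [0, 2**32), so falling
        # below test_ratio * 2**32 happens for roughly test_ratio of all rows,
        # and always for the same rows given the same IDs.
        return crc32(np.int64(identifier)) < test_ratio * 2**32

    def split_train_test_by_id(data, test_ratio, id_column):
        in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
        return data.loc[~in_test], data.loc[in_test]

    # Example: an 80% / 20% split keyed on a stable "sale_id" column.
    sales = pd.DataFrame({"sale_id": range(1_000), "amount": np.random.rand(1_000)})
    train_set, test_set = split_train_test_by_id(sales, 0.20, "sale_id")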
A month goes by and additional sales data has become available.
Compute crc32 hash of everything, old and new,
and use that same threshold rule to put each row in "train" or "test".
Train a model, evaluate, deploy.
It turns out the uncited work you quoted, code and prose, was Aurélien Géron's
Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd edition),
a.k.a. HOML3, page 56.