initial setup
We train on a year's worth of sales data,
evaluate against test data spanning that year,
and deploy a model into production.
Then another month's worth of sales data becomes available,
which we want to incorporate into an improved model.
Suppose we want to train on 13/16ths of the dataset, i.e. 81.25%,
which roughly matches an 80% / 20% split.
To start, we compute the SHA3 hash of something that doesn't change,
like a unique row ID or perhaps simply the contents of the whole row.
That produces a big hash value, but we need only examine the first hex nybble.
If it is in {0, 1, 2} we assign the row to "test",
else we assign it to the "train" dataset.
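Here is a minimal sketch of that rule in Python, assuming the stable thing we hash is a row-ID string and using hashlib's sha3_256 as our "SHA3":

    import hashlib

    def assign_split(row_id: str) -> str:
        # First hex nybble of the SHA3 digest: the 3 values {0, 1, 2} out of 16
        # send a row to "test", giving a 13/16 train vs 3/16 test split.
        first_nybble = int(hashlib.sha3_256(row_id.encode("utf-8")).hexdigest()[0], 16)
        return "test" if first_nybble in {0, 1, 2} else "train"

    print(assign_split("12345"))  # same answer every single time we call it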
Since SHA3 is a cryptographically secure hash,
it resembles a random number generator.
Flip even one bit of its input, and roughly half
of its output bits will flip.
(An XOR of the two hashes will have roughly half its bits set.)
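Here's a quick way to see that avalanche effect; the two byte strings below are made-up inputs that differ in exactly one bit:

    import hashlib

    def flipped_bits(a: bytes, b: bytes) -> int:
        # XOR the two 256-bit digests and count how many bits differ.
        ha = int.from_bytes(hashlib.sha3_256(a).digest(), "big")
        hb = int.from_bytes(hashlib.sha3_256(b).digest(), "big")
        return bin(ha ^ hb).count("1")

    # ASCII '0' (0x30) and '1' (0x31) differ in a single bit, yet roughly
    # half of the 256 output bits flip.
    print(flipped_bits(b"row-12340", b"row-12341"))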
Re-hashing the same ID will always produce the identical hash value,
so this is very different from an approach to "random"
that involves rolling dice or flipping coins.
updates
A month goes by, and we have new sales data to incorporate, appended to existing data.
Simply repeat the process.
We don't need to remember how we assigned ID 12345 or ID 12346;
we just apply the hashing rule again.
We don't need to know whether a given row is new or whether we saw it a month ago.
The hash output for that row is stable, no matter when we compute the hash,
so we will deterministically send previously seen ("old") rows
to the very same destination, every time.
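A tiny demonstration of that stability, reusing the SHA3 rule sketched earlier (the IDs are made up):

    import hashlib

    def assign_split(row_id: str) -> str:
        # Same rule as before: first hex nybble in {0, 1, 2} means "test".
        nybble = int(hashlib.sha3_256(row_id.encode("utf-8")).hexdigest()[0], 16)
        return "test" if nybble in {0, 1, 2} else "train"

    old_rows = ["id-0001", "id-0002", "id-0003"]   # last year's sales
    new_rows = ["id-0004", "id-0005"]              # this month's additions

    before = {r: assign_split(r) for r in old_rows}
    after = {r: assign_split(r) for r in old_rows + new_rows}

    # Previously seen rows land in exactly the same partition as before.
    assert all(after[r] == before[r] for r in old_rows)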
The goal here is to prevent data leakage.
Evaluating a model against unseen "test" rows
only produces a meaningful result if those rows
didn't participate in the training process.
If a model peeks at the right answer, we call
that "cheating". That holds even if (in our example)
it takes a month for the answer to leak into the process:
re-flipping coins each time we make a "train" / "test" split
would let rows that helped train last month's model land in
this month's test set, so we would be evaluating on rows
the training process has already seen.
weaker hashes
You asked how "crc32(np.int64(identifier))" being less than "test_ratio * 2**32" determines whether an instance lands in the test set.
Let's move along the continuum of hash functions, from strong to weak.
We already considered a single nybble from SHA3, and declaring "test!"
if the nybble is in {0, 1, 2}. All the nybbles are as random
as one another, so it doesn't matter if we choose the first or the last.
With % 10 we are picking the last digit, and it makes more sense to say
the final digit of ID 12346 is "arbitrary" rather than "random",
since IDs will often be assigned sequentially.
In this setting a result in {0, 1}, e.g. 12000 % 10 == 0, would go into "test",
and anything else, e.g. 12346 % 10 == 6, would go into the "train" dataset.
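As a throwaway sketch (the function name is mine):

    def assign_split_mod10(row_id: int) -> str:
        # Final decimal digit in {0, 1} -> "test" (2 of 10 digits, i.e. 20%);
        # anything else -> "train" (80%).
        return "test" if row_id % 10 in {0, 1} else "train"

    print(assign_split_mod10(12000))  # 'test'  (12000 % 10 == 0)
    print(assign_split_mod10(12346))  # 'train' (12346 % 10 == 6)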
OK, let's return to that troublesome middle one, crc32.
Crc32 does its best to scramble the bits of its input.
As its name implies it emits a 32-bit result, so its range spans ~ four billion values.
It's not a strong enough hash to use in an adversarial setting,
but it's still useful, and it is very fast to compute.
When given "enough" input, it strives to map it to an arbitrary 32-bit output.
That is, if we call crc32() many times on many distinct inputs,
the resulting distribution will resemble a uniform random distribution
over binary bitstrings of length 32.
If you extract different sentences of dialog from your favorite novel
and compute crc32() of each sentence, you likely won't see a collision.
With fewer than ten thousand distinct sentences, fourteen bits would suffice
to give them unique serial numbers, and ten thousand items in a 32-bit hash space
stays comfortably under the birthday-paradox bound of roughly 2**16 ≈ 65,000 items.
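You can check that claim cheaply; the stand-in "sentences" below are synthetic, but the collision odds are the point:

    from zlib import crc32

    # Ten thousand distinct inputs in a 32-bit hash space: the birthday-paradox
    # odds of seeing even one collision are only about 1%.
    sentences = [f"line of dialog number {i}" for i in range(10_000)]
    hashes = {crc32(s.encode("utf-8")) for s in sentences}
    print(len(sentences) - len(hashes))  # almost always 0 collisions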
On the other hand, if you fed a trillion such strings into crc32(),
the pigeonhole principle says that collisions are a sure thing.
We would expect to see roughly comparable counts on each possible output value.
Not many of them with zero hits, and not many with double the average hits.
The result is similar to what you would get from a TRNG (true random number generator),
though we're producing it in a completely deterministic way.
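A small sanity check of that uniformity, bucketing a million synthetic row keys by the top hex nybble of their crc32 value:

    from collections import Counter
    from zlib import crc32

    # With a uniform spread, each of the 16 buckets should land near
    # 1_000_000 / 16 = 62,500 hits.
    counts = Counter(crc32(f"row-{i}".encode("utf-8")) >> 28 for i in range(1_000_000))
    print(sorted(counts.values()))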
We wish to create an 80% / 20% partition of that output space.
Values range from zero to about four billion;
20% of that is a threshold of about 860 million (0.20 × 2**32 ≈ 858,993,459).
So compute the crc32 hash of each observed example row.
If the hash is below the threshold, call it "test"; if above, call it "train".
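Putting that together, here is a minimal sketch of the threshold rule, close in spirit to the code you quoted; the helper names and the "sale_id" column are mine, and data is assumed to be a pandas DataFrame with a stable integer ID column:

    from zlib import crc32

    import numpy as np
    import pandas as pd

    def is_in_test_set(identifier, test_ratio):
        # crc32 spreads identifiers roughly uniformly over [0, 2**32), so falling
        # below test_ratio * 2**32 happens for roughly test_ratio of all rows,
        # and always for the same rows given the same IDs.
        return crc32(np.int64(identifier)) < test_ratio * 2**32

    def split_train_test_by_id(data, test_ratio, id_column):
        in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
        return data.loc[~in_test], data.loc[in_test]

    # Example: an 80% / 20% split keyed on a stable "sale_id" column.
    sales = pd.DataFrame({"sale_id": range(1_000), "amount": np.random.rand(1_000)})
    train_set, test_set = split_train_test_by_id(sales, 0.20, "sale_id")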
A month goes by and additional sales data has become available.
Compute crc32 hash of everything, old and new,
and use that same threshold rule to put each row in "train" or "test".
Train a model, evaluate, deploy.
It turns out the uncited work you quoted, code and prose, was Aurélien Géron's
Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd edition),
a.k.a. HOML3, page 56.