9

If I'm hashing alphanumeric strings (chars in the set 0-9, a-z, and A-Z) with a CRC-64 hashing function, how long can my strings be while guaranteeing no hash collisions?

Stated differently: If I have a set of all CRC-64 checksums computed from alphanumeric strings $0$ to $N$ bytes, how large can $N$ be before I see a collision in the set?

Paŭlo Ebermann
brandx

4 Answers

8

If you have 62 characters ($10+26+26$), you can encode each one as a 6-bit number (approximately). A CRC is guaranteed to be a unique mapping (an injective function) as long as the input is no longer than the output; you can have at most 10 letters, but not 11, since $62^{10} < 2^{64} < 62^{11}$.

The same goes for most other hash functions. Let's say that you have a hash function that takes only 64-bit inputs, produces a 64-bit output, and is bijective (like a symmetric cipher). That means you will not have a collision as long as your input is no longer than 64 bits.

Once all 64-bit inputs have been mapped, you can be 100% sure that each new input will produce a collision (because no output value is left unused).
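The counting argument can be checked with a few lines of Python. This is only a sketch of the pigeonhole upper bound; the function name `max_symbols_by_counting` is made up for illustration, and whether the bound is actually reached depends on how the characters are encoded into bits before the CRC is applied.

```python
ALPHABET_SIZE = 10 + 26 + 26  # digits, lowercase and uppercase letters
CRC_BITS = 64


def max_symbols_by_counting(alphabet_size: int, bits: int) -> int:
    """Largest n such that alphabet_size**n <= 2**bits (pigeonhole bound)."""
    n = 0
    while alphabet_size ** (n + 1) <= 2 ** bits:
        n += 1
    return n


# 62^10 fits below 2^64, 62^11 does not.
assert 62 ** 10 < 2 ** 64 < 62 ** 11
print(max_symbols_by_counting(ALPHABET_SIZE, CRC_BITS))  # 10
```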

Paŭlo Ebermann
ralu
6

I have a different take on ralu's accepted answer and some of the comments thereafter.

Consider two $N$-bit data sequences which we think of as polynomials $$D^{(1)}(x) = \sum_{i=0}^{N-1} D_i^{(1)}x^i ~~\text{and}~~D^{(2)}(x) = \sum_{i=0}^{N-1} D_i^{(2)}x^i$$ where each $D_i^{(1)}$ and $D_i^{(2)}$ is $0$ or $1$. Let $M(x)$ of degree $64$ denote the CRC polynomial. Actual CRC implementations for data communications have many bells and whistles, but let us assume that for hashing purposes the simplest form of CRC is used, so that the CRC check sums (or hashes) $R^{(1)}(x)$ and $R^{(2)}(x)$ of degree $63$ or less (and thus having $64$ bits) are the remainders obtained by dividing $x^{64}D^{(1)}(x)$ and $x^{64}D^{(2)}(x)$ by $M(x)$. Remember that this is polynomial division over the binary field $\{0,1\}$ where addition (and subtraction) is the Exclusive-OR operation $\oplus$. We thus have $$\begin{align*} x^{64}D^{(1)}(x) &= Q^{(1)}(x)M(x) \oplus R^{(1)}(x)\\ x^{64}D^{(2)}(x) &= Q^{(2)}(x)M(x) \oplus R^{(2)}(x) \end{align*}$$ where $Q^{(1)}(x)$ and $Q^{(2)}(x)$ are the quotients. Adding these two equations, we have that $$x^{64}\left[D^{(1)}(x)\oplus D^{(2)}(x)\right] = \left[Q^{(1)}(x) \oplus Q^{(2)}(x)\right]M(x) \oplus \left[R^{(1)}(x) \oplus R^{(2)}(x)\right] $$ It follows that if $R^{(1)}(x) = R^{(2)}(x)$ so that $R^{(1)}(x) \oplus R^{(2)}(x) = 0$, then $x^{64}\left[D^{(1)}(x)\oplus D^{(2)}(x)\right]$ is a multiple of $M(x)$; since $M(x)$ has a nonzero constant term (as CRC polynomials do) and so shares no factor with $x^{64}$, it must be that $D^{(1)}(x)\oplus D^{(2)}(x)$ is a multiple of $M(x)$. Conversely, if $D^{(1)}(x)\oplus D^{(2)}(x)$ is a multiple of $M(x)$, then so is $$x^{64}\left[D^{(1)}(x)\oplus D^{(2)}(x)\right] \oplus \left[Q^{(1)}(x) \oplus Q^{(2)}(x)\right]M(x) = \left[R^{(1)}(x) \oplus R^{(2)}(x)\right]$$ a multiple of $M(x)$, and therefore $R^{(1)}(x) \oplus R^{(2)}(x)$ of degree $63$ or less is a multiple of $M(x)$ of degree $64$. Since this can happen only if $R^{(1)}(x) \oplus R^{(2)}(x) = 0$, that is, $R^{(1)}(x) = R^{(2)}(x)$, we have the following.

$D^{(1)}(x)$ and $D^{(2)}(x)$ hash to the same check sum, that is, $R^{(1)}(x) = R^{(2)}(x)$, if and only if $D^{(1)}(x)$ and $D^{(2)}(x)$ differ by a multiple of $M(x)$

This result holds even if $D^{(1)}(x)$ and $D^{(2)}(x)$ are of different degrees if we zero-pad the shorter sequence with zeroes at the high-order end to make the sequences of equal length. But if the afore-mentioned bells and whistles are included (e.g. complement the high-order two bytes before commencing CRC calculations), then the result still holds for equal length data sequences, but should not be applied blindly when $D^{(1)}(x)$ and $D^{(2)}(x)$ are of different degrees: some care is necessary.

For the simple case considered here, two distinct $N$-bit sequences can collide only if their nonzero difference is a multiple of $M(x)$, which requires $$N-1 \geq \deg\left[D^{(1)}(x)\oplus D^{(2)}(x)\right] \geq \deg M(x) = 64$$ and so if $N \leq 64$, we are guaranteed that no two sequences hash to the same checksum.
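The criterion can be tried out numerically. The following is a minimal sketch of the simplified CRC described above (zero initial value, no final complement), with polynomials stored as Python integers whose bit $i$ is the coefficient of $x^i$. The helper names `poly_mod` and `simple_crc`, the sample data value, and the toy degree-8 polynomial are assumptions chosen for illustration.

```python
def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x) over GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        # Cancel the leading term of a(x) with a shifted copy of m(x).
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def simple_crc(d: int, m: int) -> int:
    """Plain CRC from the text: remainder of x^deg(m) * d(x) modulo m(x)."""
    return poly_mod(d << (m.bit_length() - 1), m)


M64 = (1 << 64) | 0x1B  # x^64 + x^4 + x^3 + x + 1 (CRC-64-ISO)
M8 = 0x107              # x^8 + x^2 + x + 1, a toy degree-8 stand-in

d1 = 0x0123456789ABCDEF  # arbitrary 64-bit data sequence
d2 = d1 ^ M64            # differs from d1 by exactly M(x); needs 65 bits
d3 = d1 ^ 1              # differs from d1 by a non-multiple of M(x)

assert simple_crc(d1, M64) == simple_crc(d2, M64)
assert simple_crc(d1, M64) != simple_crc(d3, M64)

# Exhaustive check on the toy polynomial: all 2^8 inputs of 8 bits hash
# differently, while the 2^9 inputs of 9 bits share only 2^8 remainders.
assert len({simple_crc(d, M8) for d in range(1 << 8)}) == 1 << 8
assert len({simple_crc(d, M8) for d in range(1 << 9)}) == 1 << 8
```

The exhaustive check is done on a degree-8 polynomial only because enumerating all $2^{64}$ inputs is impractical; the argument is the same for degree $64$.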


Turning to further specifics and ralu's answer, each alphanumeric symbol can have one of $62$ different values. Because $62^{11} > 2^{64}$, the $62^{11}$ strings of $11$ such symbols cannot all be mapped to different bit sequences of length $64$ or less, whereas $10$ symbols fit comfortably ($62^{10} < 2^{60}$). The most convenient implementation is a symbol-by-symbol mapping into $6$-bit bytes, which turns $10$ symbols into $60$ bits and $11$ symbols into a degree-$65$ data sequence $D(x)$ of $66$ bits to be hashed. In the latter case, the four sequences $D(x), D(x)\oplus M(x), D(x) \oplus xM(x)$, and $D(x)\oplus (1\oplus x)M(x)$ have the same hash, so we have to restrict ourselves to $10$ alphanumeric symbols to avoid collisions. A denser packing than $6$ bits per symbol is messier to implement and, by the counting argument just given, still cannot accommodate $11$ symbols.
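As an illustration of the symbol-by-symbol $6$-bit mapping, here is a sketch that packs $11$ alphanumeric symbols into $66$ bits and exhibits one colliding pair. The symbol order (digits, then lowercase, then uppercase), the helper names, and the CRC-64-ISO polynomial are assumptions made for the example; a different symbol order would give different colliding strings.

```python
import string

# Assumed symbol order: '0'..'9' -> 0..9, 'a'..'z' -> 10..35, 'A'..'Z' -> 36..61.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
M64 = (1 << 64) | 0x1B  # x^64 + x^4 + x^3 + x + 1 (CRC-64-ISO)


def pack6(s: str) -> int:
    """Concatenate the 6-bit codes of the symbols, first symbol at the high end."""
    value = 0
    for ch in s:
        value = (value << 6) | ALPHABET.index(ch)
    return value


def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x) over GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def simple_crc(d: int) -> int:
    """Plain CRC: remainder of x^64 * d(x) modulo M64."""
    return poly_mod(d << 64, M64)


# '0' -> 0 and 'g' -> 16 differ in the bit worth x^64; '0' -> 0 and 'r' -> 27
# differ by 0b011011 = x^4 + x^3 + x + 1, so the two packed values differ by M(x).
a = pack6("0xxxxxxxxx0")
b = pack6("gxxxxxxxxxr")
assert a ^ b == M64
assert simple_crc(a) == simple_crc(b)

# Ten symbols occupy only 60 bits, so two distinct strings can never differ
# by a nonzero multiple of the degree-64 polynomial.
assert pack6("Z" * 10).bit_length() <= 60
```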

An even simpler method is to use the password as entered by the user (say as a sequence of ASCII-encoded $8$-bit bytes) to create the data sequence by concatenation. Now, $8$ symbols guarantee no collisions as per the simplified analysis above, but the actual picture is somewhat different. With $9$ bytes and $72$ bits, collisions can occur, but it is not immediately obvious that they will occur. For example, $D(x)\oplus M(x)$ might well be a sequence of bytes that cannot be entered by the user as a password because some of the ASCII characters are control characters that cause the computer to take other actions than to simply pass the character on to the application to be processed.

I doubt there is a simple answer to the question of what is the maximum password length for which collisions are guaranteed not to occur. The answer depends on the choice of $M(x)$ also. For example, Wikipedia's page on CRCs says that CRC-64-ISO $x^{64}+x^4+x^3+x+1$ is weak for hashing purposes, the basis of which claim the diligent reader of the above will have no difficulty understanding.

Dilip Sarwate
5

I will assume that the question is "If we take the CRC-64 function, and consider inputs that consist only of the ASCII characters in the specified range, what are the longest inputs we can have without having a collision?". The other answers assumed some mapping between the string and the CRC-64 function (and tried to answer 'what sort of mapping would be best'); I'll assume that there is no such mapping.

Well, the full answer depends on the CRC polynomial. If we assume the CRC-64-ISO polynomial of $x^{64} + x^4 + x^3 + x + 1$, then I believe that the longest is 8.

We know that it is at least 8, because (as mentioned earlier) CRCs are guaranteed to have an output differential if the input differential is no longer than the CRC.

However, if we allow inputs of length 9, then we can find collisions.

If the CRC-64 processes MSBits first, then:

CRC64(BxxxxxxxP) == CRC64(CxxxxxxxK)

This is because $B \oplus C = 0x42 \oplus 0x43 = 0x01$, which corresponds to the polynomial $x^{64}$ and $P \oplus K = 0x50 \oplus 0x4B = 0x1B$ which corresponds to the polynomial $x^{4} + x^{3} + x + 1$, and hence the bit-wise difference between the two inputs is exactly a multiple of the CRC polynomial (in fact, is precisely the CRC polynomial).

If the CRC-64 processes LSBits first, then:

CRC64(AxxxxxxxP) == CRC64(QxxxxxxxK)

This is because $A \oplus Q = 0x41 \oplus 0x51 = 0x10$ which corresponds to the polynomial $x^{67}$ and $P \oplus K = 0x50 \oplus 0x4B = 0x1B$ which corresponds to the polynomial $x^7 + x^6 + x^4 + x^3$ (this differs from the first example because we're counting bits from the other side of the byte), and hence the bit-wise difference between the two inputs is again an exact multiple of the CRC polynomial (in this case, $x^3$ times the CRC polynomial).
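Both collisions can be reproduced with a bare-bones bitwise CRC. The sketch below assumes the plain form of the CRC (zero initial value, no final XOR) with the CRC-64-ISO polynomial; the function names are illustrative and do not belong to any particular library. Library implementations that add an initial value or a final XOR produce different checksum values, but for equal-length inputs those extras cancel out of the difference.

```python
MASK64 = (1 << 64) - 1
POLY = 0x1B                            # x^4 + x^3 + x + 1 (the x^64 term is implicit)
POLY_REFLECTED = 0xD800000000000000    # the same polynomial with its 64 bits reversed


def crc64_msb_first(data: bytes) -> int:
    """Plain CRC-64, processing the most significant bit of each byte first."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) & MASK64) ^ POLY
            else:
                crc = (crc << 1) & MASK64
    return crc


def crc64_lsb_first(data: bytes) -> int:
    """Plain CRC-64, processing the least significant bit of each byte first."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:
                crc >>= 1
    return crc


# The two 9-character collisions from the text.
assert crc64_msb_first(b"BxxxxxxxP") == crc64_msb_first(b"CxxxxxxxK")
assert crc64_lsb_first(b"AxxxxxxxP") == crc64_lsb_first(b"QxxxxxxxK")

# Each pair collides only under its own bit order.
assert crc64_msb_first(b"AxxxxxxxP") != crc64_msb_first(b"QxxxxxxxK")
```

The seven `x` characters in the middle are arbitrary; any seven bytes work as long as they are the same in both strings.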

poncho
-1

This is a classical birthday paradox problem, and the linked Wikipedia article has a nice set of approximations that help you evaluate the chance of collisions. Of course, this applies only to a random choice of strings. It's not possible to guarantee against collisions even when only two strings are involved: there will be some very low probability of a collision even for two strings.
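For reference, the usual birthday approximation $p \approx 1 - e^{-n(n-1)/(2\cdot 2^{64})}$ can be evaluated as follows. This is a sketch that treats the $64$-bit checksums of $n$ strings as independent uniformly random values, which is an assumption about randomly chosen inputs and not a property of short alphanumeric strings under a CRC; the function name is made up for the example.

```python
import math


def birthday_collision_probability(n: int, bits: int = 64) -> float:
    """Approximate chance of at least one collision among n uniformly random values."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-n * (n - 1) / (2 * 2 ** bits))


for n in (2, 10 ** 6, 5_060_000_000):
    print(n, birthday_collision_probability(n))
```

Under this model the probability reaches roughly one half at about $5 \times 10^9$ random strings.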

sharptooth