
Definitions:

Let $h$ be a hash function with output size $n$ bytes. Suppose the file $F$ can be divided into chunks of size $n$ bytes $F=f_0+f_1+\dots +f_i$ where the operator "$+$" stands for concatenation (in real situations it is always possible to pad the file).

Is it safe, and in particular collision resistant, to use the following hash function $H$?

$$ H(F)=h(f_0)\oplus h(f_1) \oplus \dots \oplus h(f_i) $$

Motivation:

Suppose we have two files $F$ and $G$, then:

$$ H(F+G)=H(F) \oplus H(G). $$

In other words, this hash function $H$ is a homomorphic hash. It would be great if $H$ inherited all the security properties of $h$.
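The construction and the homomorphic property can be sketched in a few lines. This is only an illustration under assumptions not in the question: SHA-256 stands in for $h$ (so $n = 32$), the helper names `h`, `xor`, and `H` are made up for the example, and inputs are required to be a whole number of chunks rather than padded. Note that the identity only holds when $F$ ends on a chunk boundary.

```python
import hashlib

N = 32  # chunk size in bytes; assumed equal to the SHA-256 output size


def h(chunk: bytes) -> bytes:
    """Stand-in for the underlying hash h (assumption: SHA-256)."""
    return hashlib.sha256(chunk).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def H(data: bytes) -> bytes:
    """XOR of the hashes of the N-byte chunks of data (no padding)."""
    if len(data) % N != 0:
        raise ValueError("input must be a whole number of N-byte chunks")
    acc = bytes(N)  # all-zero digest, the identity for XOR
    for i in range(0, len(data), N):
        acc = xor(acc, h(data[i:i + N]))
    return acc


F = bytes(range(64))       # two distinct chunks
G = bytes(range(64, 160))  # three distinct chunks

# Homomorphic property: H(F + G) == H(F) xor H(G)
print(H(F + G) == xor(H(F), H(G)))
```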

D.W.
Rafael

4 Answers


I may be missing something, but it seems to me like your construction is trivially not collision resistant.

In particular, for arbitrary blocks $x_0 \neq x_1$, each of size $n$ bytes, we have

$$ H(x_0 + x_1) = h(x_0) \oplus h(x_1) = h(x_1) \oplus h(x_0) = H(x_1 + x_0) $$

thus producing a collision.
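A quick way to see this concretely, as a sketch only: SHA-256 is assumed as a stand-in for $h$ (so $n = 32$), and the names `h` and `H` are chosen for the example. Because XOR is commutative, any reordering of the chunks collides, and because every value is its own inverse under XOR, a repeated chunk cancels out entirely.

```python
import hashlib

N = 32  # chunk size in bytes; assumed equal to the SHA-256 output size


def h(chunk: bytes) -> int:
    """Stand-in for h (assumption: SHA-256), returned as an integer."""
    return int.from_bytes(hashlib.sha256(chunk).digest(), "big")


def H(data: bytes) -> int:
    """XOR of the hashes of the N-byte chunks of data (no padding)."""
    if len(data) % N != 0:
        raise ValueError("input must be a whole number of N-byte chunks")
    acc = 0
    for i in range(0, len(data), N):
        acc ^= h(data[i:i + N])
    return acc


x0 = b"A" * N
x1 = b"B" * N

# Two different messages, same digest: swapping chunks is a collision.
print(x0 + x1 != x1 + x0, H(x0 + x1) == H(x1 + x0))

# A chunk that appears twice contributes nothing to the digest.
print(H(x0 + x0) == 0, H(x0 + x0 + x1) == H(x1))
```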

Treeston

This hash function $H$ can be broken (collisions and second preimages can be found) in polynomial time. Here I am assuming that the number of "chunks" (the number of terms in the XOR) is variable, so in particular it can equal the number of output bits of the hash. If instead the number of "chunks" is fixed, then the other answer here describes the best known attack, which still does not leave much hope for this construction.

See this answer and also the original source, a paper by Bellare & Micciancio (see appendix A).
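To make the polynomial-time claim tangible, here is a sketch of a linear-algebra attack of this kind. It is not taken from the paper; it is an illustration under assumptions of its own: SHA-256 stands in for $h$, the helpers `block`, `solve_subset`, and `forge` are hypothetical names, and each chunk position is given two candidate blocks. Choosing one candidate per position turns "hit a target digest" into a linear system over GF(2), which Gaussian elimination solves once there are somewhat more positions than output bits.

```python
import hashlib

N = 32  # chunk size in bytes; assumed equal to the SHA-256 output size


def h(chunk: bytes) -> int:
    """Stand-in for h (assumption: SHA-256), returned as an integer."""
    return int.from_bytes(hashlib.sha256(chunk).digest(), "big")


def H(data: bytes) -> int:
    """XOR of the hashes of the N-byte chunks of data (no padding)."""
    acc = 0
    for i in range(0, len(data), N):
        acc ^= h(data[i:i + N])
    return acc


def block(i: int, bit: int) -> bytes:
    """One of two candidate chunks for position i (hypothetical encoding)."""
    return i.to_bytes(N - 1, "big") + bytes([bit])


def solve_subset(deltas, target):
    """Return a bitmask of indices whose deltas XOR to target, or None."""
    basis = {}  # leading bit -> (reduced vector, mask of deltas used)
    for i, v in enumerate(deltas):
        m = 1 << i
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = (v, m)
                break
            bv, bm = basis[top]
            v, m = v ^ bv, m ^ bm
    v, mask = target, 0
    while v:
        top = v.bit_length() - 1
        if top not in basis:
            return None  # target is outside the span of the deltas
        bv, bm = basis[top]
        v, mask = v ^ bv, mask ^ bm
    return mask


def forge(target: int, k: int = 8 * N + 64) -> bytes:
    """Build a k-chunk message M with H(M) == target."""
    base, deltas = 0, []
    for i in range(k):
        ha, hb = h(block(i, 0)), h(block(i, 1))
        base ^= ha              # digest if every position picks candidate 0
        deltas.append(ha ^ hb)  # effect of switching position i to candidate 1
    mask = solve_subset(deltas, target ^ base)
    if mask is None:
        raise ValueError("deltas do not span the target; increase k")
    return b"".join(block(i, (mask >> i) & 1) for i in range(k))


original = b"A" * N + b"B" * N
forged = forge(H(original))
print(forged != original, H(forged) == H(original))
```

The extra 64 positions beyond the 256 output bits are there only to make it overwhelmingly likely that the difference vectors span the whole space; the work is a few hundred hash calls plus elimination on a 320-by-256 bit matrix.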

wizzwizz4
Mikero

An answer from real life: I was computing a hash for a tuple (p, q) as hash(p) xor hash(q). This worked fine for years until a user applied it to a dataset in which p was equal to q about 90% of the time (think p = list price, q = discounted price), which meant that 90% of the tuples had a hash code of zero. (This was a hash used for indexing, not for cryptographic purposes.)
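The failure mode is easy to reproduce. The snippet below is a sketch in Python rather than whatever language the original system used; `xor_pair_hash` and `tuple_pair_hash` are hypothetical names, and the built-in tuple hash is shown only as one order-sensitive alternative for non-cryptographic indexing.

```python
def xor_pair_hash(p, q):
    """Hash of a pair as hash(p) xor hash(q), as described above."""
    return hash(p) ^ hash(q)


def tuple_pair_hash(p, q):
    """Alternative: let the built-in tuple hash mix the two components."""
    return hash((p, q))


# Pairs where p == q (e.g. list price equals discounted price).
equal_pairs = [(price, price) for price in range(100, 110)]

# Every such pair lands in the same bucket under the XOR combination.
print({xor_pair_hash(p, q) for p, q in equal_pairs})

# The XOR combination also cannot tell (p, q) from (q, p).
print(xor_pair_hash(3, 7) == xor_pair_hash(7, 3))

# The tuple hash spreads the equal pairs over many values.
print(len({tuple_pair_hash(p, q) for p, q in equal_pairs}))
```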

Michael Kay

This kind of additive combination is vulnerable to generalized-birthday-style attacks, as proposed by Wagner and developed further by others. If you have an $\ell$-fold sum

$$ h(f_1)\oplus h(f_2) \oplus \cdots \oplus h(f_{\ell})=z $$

where $z \in \{0,1\}^d$, you can use a recursive divide-and-conquer attack to find a preimage of $z$ with time and memory complexity essentially of the order

$$ \widetilde{O}(2^{d/(1+\lfloor \log_2 \ell \rfloor)}) $$

where the notation $\widetilde{O}$ hides logarithmic factors. For a related question see here.
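As a rough worked instance of the bound above, assuming $d = 256$ (a SHA-256-sized output, chosen only for illustration) and ignoring the hidden logarithmic factors:

$$ \ell = 2:\ 2^{256/2} = 2^{128}, \qquad \ell = 4:\ 2^{256/3} \approx 2^{85.3}, \qquad \ell = 256:\ 2^{256/9} \approx 2^{28.4}. $$

So the cost exponent shrinks with $1+\lfloor \log_2 \ell \rfloor$: each doubling of the number of terms adds one to the denominator.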

Edit: As the other answer notes, if the number of terms equals the number of output bits, the attack runs in polynomial time.

wizzwizz4
kodlu