Why do popular ARX ciphers have large states?

Question

Salsa20, ChaCha20, BLAKE, BLAKE2 and BLAKE3 all use a 4x4 grid of words, on which transformations are applied row-wise and then column- or diagonal-wise.

State size varies between 512 and 1024 bits depending on word size (32-bit or 64-bit).

Alternating these operations between rows and columns is by definition inefficient with respect to diffusion/confusion, as there are simply more bits to mix.

Why not just take the ARX row operation and base the entire cipher on it, which would result in a 128-bit or 256-bit permutation? A key schedule would be required to avoid halving the security, but that doesn't seem like a problem.

score 4 · Answer 1 · answered Dec 23 '24 at 11:20

One reason to use a 4x4 grid (and hence have a larger state) is to leverage instruction-level parallelism.

On Intel Skylake, for instance, additions and XORs have a latency of 1 cycle but a reciprocal throughput of 0.25 cycles (up to four can issue per cycle), meaning that we can interleave the independent ARX calculations across rows/columns to avoid CPU stalls.
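To make the independence concrete, here is a minimal Python sketch. It assumes the ChaCha quarter-round as the ARX step (the other designs in the question use different constants and word orders), and the names `quarter_round` and `column_round` are only illustrative. Python will not show any speed-up itself; the point is that the four calls in `column_round` touch disjoint words, so a CPU is free to overlap them.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    """ChaCha-style quarter-round: a serial chain of add, xor, rotate."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

def column_round(state):
    """Apply the quarter-round to each column of a row-major 4x4 state.

    Each iteration reads and writes only its own column, so the four
    dependency chains are independent of one another.
    """
    out = list(state)
    for col in range(4):
        idx = (col, col + 4, col + 8, col + 12)
        a, b, c, d = quarter_round(*(state[i] for i in idx))
        out[idx[0]], out[idx[1]], out[idx[2]], out[idx[3]] = a, b, c, d
    return out
```

Within a single quarter-round every operation depends on the previous one, so a cipher built from one such chain on a 128-bit state would leave most of the execution ports idle. With four columns there are four chains to interleave.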

Note that ARX ciphers that operate on a 128-bit state and include a key schedule, as you mentioned, have been proposed. While they are not as widely known as ChaCha or BLAKE, some have achieved significant recognition. Notably, the Korean standard LEA has also been adopted as part of the international standard ISO/IEC 29192-2:2019.

DerekKnowles · Answer 2 · 2024-12-25T18:19:59.027

It's because all of the ciphers and hash functions you listed are derived from one another. BLAKE3 is based on BLAKE2, which is based on BLAKE, which is based on ChaCha, which is based on Salsa20. Since Salsa20 uses a 4x4 grid, its children and grandchildren ended up using a 4x4 grid as well.

As for why Salsa20 uses a 4x4 grid: it is essentially to leverage SIMD parallelism, since four 32-bit words fit into a single 128-bit SSE2/NEON register. This allows four row/column operations to be done concurrently, which is why ChaCha, Salsa20 and BLAKE are very fast.
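As a rough illustration, the sketch below models each 128-bit register as a Python list of four 32-bit lanes and runs a ChaCha-style double round on whole rows. It is an assumption-laden model rather than real SIMD code: the helper names (`vadd`, `vxor`, `vrotl`, `rotate_lanes`, `double_round`) are made up for this example, and the quarter-round constants are ChaCha's, not Salsa20's.

```python
MASK32 = 0xFFFFFFFF

# Lane-wise helpers: each stands in for one 128-bit vector instruction.
def vadd(x, y):
    return [(p + q) & MASK32 for p, q in zip(x, y)]

def vxor(x, y):
    return [p ^ q for p, q in zip(x, y)]

def vrotl(x, n):
    return [((p << n) | (p >> (32 - n))) & MASK32 for p in x]

def quarter_round_lanes(a, b, c, d):
    """Run four quarter-rounds at once, one per lane."""
    a = vadd(a, b); d = vrotl(vxor(d, a), 16)
    c = vadd(c, d); b = vrotl(vxor(b, c), 12)
    a = vadd(a, b); d = vrotl(vxor(d, a), 8)
    c = vadd(c, d); b = vrotl(vxor(b, c), 7)
    return a, b, c, d

def rotate_lanes(row, k):
    """Rotate the lanes of a row left by k positions (a vector shuffle)."""
    return row[k:] + row[:k]

def double_round(rows):
    """One column round plus one diagonal round on a state of four rows."""
    a, b, c, d = rows
    # Column round: lane i of each row already forms column i.
    a, b, c, d = quarter_round_lanes(a, b, c, d)
    # Shuffle rows so that the diagonals line up as columns.
    b, c, d = rotate_lanes(b, 1), rotate_lanes(c, 2), rotate_lanes(d, 3)
    a, b, c, d = quarter_round_lanes(a, b, c, d)
    # Undo the shuffle to restore the original layout.
    b, c, d = rotate_lanes(b, 3), rotate_lanes(c, 2), rotate_lanes(d, 1)
    return [a, b, c, d]
```

The whole double round needs only lane-wise add/xor/rotate plus a few lane shuffles, which is what maps so directly onto 128-bit vector registers.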
