ChaCha20 is essentially a hash function that maps 512-bit strings to other 512-bit strings; each output block is then XORed with the plaintext to create the ciphertext. Of the 512 input bits, 128 bits are used for the "expand 32-byte k" constant, 256 bits for the key, 64 bits for the nonce, and the remaining 64 bits for the counter.
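To make the layout concrete, here is a minimal, unoptimized sketch of the block function in Python, assuming the original ChaCha layout in which the two counter words sit before the two nonce words in the state. The function names (`initial_state`, `chacha20_block`, `chacha20_xor`) are my own, and this is only meant for illustration, not for real use.

```python
import struct

MASK32 = 0xFFFFFFFF
SIGMA = b"expand 32-byte k"  # the 128-bit constant


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """ChaCha quarter round on four 32-bit words of the state list s."""
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)


def initial_state(key, nonce, counter):
    """Sixteen 32-bit words: 4 constant, 8 key, 2 counter, 2 nonce."""
    if len(key) != 32 or len(nonce) != 8:
        raise ValueError("expected a 32-byte key and an 8-byte nonce")
    return list(
        struct.unpack("<4I", SIGMA)
        + struct.unpack("<8I", key)
        + (counter & MASK32, (counter >> 32) & MASK32)
        + struct.unpack("<2I", nonce)
    )


def chacha20_block(key, nonce, counter):
    """Map the 512-bit input state to one 512-bit keystream block."""
    state = initial_state(key, nonce, counter)
    w = state[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        # column round
        quarter_round(w, 0, 4, 8, 12)
        quarter_round(w, 1, 5, 9, 13)
        quarter_round(w, 2, 6, 10, 14)
        quarter_round(w, 3, 7, 11, 15)
        # diagonal round
        quarter_round(w, 0, 5, 10, 15)
        quarter_round(w, 1, 6, 11, 12)
        quarter_round(w, 2, 7, 8, 13)
        quarter_round(w, 3, 4, 9, 14)
    # feed-forward: add the input state to the permuted state
    out = [(x + y) & MASK32 for x, y in zip(w, state)]
    return struct.pack("<16I", *out)


def chacha20_xor(key, nonce, data, counter=0):
    """XOR data with the keystream; the same call encrypts and decrypts."""
    out = bytearray()
    for i in range(0, len(data), 64):
        block = chacha20_block(key, nonce, counter + i // 64)
        out.extend(x ^ k for x, k in zip(data[i:i + 64], block))
    return bytes(out)
```

Calling `chacha20_xor` twice with the same key, nonce, and counter returns the original message, since encryption is just an XOR with the keystream.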
I noticed in the eBASH/eBASC benchmarks here and here that, depending on the architecture, the speed of ChaCha20 for encrypting a 512-bit message ranges (at least on 64-bit architectures) from slightly faster to significantly slower than BLAKE2b. For this comparison I assume BLAKE2b is used in a mode similar to the one used in ChaCha20: hash a 512-bit input consisting of a constant, key, nonce, and counter, and XOR the result with the message. BLAKE2b is a "real" cryptographic, collision-resistant hash function based on ChaCha that, given a message of any (reasonable) size, produces a 512-bit output. BLAKE2b has 12 rounds (each of which, if I understand correctly, is equal to two ChaCha rounds).
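To be explicit about the mode I have in mind, here is a rough sketch using Python's `hashlib.blake2b`. The function names and the choice to reuse the ChaCha constant and a little-endian 64-bit counter are my own assumptions for illustration; this is a hypothetical construction, not a vetted cipher. The 64-byte input fits in a single 128-byte BLAKE2b block, so each keystream block should cost one compression call.

```python
import hashlib

# Assumption: reuse the ChaCha constant as the 128-bit domain constant.
CONSTANT = b"expand 32-byte k"


def blake2b_keystream_block(key, nonce, counter):
    """Hash constant || key || nonce || counter (512 bits) to 512 bits."""
    if len(key) != 32 or len(nonce) != 8:
        raise ValueError("expected a 32-byte key and an 8-byte nonce")
    block_input = CONSTANT + key + nonce + counter.to_bytes(8, "little")
    return hashlib.blake2b(block_input, digest_size=64).digest()


def blake2b_ctr_xor(key, nonce, data, counter=0):
    """XOR data with the BLAKE2b-derived keystream (CTR-like mode)."""
    out = bytearray()
    for i in range(0, len(data), 64):
        block = blake2b_keystream_block(key, nonce, counter + i // 64)
        out.extend(x ^ k for x, k in zip(data[i:i + 64], block))
    return bytes(out)
```

As with ChaCha20, applying `blake2b_ctr_xor` a second time with the same key, nonce, and counter recovers the message.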
My questions:
- How can BLAKE2b be faster than ChaCha20 on some architectures even though it does more work?
- I think that BLAKE2b uses 64-bit words rather than the 32-bit words that ChaCha20 uses. Is this one of the causes?
- Also, would there be any reason not to prefer BLAKE2b (in what is essentially a CTR mode) on such an architecture?