11

I have a program where computing SHA-1 is the bottleneck. This is using OpenSSL 1.0.0e on a 2.6 GHz 16-core Opteron, where I get about 325 MiB/s throughput. (SHA-1 here is via Andy Polyakov's x86-64 assembly implementation using SSSE3.)

If I need to go faster, what makes most sense?

  • Use multiple cores
  • Use another implementation that's faster than default OpenSSL's
  • Buy a CUDA-based video card and leverage it
  • Buy a commercial hardware accelerator

I assume multiple cores won't help me because of overhead. I don't know much about CUDA other than that it seems a popular choice for SHA-1 brute forcing, but I'm unsure whether it will help when I have to hash a large file. And I haven't had much luck looking for dedicated hardware accelerators (though I'm just starting to look).

otus
Fixee

3 Answers

9

325 MB/s is already good, so you will not get much more with another implementation. Also, SHA-1 is a sequential algorithm (each block is processed using the state left by the previous block), so multiple cores or a GPU will not help you hash a single message. Specialized hardware is probably your best bet to make SHA-1 faster.

(Also, if SHA-1 is the bottleneck then you are able to move data around faster than that, which is impressive; usually, network or hard disk bandwidth is the bottleneck, not hashing.)

If you can change the protocol, you could switch to another, faster hash function. I suggest RadioGatún[64]; there is an optimized implementation (in C) in sphlib. It is faster than SHA-1 (more than 660 MB/s on an Intel Core2 clocked at 2.4 GHz). RadioGatún[64] is not actively supported anymore but it did receive some public analysis, and it should not be weaker than SHA-1 anyway.

To benefit from multiple cores, you can use tree hashing, as @Paŭlo suggests. It is not that easy to get right (I mean, if you still want security), so I would advise using a hash function for which a tree hashing mode has already been specified by competent cryptographers. This points to Skein, a current candidate for the upcoming SHA-3 standard. The "tree mode" is not part of "the Skein for SHA-3", so it was not thoroughly investigated, but at least it was designed by generally smart people who know their trade. Skein itself is still a "recent design", but it is still better than SHA-1, which has known weaknesses.
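To make the tree hashing idea concrete, here is a minimal two-level sketch in Python. It uses SHA-1 from the standard `hashlib` module as a stand-in; it is not Skein's tree mode, and the one-byte leaf/root prefixes, the length field, and the function name `tree_sha1` are assumptions made up for illustration. The result is not a plain SHA-1 digest and depends on the chunk size, and the construction has not been analysed, which is exactly why a vetted tree mode is preferable. Threads are used on the assumption that `hashlib` releases the GIL while hashing large buffers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical domain-separation tags so a leaf can never be confused
# with the root node. These are illustrative, not from any standard.
LEAF_PREFIX = b"\x00"
ROOT_PREFIX = b"\x01"


def _leaf_digest(chunk) -> bytes:
    """Hash one chunk independently of all the others."""
    h = hashlib.sha1(LEAF_PREFIX)
    h.update(chunk)
    return h.digest()


def tree_sha1(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> str:
    """Two-level tree hash: parallel leaf hashes, then one root hash."""
    view = memoryview(data)  # slicing a memoryview avoids copying chunks
    chunks = [view[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [view]

    # The leaves have no data dependency on each other, so they can
    # be hashed on separate cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leaves = list(pool.map(_leaf_digest, chunks))

    # The root binds the total length and every leaf digest, in order.
    root = hashlib.sha1(ROOT_PREFIX)
    root.update(len(data).to_bytes(8, "big"))
    for digest in leaves:
        root.update(digest)
    return root.hexdigest()
```

Both sides of a protocol would have to agree on the chunk size and the prefixes, since changing either changes the output.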

Thomas Pornin
4

For free software-based solutions on x86_64, OpenSSL is the best around. Intel's IPP is purported to be 20% faster, and it's software-only, but it's not free (about 200 USD, or 80 USD for the academic version), and you have to fill out a form saying you're not from N. Korea, etc.

There are hardware accelerators in the form of SSL cards/chips, but they are somewhat pricey. Although SHA-1 is fundamentally serial (i.e., you need the chaining value of the prior block before you can compute the chaining value for the next block), there is instruction-level parallelism (which is what the SSSE3-based software implementations in OpenSSL and IPP exploit). CUDA-based implementations can be quite fast compared to IPP. See the recent paper in NSDI, for example.
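The serial chaining is visible in how a large file is normally hashed: one hash object carries the state forward, and each block is fed in order. A minimal sketch using Python's standard `hashlib` follows; the helper name `sha1_stream` and the 1 MiB read size are illustrative assumptions.

```python
import hashlib
import io


def sha1_stream(stream, block_size: int = 1 << 20) -> str:
    """Hash a binary stream block by block with a single SHA-1 state."""
    h = hashlib.sha1()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Each update continues from the state left by the previous
        # one, so the blocks cannot be processed out of order.
        h.update(block)
    return h.hexdigest()


# Usage: any binary file-like object works, e.g. open(path, "rb").
print(sha1_stream(io.BytesIO(b"abc")))
```

The read size only affects I/O behaviour; the digest is the same as hashing the whole input in one call.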

Fixee
1

I have a program where computing SHA-1 is the bottleneck... If I need to go faster, what makes most sense?

I believe you are using OpenSSL's sha1-586.pl written by Andy Polyakov. Andy's code is likely running at around 6.0 to 7.0 cycles per byte.
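Cycles per byte and throughput are related by the clock rate, so the figures in the question can be converted directly. The sketch below uses the question's 2.6 GHz and 325 MiB/s; the function names are made up for illustration, and applying another chip's cycles-per-byte figure at the same clock is only a hypothetical comparison, since real machines differ in clock speed.

```python
MIB = 2 ** 20  # bytes in one MiB


def cycles_per_byte(clock_hz: float, bytes_per_second: float) -> float:
    """Cycles spent per hashed byte at a given clock and throughput."""
    return clock_hz / bytes_per_second


def throughput_mib_s(clock_hz: float, cpb: float) -> float:
    """Throughput in MiB/s for a given clock and cycles-per-byte cost."""
    return clock_hz / cpb / MIB


# Figures from the question: 2.6 GHz clock, 325 MiB/s measured.
measured_cpb = cycles_per_byte(2.6e9, 325 * MIB)
print(f"measured: {measured_cpb:.2f} cycles/byte")

# Hypothetical: the same clock at 2.8 cycles/byte.
print(f"at 2.8 cpb: {throughput_mib_s(2.6e9, 2.8):.0f} MiB/s")
```

With the question's numbers this works out to roughly 7.6 cycles per byte.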

Since about mid-2014 you can buy an Intel Goldmont or Goldmont+ machine, like here. SHA-1 runs around 2.8 cycles per byte. I believe the latest AMD Ryzens include SHA-1 instructions, but I don't have one and don't know the benchmark results.

Since about 2012 you can buy an ARM AArch64 board that has crypto extensions. SHA-1 runs at around 2.5 cycles per byte on Cortex-A53s, the lower-end ARMv8 cores. The higher-end Cortex-A57s will probably perform a little better, around 1.5 to 2.0 cycles per byte.

You might also be interested in SHA Intrinsics on GitHub. It provides reference implementations for SHA using x86, AArch64 and PowerPC intrinsics. Andy's ASM will run faster, but the intrinsics are a little easier to use since you don't have to do the Cryptogams SHA port.