In x86, when I have two registers that I know each have exactly one bit set, and I want to know whether they're equal, I can use either `test` or `cmp`: `cmp a, b` sets the zero flag (ZF) when they're equal, while `test a, b` sets ZF when they're not equal.
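The claim above can be checked with a small C model of the zero flag. This is only a sketch: the helper names `zf_after_cmp`, `zf_after_test`, and `count_one_hot_disagreements` are made up for illustration, and the model covers ZF only, not the other flags that `cmp` and `test` write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: ZF after `cmp a, b` is set when a - b == 0. */
static bool zf_after_cmp(uint64_t a, uint64_t b) {
    return (a - b) == 0;
}

/* Hypothetical model: ZF after `test a, b` is set when (a & b) == 0. */
static bool zf_after_test(uint64_t a, uint64_t b) {
    return (a & b) == 0;
}

/* Walk all 64 x 64 pairs of one-hot values and count the pairs where
 * the two modeled flags are NOT opposite. */
static int count_one_hot_disagreements(void) {
    int bad = 0;
    for (int i = 0; i < 64; i++) {
        for (int j = 0; j < 64; j++) {
            uint64_t a = UINT64_C(1) << i;
            uint64_t b = UINT64_C(1) << j;
            if (zf_after_cmp(a, b) == zf_after_test(a, b)) {
                bad++;
            }
        }
    }
    return bad;
}
```

For one-hot inputs, `count_one_hot_disagreements()` returns 0, meaning the two flags are always opposite, so either instruction can be used as long as the branch condition is inverted accordingly.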

Questions like *In x86 what's difference between "test eax,eax" and "cmp eax,0"* or *Test whether a register is zero with CMP reg,0 vs OR reg,reg?* say that when comparing against zero, `test` is preferred over `cmp`. Does this advice still hold when comparing two registers? Or does the fact that one needs a zero result and the other a non-zero result make a difference?

I'm mainly interested in comparing 64-bit registers on a 64-bit processor, but if there's a difference with 32-bit registers I would like to hear about it too. The most important targets are the latest Alder Lake and Zen 3, but other processors can be interesting too.

Chayim Friedman
  • `test r,r` and `cmp r,r` have identical performance on all microarchitectures I am aware of. Don't worry about this. – fuz Jul 19 '22 at 17:54
  • @fuz So can you post that as an answer please? – Chayim Friedman Jul 19 '22 at 17:55
  • I have not checked the tables for all these microarchitectures to tell for sure and I don't plan to put in the 30 minutes needed to do it, so no, I'm not going to post an answer. – fuz Jul 19 '22 at 17:57
  • The "Core" branding encompasses 10 microarchitectures starting in 2006 and the "Zen" branding is used for another 5 microarchitectures starting in 2017 (not sure why you don't care about the heavy-machinery branded AMD microarchitectures, which would better match the time frame of the Core branding). So, if you could perhaps reduce your requirements to significantly fewer microarchitectures... – fuz Jul 19 '22 at 18:01
  • @fuz: [x86_64 - Assembly - loop conditions and out of order](https://stackoverflow.com/q/31771526) shows `cmp` can't macro-fuse with `js`, `jp`, or `jo` on SnB, but `test` can. But that's irrelevant for this case since you'd be using `je`/`jne`: both instructions can fuse with `je`/`jne` on any CPU that can macro-fuse them at all. I thought my answers on the Q&As linked in the question made that clear, or at least pointed to Agner Fog's guides where the details can be found. (And `test` and `cmp` are the same size, unlike `cmp r,0` vs. `test r,r`.) – Peter Cordes Jul 19 '22 at 18:07
  • @fuz Relaxed to last :) I am interested in others, but they're most important. – Chayim Friedman Jul 19 '22 at 18:07
  • Pretty sure there's no performance difference on any x86 microarchitecture, and code-size is the same, unlike with `cmp reg, 0`. Optimizing away the immediate `0` is the main reason to use `test` in the Q&As you linked; differences in macro-fusion are mostly only for JCC predicates that you wouldn't normally use for `x<0`, `x == 0`, or `x>0` or whatever. – Peter Cordes Jul 19 '22 at 18:14

1 Answer

In the scenario you described, both instructions perform identically on recent microarchitectures. On Alder Lake P, both can run on ports 0, 1, 5, 6, and 11 with a reciprocal throughput of 0.2 cycles (on Alder Lake E, the reciprocal throughput is 0.25 cycles, with slightly fewer ports), while on Zen 3, both run on 4 ports with a reciprocal throughput of 0.25 cycles. The latency is 1 cycle in both cases.

As for macro-fusion, both instructions fuse with `je` and `jne`, which are the branches you are interested in.
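As a rough sketch of the two forms at the source level, the C functions below express the same check both ways. The function names are hypothetical, and the instruction sequences in the comments are only what a compiler would typically be expected to emit (register names assume the System V calling convention); actual output depends on the compiler and its options.

```c
#include <stdbool.h>
#include <stdint.h>

/* Equality form. Typical lowering in a branch (an assumption, not guaranteed):
 *     cmp rdi, rsi
 *     je  equal_path      ; taken when ZF = 1
 */
static bool same_bit_cmp(uint64_t a, uint64_t b) {
    return a == b;
}

/* AND form. Typical lowering in a branch (an assumption, not guaranteed):
 *     test rdi, rsi
 *     jne  equal_path     ; taken when ZF = 0
 * This matches same_bit_cmp only when both inputs have exactly one bit set. */
static bool same_bit_test(uint64_t a, uint64_t b) {
    return (a & b) != 0;
}
```

The two functions agree for one-hot inputs but diverge otherwise, for example `same_bit_cmp(3, 1)` is false while `same_bit_test(3, 1)` is true, so the `test` form relies on the one-bit guarantee stated in the question.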

So in this particular case it does not make a difference. There may be a difference in other use cases, e.g. when immediates or other conditions are involved.

fuz
  • I wonder whether `TEST` should be preferred to `CMP` where equal in performance based on (potentially) better energy efficiency? I have no idea whether any of the research projects that work on energy-efficient computing have analyzed instruction sets down to that level of detail. – njuffa Jul 19 '22 at 21:13
  • @njuffa I am inclined to believe that there is no difference, and if there is, it'll be in the picowatt range. – fuz Jul 19 '22 at 21:39
  • Definitely in the pJ range. But if energy-efficiency principles were followed in *all* situations where they can arise in the instruction picking process of a compiler, it could add up. Hypothesis: `TEST` involves fewer switching transistors than `CMP` since there are no carries to propagate. – njuffa Jul 19 '22 at 23:05