I know that bitwise operations are so fast on modern processors, because they can operate on 32 or 64 bits in parallel, so bitwise operations take only one clock cycle. However, addition is a complex operation that consists of at least one and possibly up to a dozen bitwise operations, so I naturally thought it would be 3-4 times slower. I was surprised to see after a simple benchmark that addition is exactly as fast as any of the bitwise operations (XOR, OR, AND etc.). Can anyone shed light on this?
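For reference, here is a minimal sketch of the kind of benchmark I mean (illustrative, not my exact code; the volatile sink only exists to stop the compiler from deleting the loops):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 100000000u

/* Time one operation applied across a large loop. */
static double time_loop(uint64_t (*op)(uint64_t, uint64_t)) {
    volatile uint64_t sink = 0;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < N; i++)
        sink = op(sink, i);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

static uint64_t op_add(uint64_t a, uint64_t b) { return a + b; }
static uint64_t op_xor(uint64_t a, uint64_t b) { return a ^ b; }
static uint64_t op_and(uint64_t a, uint64_t b) { return a & b; }
static uint64_t op_or (uint64_t a, uint64_t b) { return a | b; }

int main(void) {
    printf("add: %.2fs\n", time_loop(op_add));
    printf("xor: %.2fs\n", time_loop(op_xor));
    printf("and: %.2fs\n", time_loop(op_and));
    printf("or:  %.2fs\n", time_loop(op_or));
    return 0;
}
```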
9 Answers
Addition is fast because CPU designers have put in the circuitry needed to make it fast. It does take significantly more gates than bitwise operations, but it is frequent enough that CPU designers have judged it to be worth it. See https://en.wikipedia.org/wiki/Adder_(electronics).
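To make the circuitry concrete, here is a rough software model (a sketch in C, not how you would actually add numbers) of the 1-bit full adder described in that article, built purely from bitwise operations:

```c
#include <stdio.h>

/* One full-adder cell, expressed with the same gates the hardware uses:
   sum = a XOR b XOR carry_in, carry_out = majority(a, b, carry_in). */
static void full_adder(unsigned a, unsigned b, unsigned cin,
                       unsigned *sum, unsigned *cout) {
    *sum  = a ^ b ^ cin;
    *cout = (a & b) | (a & cin) | (b & cin);
}

int main(void) {
    unsigned s, c;
    full_adder(1, 1, 1, &s, &c);   /* 1 + 1 + 1 = 3 -> sum 1, carry 1 */
    printf("sum=%u carry=%u\n", s, c);
    return 0;
}
```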
Both can be made fast enough to execute within a single CPU cycle. They're not equally fast -- addition requires more gates and more latency than a bitwise operation -- but it's fast enough that a processor can do it in one clock cycle. There is a per-instruction overhead for instruction decoding and control logic, and that overhead is significantly larger than the latency of a bitwise operation, so the difference between the two operations gets swamped by it. AProgrammer's answer and Paul92's answer explain those effects well.
There are several aspects.
The relative cost of a bitwise operation and an addition. A naive adder has a gate depth that depends linearly on the width of the word. There are alternative approaches, more costly in terms of gates, which reduce the depth (IIRC the depth then depends logarithmically on the width of the word). Others have given references for such techniques; I'll just point out that the difference is also less important than it may seem from the cost of the operation alone, because of the need for control logic, which adds delays.
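As a rough illustration of that linear depth, here is a sketch of ripple-carry addition in C: each bit's carry depends on the bit below it, so the longest path grows with the word width:

```c
#include <stdint.h>

/* Ripple-carry addition: bit i cannot finish until the carry out of
   bit i-1 is known, so the critical path is linear in WIDTH.
   (In hardware all cells exist at once; the loop models the carry chain.) */
#define WIDTH 64

uint64_t ripple_add(uint64_t a, uint64_t b) {
    uint64_t result = 0, carry = 0;
    for (int i = 0; i < WIDTH; i++) {
        uint64_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        result |= (ai ^ bi ^ carry) << i;
        carry = (ai & bi) | (ai & carry) | (bi & carry);
    }
    return result;
}
```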
Then there is the fact that processors are usually clocked (I'm aware of some research and special-purpose non-clocked designs, but I'm not even sure any are available commercially). That means that whatever the speed of an operation, it will take an integer multiple of the clock cycle.
Finally there are the micro-architectural considerations: are you sure that you measure what you want? Nowadays, processors tend to be pipelined, superscalar, with out-of-order execution and whatever else. That means that they are able to execute several instructions at the same time, at various stages of completion. If you want to show by measurement that one operation takes more time than another, you have to take those aspects into consideration, as their goal is to hide those differences. You may very well get the same throughput for addition and bitwise operations when using independent data, but a measure of the latency, or introducing dependencies between the operations, may show otherwise. And you also have to be sure that the bottleneck of your measurement is in the execution, and not, for instance, in the memory accesses.
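As a sketch of that latency/throughput distinction (loop shapes and names are illustrative only): a dependent chain exposes the latency of the add, while independent accumulators expose its throughput:

```c
#include <stdint.h>

/* Latency: each add depends on the previous result, so the adds
   cannot overlap in the pipeline. */
uint64_t dependent_adds(uint64_t x, long n) {
    for (long i = 0; i < n; i++)
        x = x + 1;              /* serial dependency chain */
    return x;
}

/* Throughput: four independent chains let an out-of-order core
   keep several adds in flight per cycle. */
uint64_t independent_adds(long n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (long i = 0; i < n; i += 4) {
        a += 1; b += 2; c += 3; d += 4;   /* no cross-dependencies */
    }
    return a + b + c + d;
}
```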
CPUs operate in cycles. At each cycle, something happens. Usually, an instruction takes multiple cycles to complete, but several instructions are executed at the same time, in different stages.
For example, a simple processor might have 3 steps for each instruction: fetch, execute and store. At any time, 3 instructions are being processed: one is being fetched, one is being executed and one stores its results. This is called a pipeline and has, in this example, 3 stages. Modern processors have pipelines with over 15 stages. However, addition, like most arithmetic operations, is usually executed in one stage (I am speaking about the operation of adding 2 numbers by the ALU, not about the instruction itself: depending on the processor architecture, the instruction might require more cycles for fetching arguments from memory, performing conditionals, or storing results to memory).
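Here is a toy model of that 3-stage pipeline (purely illustrative, not how any real CPU is written): at each cycle three different instructions occupy the fetch, execute and store stages:

```c
#include <stdio.h>

/* Toy 3-stage pipeline: instruction k is fetched at cycle k,
   executed at cycle k+1, stored at cycle k+2. Once the pipeline is
   full, one instruction completes every cycle even though each one
   takes three cycles end to end. */
int main(void) {
    const int n_instr = 5;
    for (int cycle = 0; cycle < n_instr + 2; cycle++) {
        printf("cycle %d:", cycle);
        if (cycle < n_instr)                   printf(" fetch I%d", cycle);
        if (cycle >= 1 && cycle - 1 < n_instr) printf(" exec I%d", cycle - 1);
        if (cycle >= 2 && cycle - 2 < n_instr) printf(" store I%d", cycle - 2);
        printf("\n");
    }
    return 0;
}
```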
The duration of a cycle is determined by the longest critical path. Basically, it's the longest amount of time necessary for some stage of the pipeline to complete. If you want to make the CPU faster, you need to optimize the critical path. If reducing the critical path per se is not possible, it can be split into 2 stages of the pipeline, and you are now able to clock your CPU at almost twice the frequency (assuming there is not another critical path that prevents you from doing this). But this comes with an overhead: you need to insert a register between the stages of the pipeline. Which means that you don't really gain 2x speed (the register needs time to store the data), and you have complicated the whole design.
There are already quite efficient methods for performing addition (e.g. carry-lookahead adders), and addition is not on the critical path for the processor speed, thus it makes no sense splitting it into multiple cycles.
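For a flavour of how carry lookahead works, here is a 4-bit sketch in C (a software model of the circuit, using the usual generate/propagate signals): every carry is computed directly from the inputs in a fixed number of gate levels, instead of rippling bit by bit:

```c
#include <stdint.h>

/* 4-bit carry-lookahead sketch. Each bit produces
   g = "generates a carry" and p = "propagates a carry"; all four
   carries are then computed in two gate levels instead of rippling. */
uint8_t cla_add4(uint8_t a, uint8_t b, unsigned c0, unsigned *cout) {
    unsigned g[4], p[4], c[5];
    for (int i = 0; i < 4; i++) {
        g[i] = (a >> i) & (b >> i) & 1;     /* a_i AND b_i */
        p[i] = ((a >> i) ^ (b >> i)) & 1;   /* a_i XOR b_i */
    }
    c[0] = c0;
    c[1] = g[0] | (p[0] & c[0]);
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c[0]);
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    uint8_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum |= (uint8_t)((p[i] ^ c[i]) << i);   /* sum_i = p_i XOR c_i */
    *cout = c[4];
    return sum;
}
```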
Also, note that while it might seem complicated to you, in hardware things can be done in parallel very fast.
Processors are clocked, so even if some instructions can clearly be done faster than others, they may well take the same number of cycles.
You'll probably find that the circuitry required to transport data between registers and execution units is significantly more complicated than the adders.
Note that the simple MOV (register to register) instruction does even less computation than bitwise logic, yet both MOV and ADD usually take one cycle. If MOV could be made twice as fast, CPUs would be clocked twice as fast and the ADDs would be two cycles.
Addition is important enough not to have it wait for a carry bit to ripple through a 64-bit accumulator: the term for that is a carry-lookahead adder, and they have basically been part of CPUs (and their ALUs) from 8-bit designs upwards. Indeed, modern processors tend not to need much more execution time for a full multiplication either: carry lookahead is actually a really old (and comparatively affordable) tool in a processor designer's toolbox.
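To see why multiplication mostly reduces to additions, here is a sketch of the classic shift-and-add scheme (real hardware sums the partial products in parallel trees, but the principle is the same):

```c
#include <stdint.h>

/* Shift-and-add multiplication: for each set bit of b, add a shifted
   copy of a. Hardware multipliers sum these partial products in
   parallel, with fast adders at the end. */
uint64_t shift_add_mul(uint64_t a, uint64_t b) {
    uint64_t product = 0;
    while (b) {
        if (b & 1)
            product += a;   /* add this partial product */
        a <<= 1;
        b >>= 1;
    }
    return product;
}
```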
I think you'd be hard pressed to find a processor that had addition taking more cycles than a bitwise operation. Partly because most processors must carry out at least one addition per instruction cycle simply to increment the program counter. Mere bitwise operations aren't all that useful.
(Instruction cycle, not clock cycle - e.g. the 6502 takes a minimum of two clock cycles per instruction due to being non-pipelined and not having an instruction cache)
The real concept you may be missing is that of the critical path: within a chip, the longest operation that may be performed within one cycle dictates, at the hardware level, how fast the chip may be clocked.
The exception to this is the (rarely used and hardly commercialised) asynchronous logic, which really does execute at different speeds depending on logic propagation time, device temperature etc.
At the gate level, you are correct that it takes more work to do addition, and thus takes longer. However, that cost is sufficiently trivial that it doesn't matter.
Modern processors are clocked. Instructions cannot take anything other than multiples of this clock period. If the clock were pushed higher, to maximize the speed of the bitwise operations, you would have to spend at least 2 cycles on addition. Much of this time would be spent waiting around, because you didn't really need the full 2 cycles' worth of time. You only needed 1.1 (or some number like that). Now your chip adds slower than everyone else's on the market.
Worse, the mere act of adding or doing bitwise operations is only one tiny part of what is going on during a cycle. You have to be able to fetch/decode instructions within a cycle. You have to be able to do cache operations within a cycle. Lots of other things are going on on the same timescale as the simple addition or bitwise operation.
The solution, of course, is to develop a massively deep pipeline, breaking these tasks up into tiny parts that fit into the tiny cycle time defined by a bitwise operation. The Pentium 4 famously showed the limits of thinking in these deep pipeline terms. All sorts of issues arise. In particular branching gets notoriously difficult because you have to flush the pipeline once you have the data to figure out which branch to take.
Modern processors are clocked: Every operation takes some integral number of clock cycles. The designers of the processor determine the length of a clock cycle. There are two considerations there: One, the speed of the hardware, for example measured as the delay of a single NAND-gate. This depends on the technology used, and on tradeoffs like speed vs. power usage. It is independent of the processor design. Two, the designers decide that the length of a clock cycle equals n delays of a single NAND-gate, where n might be 10, or 30, or any other value. (For example, if a NAND gate switches in 20 picoseconds and n = 16, the cycle is 320 picoseconds, i.e. a clock of roughly 3 GHz.)
This choice of n limits how complex the operations are that can be processed in one cycle. There will be operations that can be done in 16 but not in 15 NAND delays. So choosing n = 16 means such an operation can be done in a cycle; choosing n = 15 means it can't be done.
The designers will choose n so that many important operations can be just about done in one, or maybe two or three cycles. n will be chosen to be locally optimal: If you replaced n with n-1, then most operations would be a bit faster, but some (those that really need the full n NAND delays) would be slower. If few enough operations would slow down that overall program execution were faster on average, then you would have picked n-1. You could also have picked n+1. That makes most operations a bit slower, but if you have many operations that can't be done within n delays but can be done within n+1 delays, then it would make the processor overall faster.
Now your question: Addition and subtraction are such common operations that you want to be able to execute them in a single cycle. As a result, it doesn't matter that AND, OR etc. could execute faster: they still need that one cycle. Of course the unit "calculating" AND, OR etc. has a lot of time to twiddle its thumbs, but that can't be helped.
Note that it's not just whether an operation can be done within n NAND delays or not: An addition, for example, can be made faster by being a bit clever, still faster by being very clever, still a bit faster by investing extraordinary amounts of hardware, and at last a processor can have a mixture of very fast, very expensive circuits and a bit slower, cheaper ones, so there is the possibility to make one operation just about fast enough by spending more money on it.
Now you could make the clock speed so high / the cycle so short that only the simple bit operations execute in one cycle and everything else in two or more. That would most likely slow the processor down. For operations that take two cycles, there is usually overhead to move an incomplete instruction from one cycle to the next, so two cycles doesn't mean you have twice as much time for execution. So to do the addition in two cycles, you couldn't double the clock speed.
Let me correct a few things that were not mentioned quite that explicitly in the existing answers:
I know that bitwise operations are so fast on modern processors, because they can operate on 32 or 64 bits in parallel,
This is true. Labeling a CPU as "XX"-bit usually (not always) means that most of its common structures (register widths, addressable RAM etc.) are XX bits in size (often "+/- 1" or some such). But in regard to your question, you can safely assume that a 32-bit or 64-bit CPU will do any basic bit operation on 32 or 64 bits in constant time.
so bitwise operations take only one clock cycle.
This conclusion does not necessarily follow. Especially CPUs with rich instruction sets (google CISC vs. RISC) can easily take more than one cycle for even simple commands. With interleaving, even simple commands might break down into fetch-exec-store, taking 3 clocks (as an example).
However, addition is a complex operation
No, integer addition is a simple operation; subtraction as well. It is very easy to implement adders in full hardware, and they do their stuff as instantaneously as basic bit operations.
that consists of at least one and possibly up to a dozen bitwise operations, so I naturally thought it would be 3-4 times slower.
It will take up 3-4 times as many transistors, but in comparison to the big picture that is negligible.
I was surprised to see after a simple benchmark that addition is exactly as fast as any of the bitwise operations (XOR, OR, AND etc.). Can anyone shed light on this?
Yes: integer addition is a bitwise operation (with a few more gates involved than the others, but still). There is no need to do anything in stages; there is no need for complicated algorithms, extra clocks or anything else.
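You can even reproduce this in software: the classic loop below adds two numbers using only XOR, AND and a shift. The loop plays the role of the carry chain that hardware resolves with lookahead:

```c
#include <stdint.h>

/* Addition built from bitwise operations only: XOR gives the
   carry-less sum, AND finds the carries, and the shift moves the
   carries into position. The loop ends when no carries remain. */
uint32_t add_bitwise(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t carry = a & b;
        a = a ^ b;
        b = carry << 1;
    }
    return a;
}
```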
If you wish to add more bits than your CPU architecture supports, you will incur the penalty of having to do it in stages. But this is on another level of complexity (programming-language level, not assembly/machine-code level). This was a common problem in the past (or today on small embedded CPUs). For PCs etc., their 32 or 64 bits cover the most common data types, so this has largely become a moot point.
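For example, a 128-bit addition on a 64-bit machine has to be done in two stages, propagating the carry manually (a sketch; the u128 type here is made up for illustration, and compilers typically emit an add/add-with-carry instruction pair for it):

```c
#include <stdint.h>

/* 128-bit addition using two 64-bit words: add the low halves,
   detect overflow to get the carry, then add it into the high half.
   This is the "in stages" penalty for exceeding the native width. */
typedef struct { uint64_t lo, hi; } u128;

u128 add128(u128 x, u128 y) {
    u128 r;
    r.lo = x.lo + y.lo;
    uint64_t carry = (r.lo < x.lo);   /* unsigned wraparound => carry */
    r.hi = x.hi + y.hi + carry;
    return r;
}
```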