
Suppose I am given an array of $n$ fixed-width integers (i.e. they fit in a register of width $w$), $a_1, a_2, \dots, a_n$. I want to compute the sum $S = a_1 + \ldots + a_n$ on a machine with 2's complement arithmetic, which performs additions modulo $2^w$ with wraparound semantics. That's easy, but the sum may overflow the register size, and if it does, the result will be wrong.

If the sum doesn't overflow, I want to compute it and verify that there is no overflow, as fast as possible. If the sum overflows, I only want to know that it does; I don't care about any value.

Naively adding numbers in order doesn't work, because a partial sum may overflow. For example, with 8-bit registers, $(120, 120, -115)$ is valid and has a sum of $125$, even though the partial sum $120+120$ overflows the register range $[-128,127]$.

Obviously I could use a bigger register as an accumulator, but let's assume the interesting case where I'm already using the biggest possible register size.

There is a well-known technique: always add a number whose sign is opposite to that of the current partial sum. It avoids overflow at every step, at the cost of not being cache-friendly and not taking much advantage of branch prediction and speculative execution.
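For concreteness, here is a rough C sketch of that technique for $w = 32$ (the function names are mine, and it assumes the input may be reordered): the values are partitioned into negatives and non-negatives, and each step consumes a value whose sign is opposite to that of the running sum, so an individual addition can overflow only when the total itself does not fit.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Adds v to s, reporting whether the addition would overflow int32_t. */
static bool checked_add32(int32_t s, int32_t v, int32_t *out)
{
    if ((v > 0 && s > INT32_MAX - v) || (v < 0 && s < INT32_MIN - v))
        return false;
    *out = s + v;
    return true;
}

/* Opposite-sign summation: returns false exactly when the true sum
   does not fit in an int32_t. */
static bool sum_opposite_sign(int32_t *a, size_t n, int32_t *sum)
{
    size_t split = 0;                       /* a[0..split) will hold the negatives */
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0) {
            int32_t t = a[i]; a[i] = a[split]; a[split++] = t;
        }

    size_t neg = 0, pos = split;            /* next unused negative / non-negative */
    int32_t s = 0;
    for (size_t used = 0; used < n; used++) {
        size_t next;
        if (s >= 0)                         /* prefer a value of the opposite sign */
            next = (neg < split) ? neg++ : pos++;
        else
            next = (pos < n) ? pos++ : neg++;
        if (!checked_add32(s, a[next], &s))
            return false;                   /* the total cannot fit in 32 bits */
    }
    *sum = s;
    return true;
}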

Is there a technique that takes advantage of the permission to overflow partial sums, and is faster on a typical machine with an overflow flag, a cache, a branch predictor, and speculative execution and loads?

(This is a follow-up to Overflow safe summation)

Gilles 'SO- stop being evil'

2 Answers


You can add $n$ numbers of size $w$ without any overflow if you use $\lceil \log_2 n\rceil + w$-bit arithmetic. My suggestion is to do just that and then check whether the result is in range. Algorithms for multiprecision arithmetic are well known (see TAOCP section 4.3 if you need a reference); there is often hardware support for addition (a carry flag and an add-with-carry instruction), and even without such support you can implement it without data-dependent jumps (which is good for branch predictors). You need just one pass over the data, and you may visit the data in the most convenient order (which is good for the cache).
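For instance, with $w = 64$ and $n \le 2^{30}$ values, $\lceil \log_2 n\rceil + w \le 94$ bits, so a two-word (128-bit) accumulator is already more than enough.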

If the data doesn't fit in memory, the limiting factor will be the IO and how well you succeed in overlapping the IO with the computation.

If the data fit in memory, you'll probably have $\lceil \log_2 n\rceil \leq w$ (the only exception I can think of is an 8-bit microprocessor, which usually has 64K of memory), which means you are doing double-precision arithmetic. The overhead over a loop doing $w$-bit arithmetic can be just two instructions (one to sign-extend, the other to add with carry) and a slight increase in register pressure (but if I'm right, even the register-starved x86 has enough registers that the only memory access in the inner loop can be the data fetch). I think it is probable that an out-of-order processor will be able to schedule the additional operations during the memory load latency, so the inner loop will execute at memory speed and the exercise becomes one of maximising the use of the available bandwidth (prefetching or interleaving techniques could help, depending on the memory architecture).
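As an illustration, here is a rough C sketch of that inner loop for $w = 64$ on a two's-complement target (the function name is made up; whether the compiler turns the carry handling into an actual add-with-carry instruction depends on the target, but the per-element work is the two extra operations described above):

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Double-precision accumulation: the running sum is kept as a two-word
   (128-bit) two's-complement value, so no partial sum can overflow for any
   realistic n; a single range check at the end tells whether the result
   fits in 64 bits. */
static bool sum_check(const int64_t *a, size_t n, int64_t *sum)
{
    uint64_t lo = 0, hi = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t x  = (uint64_t)a[i];                /* low word of sign-extended a[i] */
        uint64_t xh = a[i] < 0 ? UINT64_MAX : 0;     /* high word: the sign extension  */
        uint64_t t  = lo + x;
        hi += xh + (t < lo);                         /* add high words plus the carry  */
        lo  = t;
    }
    /* The 128-bit value fits in int64_t iff the high word is just the
       sign extension of the low word. */
    if (hi != ((lo >> 63) ? UINT64_MAX : 0))
        return false;                                /* the sum overflows 64 bits */
    *sum = (int64_t)lo;                              /* two's-complement reinterpretation */
    return true;
}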

Considering the last point, it is difficult to think of other algorithms with better performance. Data-dependent (and thus unpredictable) jumps are out of the question, as are multiple passes over the data. Even trying to use the several cores of today's processors would be difficult, as the memory bandwidth will probably be saturated, but it could be an easy way to implement interleaved access.

AProgrammer

On a machine where integer types behave as an abstract algebraic ring [basically meaning that they wrap], one could compute the sums of item[i] and of (item[i] >> 16) for up to about 32767 32-bit items. The first value would give the lower 32 bits of the correct sum. The second would yield bits 16-47 of something close to the correct sum, and using the first value it may easily be adjusted to yield bits 16-47 of the exact correct sum.

Pseudocode would be something like:

Sum1 = 0 : Sum2 = 0
For each of up to 32768 items L[i] in list
  Sum1 = Sum1 + L[i]
  Sum2 = Sum2 + (L[i] >> 16) ' Use sign-extending shift
Next
Sum1MSB = Sum1 >> 16 ' Arithmetic shift; cannot use division if numbers can be negative -- see below
Sum2Mid = Sum2 And 65535
Sum2Adj = Sum1MSB - Sum2Mid
If Sum2Adj < 0 Then Sum2Adj = Sum2Adj + 65536 ' Reduce modulo 65536; the adjustment is in [0, 32767]
Sum2 = Sum2 + Sum2Adj

After the above code, Sum2 and Sum1 together should yield the correct sum, regardless of intervening overflows. If it's necessary to total more than 32768 numbers, they may be divided into groups of 32768, and after computing Sum2 for each group, one can add it to a two-variable "big sum" for all the groups as a whole.
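For reference, here is one possible C rendering of the above (the function and variable names are mine), assuming 32-bit items, at most 32768 of them, and a two's-complement target where signed right shift is arithmetic:

#include <stdint.h>
#include <stddef.h>

/* On return, *sum1_out holds bits 0-31 and *sum2_out bits 16-47 of the
   exact sum of the n items, regardless of intervening wraparound. */
static void sum_group(const int32_t *item, size_t n,        /* n <= 32768 */
                      uint32_t *sum1_out, int32_t *sum2_out)
{
    uint32_t sum1 = 0;   /* wrapping sum of the items */
    int32_t  sum2 = 0;   /* sum of (item >> 16); cannot overflow for n <= 32768 */
    for (size_t i = 0; i < n; i++) {
        sum1 += (uint32_t)item[i];
        sum2 += item[i] >> 16;               /* sign-extending shift */
    }
    /* Fold the carries out of the low 16 bits into sum2. */
    int32_t sum1_msb = (int32_t)sum1 >> 16;  /* bits 16-31 of the sum, signed */
    int32_t sum2_mid = (int32_t)((uint32_t)sum2 & 0xFFFFu);
    int32_t adj = sum1_msb - sum2_mid;
    if (adj < 0) adj += 65536;               /* reduce modulo 65536; adj is in [0, 32767] */
    sum2 += adj;
    *sum1_out = sum1;
    *sum2_out = sum2;
}

The exact total is then Sum2 * 65536 + (Sum1 And 65535); in particular, for the original question, the sum fits in a signed 32-bit register exactly when the adjusted Sum2 lies in [-32768, 32767].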

In some languages, the shift-right operator could be replaced by a division by 65536. That generally works when computing Sum2, but not when extracting Sum1MSB. The problem is that some languages round division toward zero, while it's necessary here to perform a division that rounds to the next lower number (toward negative infinity). Errors computing Sum2 would get corrected later on, but errors computing Sum1MSB would affect the final result.

Note that nothing in the final results would indicate whether any of the computations involving Sum1 had "overflowed", but if values are guaranteed to wrap cleanly, the code shouldn't need to care whether any overflow occurred.

supercat