
I'm not sure if this question should be asked here...

For a general floating point system defined using the tuple $(\beta, t, L, U)$, where $\beta$ is the base, $t$ is the number of bits in the mantissa, $L$ is the lower bound for the exponent and $U$ is similarly the upper bound for the exponent, the rounding unit is defined as $$r = \frac{1}{2}\beta^{1 - t}$$

If I try to calculate the rounding unit for a single-precision IEEE floating-point number, which has a 24-bit mantissa (23 explicit bits and 1 implicit bit), I obtain:

$$r = \frac{1}{2}2^{1 - 24} = \frac{1}{2}2^{-23} = 2^{-24}$$

which happens to be (using Matlab)

$$5.960464477539062 \times 10^{-8}$$

which seems to be half of

eps('single')

that is, the machine precision for single-precision floating-point numbers in Matlab. The machine precision should be the distance from $1$ to the next larger floating-point number, from my understanding.
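
As a quick numerical check, here is a minimal Matlab sketch (assuming a Matlab version whose eps accepts a class name, as used above):

    r = 0.5 * 2^(1 - 24)   % rounding unit from the formula: 5.9605e-08
    e = eps('single')      % machine precision for single: 1.1921e-07
    e / 2 == r             % returns logical 1: the rounding unit is exactly eps/2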

If I do the same thing for double precision, the rounding unit again turns out to be half of the machine precision, which is as follows:

eps = 2.220446049250313e-16
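
The same check for double precision, whose mantissa has 53 bits (52 explicit and 1 implicit), is the following sketch:

    r = 0.5 * 2^(1 - 53)   % rounding unit for double: 1.1102e-16 = 2^-53
    eps / 2 == r           % returns logical 1: eps = 2^-52 is again twice r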

Why is that?

What's the relation between machine precision and rounding unit?

I think I understand what the rounding unit is: given a real number $x$, it guarantees that $fl(x)$ (the floating-point representation of $x$) is no farther than this unit, in relative terms, from the actual $x$, correct?
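
To make that concrete, here is a hedged Matlab sketch (the value 0.1 is just an arbitrary example of a number that single precision cannot represent exactly):

    x  = 0.1;
    fx = double(single(x));          % fl(x): x rounded to single precision
    relErr = abs(fx - x) / abs(x)    % about 1.5e-08
    relErr <= 0.5 * 2^(1 - 24)       % true: within the single-precision rounding unit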

But then what's this machine precision or epsilon?

Edit

If you look at the table in this Wikipedia article, there are two columns named "machine epsilon", and the values in one of them (the rounding unit) seem to be half of the corresponding values in the other (the machine precision).

https://en.wikipedia.org/wiki/Machine_epsilon

  • At the outset you defined $t$ to be "number of bits in the mantissa". Applying this to the single precision floating point computation you included the "implicit bit" in your count for $t$. However it seems to me that should not have been counted, since the "implicit bit" is essentially the binary leading one to the left of the (normalized-by-exponent) radix point. – hardmath Apr 10 '16 at 20:25
  • It makes sense to count the implicit bit in the bit precision. The size of the rounding unit should shrink with greater precision and with more bits "in the mantissa". But I suspect your formula treats the base $\beta$ mantissa size as places to the right of the radix point. Perhaps you can give a citation for the definition if it differs from the Wikipedia article? Also check the origins of the word "mantissa", the part of a logarithm that follows the decimal point. – hardmath Apr 10 '16 at 20:37
  • Thanks. Note that on page 19 they give the formula (for binary floating point) $\eta = \frac{1}{2} \times 2^{-t}$ which they say "has been called rounding unit, machine precision, machine epsilon, and more." On the previous page $t$ is used to number the bits to the right of the (floating) radix point. In both places they refer the Reader to Section 2.2 for "[f]urther details" and for $\eta$ to be "more generally defined". – hardmath Apr 10 '16 at 21:17
  • Include "the implicit $1$" in what? "Omit the $+1$" from what? The operational meaning of the formula they give for $\eta$ depends on how $t$ is defined, and I've pointed out that the numbering of bits by those authors at the beginning of the chapter counts only the $t$ bits to the right of the radix point. – hardmath Apr 10 '16 at 21:42
  • Perhaps in the first line on page 22 the authors make the point that you are, that in the special case base $\beta = 2$ we can avoid explicitly storing the leading $1$ (on any nonzero floating point values). Therefore the numbering of the "places of precision" $t$ for general base $\beta$ in Sec. 2.2 is "off by one" from how it was numbered in Sec. 2.1. – hardmath Apr 10 '16 at 22:21
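
For reference, the two conventions for $t$ discussed in these comments reconcile as follows (a sketch, writing $t$ for the total number of digits and $t' = t - 1$ for the digits to the right of the radix point): $$\eta = \frac{1}{2}\beta^{1-t} = \frac{1}{2}\beta^{-(t-1)} = \frac{1}{2}\beta^{-t'},$$ so both formulas give the same value, e.g. $2^{-24}$ for IEEE single precision with $t = 24$ and $t' = 23$.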

1 Answer


To illustrate what happens, I constructed a toy example with $\beta=2$ and $t=2$ and plotted the relative error (due to rounding) $$ \frac{|fl(x)-x|}{x}=\frac{|2^e\times1.ab-x|}{x} $$ where $e=\lfloor \log_2 x\rfloor$ and $1.ab$ is a binary number representing $x/2^e$ rounded off to two binary digits.

Here is the graph of this relative error along with the horizontal line $y=\tfrac12\times 2^{-2}=0.125$:

[Figure: the relative rounding error as a function of $x$, together with the horizontal line $y = 0.125$.]
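
A rough Matlab sketch that reproduces this kind of plot (round-to-nearest via round, two fractional bits; the variable names are my own):

    x  = linspace(0.4, 4, 4000);
    e  = floor(log2(x));                    % exponent so that x ./ 2.^e lies in [1, 2)
    m  = x ./ 2.^e;                         % significand 1.ab... before rounding
    fl = 2.^e .* round(m * 4) / 4;          % keep two fractional bits: 1.ab
    relErr = abs(fl - x) ./ x;
    plot(x, relErr, x, 0.125 + 0*x, '--')   % error curve and the bound 0.125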

We can see that the relative rounding error is maximal just a little after $0.5,1$ and $2$. In fact the pattern repeats since each time the exponent $e$ is shifted by one, the number $x/2^e$ runs through the exact same values $[1,2)$ again.


The maximal relative error is attained exactly where $x=2^e\times (1+\tfrac12 \times 2^{-2})$, or simplified, $x=2^e\times 1.125$. In base $2$ the number $1.125$ is $1.001_2$ and is rounded up to $1.01_2=1.25$. Therefore $$ \frac{|fl(1.125)-1.125|}{1.125}=\frac{|2^0\times1.25-1.125|}{1.125}=\frac{0.125}{1.125}\approx 0.111, $$ which is a bit less than the bound $0.125$.
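
In Matlab terms, that worked example is simply (a sketch):

    x   = 1.125;          % the first rounding midpoint after 1 (1.001 in binary)
    flx = 1.25;           % rounded up to two fractional bits: 1.01 in binary
    abs(flx - x) / x      % 0.1111, a bit below the bound 0.125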


So why is this rounding unit (the upper bound on the relative error) only half the size of the machine precision? Well, you can actually see it on the graph between $1$ and $2$: the distance between the valleys (zero error/full precision) is $2^{-2}=0.25$, which is the machine precision of this toy system and roughly twice the height of the peaks of maximal error.
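
The same factor of two shows up in IEEE double precision, where eps reports the spacing and half of it is the rounding unit (a sketch):

    spacing = eps(1)        % 2.2204e-16 = 2^-52: distance from 1 to the next double
    bound   = spacing / 2   % 1.1102e-16 = 2^-53: upper bound on the relative rounding error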

– String