Number representation in computers#

In this chapter we are going to gain a basic understanding of how numbers are represented in today’s computers. We will not go too much into the details (which are perhaps not so relevant here), but it is extremely important for scientists to understand the pitfalls of the so-called “floating-point arithmetics”. At the end of this lecture, you should understand what these pitfalls are and when they occur, since sooner or later one of these pitfalls will hit you (believe me on that one).

Binary representation of integers#

Traditional arithmetics and mathematics rely on the decimal system. Every number can be decomposed into a sum of products such as: \(908.2 = 9\times10^2 + 0\times10^1 + 8\times10^{0} + 2\times10^{-1}\). This system is called decimal because the factors can take the values \([0 - 9]\) while the fixed base (the number which is scaled by an exponent) is 10.

Nearly all modern processors, however, represent numbers in binary form, with each digit being represented by a two-valued physical quantity such as a “low” or “high” voltage: 0s and 1s. These binary digits are called bits, and are the basis of the binary representation of numbers.

Basics#

With only 0s and 1s available the most efficient way to represent numbers is as a sum of 2s to the power of i, with i going from 0 to N-1. It is best shown with an example, here with N=4 digits:

0101

To convert our number to a decimal representation we compute \(0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = 5\). In this convention, the leftmost digit is called the most significant bit and the rightmost the least significant bit. If all the elements from the left are zeros, the convention is to omit them when writing. This is for example what the built-in bin function chooses to do:

bin(5)
'0b101'

which converts an integer number to a binary string prefixed with 0b.

Exercise 15

Compute (manually) the binary representation of the numbers 6, 12, and 31. Check with bin() that you are correct

Now it appears quite clear that the number of different integers that a computer can represent with this system depends on the size N of the binary sequence. A positive integer represented with a byte (8 bits are called a byte) can thus be as large as:

sum([2**i for i in range(8)])  # do you understand what this line does?
255

but not larger. Unless specified otherwise, the first bit is often used to give the sign of the integer it represents. Therefore, the actual range of numbers that a byte can represent is \([-2^7; 2^7-1]= [-128; 127]\) (the reason for this asymmetry is a matter of definition, as we will see later). If you are sure that you only want to do arithmetics with positive numbers you can spare one bit and specify your binary number as being of unsigned integer type.

So how many bits does our computer use to represent integers? Well, it depends on the platform and programming language you are using. Many older languages (including old python versions) made the distinction between short (32 bits) and long (64 bits) integers.

Exercise 16

What is the largest number that a 64-bit integer can represent? The smallest? And a 32-bit unsigned integer?

Now, what is the default length of binary integers in python 3? Let’s try to find out:

from sys import getsizeof
a = 10
getsizeof(a)
28

So 28 bytes? That is a lot of bits for the number 10. This is because the getsizeof function returns the memory consumption of the object int(10). What does that mean? Well, in python numbers are not only numbers, they are also “things” (objects). And these things come with services, like for example the bit_length method:

a.bit_length()
4

Exercise 17

What does bit_length() do? What is the bit length of 5? of 127?

These services have a memory cost (an “overhead”), and are required no matter how big our number is. This overhead is 24 bytes for an integer.

def get_int_bitsize(integer):
    """Returns the true memory consumption of an integer (in bits) without the object overhead."""
    return (getsizeof(integer) - 24) * 8
get_int_bitsize(2)
32

Ha! This looks more reasonable. So python uses 32 bits to store integers? But then, the largest number it can manipulate must be very small? Let’s see if we can create a number larger than 2**32-1:

2**68
295147905179352825856
get_int_bitsize(2**18), get_int_bitsize(2**68), get_int_bitsize(2**100000)  
(32, 96, 106688)

As shown above, it turns out that modern python versions have no limitations on the size of integers (other than the total memory available on your computer). The memory slot used to store your number simply depends on how large it is.

get_int_bitsize(2**100000) / 8 / 1024
13.0234375

So the number 2**100000 requires about 13 KB (Kilobytes) of memory.

Exercise 18

Print 2**100000 on screen, “just for fun”.

Overflow#

This dynamic resizing of integers in python means that they cannot overflow. Overflow is a common pitfall of many numerical operations, and to illustrate what it is we can either use floats (unlike python integers, python floats can overflow), or numpy, which uses integer types of fixed length:

import numpy as np
a = np.int8(127)
a.dtype
dtype('int8')
a
np.int8(127)
a + np.int8(1)
/tmp/ipykernel_298682/3783741305.py:1: RuntimeWarning: overflow encountered in scalar add
  a + np.int8(1)
np.int8(-128)

What happened here? To understand this we need to understand how binary numbers are added together first. Please read the addition chapter of the Wikipedia article on binary numbers before going on.

Basically, we added 1 (binary 00000001) to 127 (binary 01111111) which gives us the binary number 10000000, i.e. -128 in two’s complement representation, which is the most common method of representing signed integers on computers. At least, python warned us that we are doing something wrong here. But be aware that this is not always the case:

np.array(2147483648).astype('int32')
array(-2147483648, dtype=int32)
a = np.array([18])
a**a
array([-497033925936021504])

Warning

These are examples of silent overflows. Silent means that they do not warn you about the probable mistake and could happen in your code without you noticing that this happens.

Extending the binary representation to non-integers: fixed-point notation#

A more general definition of the binary representation for integers is to use negative exponents as well. With negative exponents, any rational number \(a\) can be approximated by:

\[a = \pm \sum_{i=-j}^{k} z_i b^i\]

with \(b > 1\) (base), \(0 \le z_i \le b-1\), and \(j\) and \(k\) being positive integers. The precision depends on the size of \(j\) while the maximum range depends on the size of \(k\).

In the notation, a fixed point separates digits of positive powers of the base, from those of negative powers. In base 2 (binary), the number \(10000.1_2\) is equal to \(16.5_{10}\) in base 10.

Indeed:

\[1\times2^4 + 0\times2^3 + 0\times2^2 + 0\times2^1 + 0\times2^0 + 1\times2^{-1} = 16.5_{10}\]

Although this representation is convenient, the representable value range is heavily limited by the number of bits available to represent the number (see this week’s assignments). Therefore, most computers today are relying on the floating point notation to represent real numbers.

Floating point numbers#

Introduction#

Because computers have a finite number of storage units available, they can only represent a finite number of distinguishable values. In fact, a memory slot with \(N\) available bits cannot represent more than \(2^N\) distinguishable values. The range of real (or complex) numbers is of course infinite, and therefore it becomes clear that in the computer representation of numbers there will always be a trade-off between the range of numbers one would like to represent and their relative accuracy (i.e. the absolute difference between two consecutive representable numbers).

Taking the decimal representation of the number 1/3 as an example: it can be written as 0.33, 0.333, 0.3333, etc. Depending on the numbers of digits available, the precision of the number will increase but never reach the exact value, at least in the decimal representation.

This fundamental limitation is the explanation for unexpected results of certain arithmetic operations. For example:

0.1 + 0.1  # so far so good
0.2
0.1 + 0.2  # wtf?
0.30000000000000004

This is a typical rounding error, happening because most computers do not represent numbers as decimal fractions, but as binary. Without going too much into details (which can be tricky), this chapter will give you some elements of understanding in order to prepare you for the most common pitfalls of floating point arithmetics.

Scientific (exponential) notation in base b#

In the exponential notation (used by floating point numbers), a number is approximated with a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

\(\mathrm{number} = \mathrm{significand} \times \mathrm{base} ^{\mathrm{exponent}}\)

For example, in base 10:

\[1.234 = 1234 \times 10^{-3}\]

The number 1.234 can easily be represented exactly in base 10, but the number 1/3 cannot. However, in base 3 (which is just used here as an example) 1/3 can be represented exactly by \(1 \times 3^{-1}\).

To approximate a number in base 10, the rule of “the more bits you have, the more precise you are” continues to hold true: \(33 \times 10^{-2}\) and \(33333333333333 \times 10^{-14}\) are two ways to approximate the number, the latter being more expensive in terms of storage requirements but more accurate.

The exponential notation is the most common way to represent real numbers in computers and is the basis of the floating-point number representation.

Floating point representation (simplified)#

A floating point number in any base will store three numbers:

  • the sign (one bit)

  • the significand (\(N_s\) bits)

  • the exponent (\(N_e\) bits)

The numbers \(N_s\) and \(N_e\) are usually fixed beforehand, in the format specification. The base also needs to be specified of course: computers usually work in base 2, but other bases have been experimented with as well (e.g. 16, or hexadecimal). Now remember:

  • the significand determines the precision of the representation (significant digits)

  • the exponent determines the magnitude of the represented number

Let’s make an example to illustrate this concept. We will work in base 10 for simplicity, and assume that we have 8 “memory slots” (the equivalent of bits) available, each memory slot capable of storing a number from 0 to 9 or the sign (+ or -). We attribute \(N_s=6\) slots to the significand (including its sign) and \(N_e=2\) slots to the exponent (including its sign).

Now, what is the smallest positive number we can represent? And the biggest? Let’s try it:

  • smallest: \(+00001 \times 10^{-9}\)

  • second smallest: \(+00002 \times 10^{-9}\)

  • biggest: \(+99999 \times 10^{+9}\)

  • second biggest: \(+99998 \times 10^{+9}\)

From these examples it becomes apparent that the precision (the distance between two consecutive numbers) is better for small numbers than for large numbers. Although our example is simplified, the principle is exactly the same for “real” floating point numbers in most computers, which follow the IEEE754 convention.

IEEE 754#

From the example above, we can see that with a fixed number of memory slots, we have a trade-off between the maximum precision of a number and its size.

This precision/size trade-off raises many challenges: memory and electronics must be optimized so that computations are fast while still allowing programs to realize computations on a wide range of numbers in the same program. For example, atmospheric models realize computations on specific humidity values (\(\mathrm{kg}\,\mathrm{kg}^{-1}\)) with typical values of 10\(^{-5}\) and on the geopotential with values up to several orders of magnitude larger. Using different standards for each variable would be impracticable. This led to the development of the IEEE Standard for Floating-Point Arithmetic (IEEE 754).

The standard defines five basic formats that are named for their numeric base and the number of bits used in their encoding, as listed in this table. For example, binary64 is the famous “double precision” format, called float64 in numpy and simply float in the python standard library. In this format, \(N_s=53\) bits are used for the significand, and the remaining \(N_e=11\) for the exponent.

It is possible to compute the approximate precision of IEEE754 floating point numbers according to their value (see also the exercises):

img Source: wikipedia

Consequences#

With the floating point format, small numbers have a larger absolute precision than large numbers. See these examples:

.99 == .98  # so far so good
False
999999999999999.99 == 999999999999999.98  # wtf?
True
999999999999999.99  # wtf?
1000000000000000.0

A further direct consequence is that when summing two numbers, precision is lost to match the size of the outcome.

Overflow#

Like numpy integers, floating point numbers can overflow:

np.float16(65510) + np.float16(20)
/tmp/ipykernel_298682/3724677978.py:1: RuntimeWarning: overflow encountered in scalar add
  np.float16(65510) + np.float16(20)
np.float16(inf)

Fortunately, the IEEE 754 standard defines two new numbers (-inf and +inf) which are more informative (and less dangerous) than the negative reset of overflowing integer numbers. IEEE 754 also defines “Not A Number” (abbreviated NaN), which propagates through computations:

np.nan * 10
nan

What to do from here?#

As we have learned, errors in floating-point numbers are unavoidable. Even if these errors are very small, simple calculations on approximated numbers can contain pitfalls that increase the error in the result way beyond just having the individual errors “add up”. Here we discuss some possible ways to deal with the pitfalls of floating point arithmetics.

Alternative data types#

In certain cases where perfect decimal accuracy is needed (for example when dealing with currencies and money), it is possible to use a decimal floating point representation instead of the default binary one:

1/10*3  # not precise
0.30000000000000004
from decimal import Decimal
Decimal(1) / Decimal(10) * 3  # precise
Decimal('0.3')

With limited-precision decimals there are no unexpected rounding errors. In practice, however, such alternative datatypes are used rarely because the precision gain comes with a performance overhead: computers work best with 0s and 1s, not with numbers humans can read.

Symbolic computations#

Symbolic computations are realized literally (like in mathematics), not approximately. SymPy is a popular python library for symbolic mathematics:

import sympy
a = sympy.sympify('1 / 3')
a + a
\[\displaystyle \frac{2}{3}\]

Seams like the perfect solution, right? It probably is if you are a mathematician, but for actual numerical computations SymPy will be way too slow to use. Symbolic mathematics can only be used for problems where analytical solutions are known. Unfortunately, this is not always the case (take numerical models of the atmosphere for example).

Deal with it#

There are no simple answers to numerical rounding errors. Therfore: be aware that they occur and try to mitigate their effect.

Awareness#

Awareness is mostly hindered by the string representation of floating point numbers. In practice:

0.1
0.1
format(0.1, '.16g')  # give 16 significant digits
'0.1'
format(0.1, '.30g')  # give 30 significant digits
'0.100000000000000005551115123126'

The default 0.1 print is therefore a “lie”, but it is a useful one: in most cases you do not want to know about these insignificant digits at the end. The numpy.finfo is a useful function informing you about the machine limits for floating point types:

info = np.finfo(np.float64)
info.bits, info.precision, info.max
(64, 15, np.float64(1.7976931348623157e+308))

Error propagation#

Preventing rounding errors to happen is not possible, but there are a few general rules:

  • Multiplication and division are “safer” operations

  • Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.

  • This loss of digits can be inevitable and benign (when the lost digits are also insignificant for the final result) or catastrophic (when the loss is magnified and distorts the result strongly).

  • The more calculations are done (especially when they form an iterative algorithm) the more important is it to consider this kind of problem.

  • A method of calculation can be stable (meaning that it tends to reduce rounding errors) or unstable (meaning that rounding errors are magnified). Very often, there are both stable and unstable solutions for a problem.

(list taken from the floating point guide)

As illustration for the difference between addition and multiplication, see the following example:

a = 10 * 0.1
b = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
a == b
False
a, b
(1.0, 0.9999999999999999)

Realize safer computing therefore involves asking yourself at which stage the most precision is going to be lost: this is most often the case when adding numbers of very different magnitudes. When building numerical models, this should always be something you consider: is a formula leading to dangerous additions, then reformulate it and/or use other units for your variables (e.g. \(\mathrm{g}\,\mathrm{kg}^{-1}\) instead of \(\mathrm{kg}\,\mathrm{kg}^{-1}\) for specific humidity). Consider the following example:

a = 1
b = 1e-16
c = 10
c1 = c * (a + b)
c2 = c * a + c * b
c1 == c2
False

c * (a + b) and c * a + c * b are mathematically equivalent, and the first is computationally less expensive (two operations instead of three). However, the second is less prone to rounding errors!

Safer tests#

Fortunately, rounding errors often remain unnoticed, meaning that your computations are probably OK! In our field in particular, we often do not care if the post-processed temperature forecast is numerically precise at 0.001° since the forecast accuracy is much lower anyway. However, this can still lead to surprises when comparing arrays for equality (e.g. when testing that a temperature is equal to zero, or for matching coordinates like longitude or latitude). In these cases, prefer to use numpy’s specialized functions:

np.isclose(c1, c2)  # OK if you don't care about small numerical errors
np.True_

Take home points#

  • computers can only represent a finite number of distinguishable values.

  • the range of representable numbers depends on the size of memory allocated to store it. There is practically no limit to the size of integers in python, but there is a limit for floats. Numpy implements several types of variables named after their size in bits (e.g. np.float32, np.float64, np.float128).

  • there are many different ways to represent numbers on computers, all with different strengths and weaknesses. The vast majority of systems use the IEEE 754 standard for floating points, which is a good compromise between range and accuracy. Most systems are binary (base 2) per default, but there are other bases available: base 10 (decimal) and base 16 (hexadecimal) are frequent as well.

  • rounding errors happen because of these limitations. They always happen, even for the simplest arithmetics, and you shall not ignore them.

  • the fact that a number is printed “correctly” on screen does not mean that its internal binary representation is perfect. In fact, it is statistically much more probable (inifinitely more probable) that a number is not represented exactly by the floating-point format

  • however, there are ways to mitigate the impact of these rounding errors. This includes the use of more precise datatypes (e.g. float64 instead of float32), alternative representations (decimal instead of binary), and the use of more conservative operations (e.g. multiplication before addition when dealing with numbers of different magnitude)

  • floating point errors can have dramatic consequences in chaotic systems. A scary example is given in this paper about the influence of floating point computations on numerical weather forecasts.

Further reading#

Because of the importance of floating point computations you will find many resources online. I highly recommend to go through the short floating point guide website, which explains the problem to non specialists. It will give you another point of view on the topic.

Other resources: