Introduction to NumPy: why does NumPy exist?#

You might have read somewhere that Python is “slow” in comparison to some other languages. While generally true, this statement has only little meaning without context. As a scripting language (e.g. to simplify tasks such as file renaming, data download, etc.), Python is fast enough. For numerical computations (like the computations done by an atmospheric model or by a machine learning algorithm), “pure” Python is very slow indeed. Fortunately, there is a way to overcome this problem!

In this chapter we are going to explain why the NumPy library was created. NumPy is the fundamental library which transformed the general purpose Python language into a scientific language like MATLAB, R or IDL.

Before introducing NumPy, we will discuss some of the differences between Python and compiled languages widely used in scientific software development (like C and FORTRAN).

Why is Python “slow”?#

In the next unit about numbers, we will learn that the memory consumption of a Python int is larger than the memory needed to store the binary number alone. This overhead in memory consumption is due to the nature of Python data types, which are all objects. We have already learned that these objects come with certain “services”.

Everything is an object in Python. Yes, even functions are objects! Let me prove it to you:

def useful_function(a, b):
    """This function adds two objects together.

    Parameters
    ----------
    a : an object
    b : another object

    Returns
    -------
    The sum of the two
    """
    return a + b
type(useful_function)
function
print(useful_function.__doc__)
This function adds two objects together.

Parameters
----------
a : an object
b : another object

Returns
-------
The sum of the two

Functions are objects of type function, and one of their attributes (__doc__) gives us access to their docstring. During the course of the semester you are going to learn how to use more and more of these object features, and hopefully you are going to like them more and more.

Now, why does this make Python “slow”? Well, in simple terms, these “services” tend to increase the complexity and the number of operations an interpreter has to perform when running a program. More specialized languages will be less flexible than Python, but will be faster at running specialized operations and be less memory hungry (because they do not need this overhead of flexible memory on top of every object).

Python’s high-level of abstraction (i.e. Python’s flexibility) makes it slower than its lower-level counterparts like C or FORTRAN. But, why is that so?

Dynamically versus statically typed languages#

Python is a so-called dynamically typed language, which means that the type of a variable is determined by the interpreter at run time. To understand what that means in practice, let’s have a look at the following code snippet:

a = 2
b = 3
c = a + b

The line c = a + b is valid Python syntax. The operation that has to be applied by the + operator, however, depends on the type of the variables to be added. Remember what happens when adding two lists, for example:

a = [2]
b = [3]
a + b
[2, 3]

In this simple example it would be theoretically possible for the interpreter to predict which operation to apply beforehand (by parsing all lines of code prior to the action). In most cases, however, this is impossible: for example, a function taking arguments does not know beforehand the type of the arguments it will receive.

Languages which assess the type of variables at run time are called dynamically typed programming languages. MATLAB, Python, Julia or R are examples falling in this category.

Statically typed languages, however, require the programmer to provide the type of variables while writing the code. Here is an example of a program written in C:

#include <stdio.h> 
int main ()
{
    int a = 2;
    int b = 3;
    int c = a + b;
    printf ("Sum of two numbers : %d \n", c);
}

The major difference with the Python code above is that the programmer indicated the type of the variables when they are assigned. Variable type definition in the code script is an integral part of the C syntax. This applies to the variables themselves, but also to the output of computations. This is a fundamental difference to Python, and comes with several advantages. Static typing usually results in code that executes faster: since the program knows the exact data types that are in use, it can predict the memory consumption of operations beforehand and produce optimized machine code. Another advantage is code documentation: the statement int c = a + b makes it clear that we are adding two numbers while the Python equivalent c = a + b could produce a number, a string, or a list.

Compiled versus interpreted languages#

Statically typed languages often require compilation. To run the C code snippet you have to create a new text file (example.c), write the code, compile it ($ gcc -o myprogram example.c), before finally being able to execute it ($ ./myprogram).

gcc is the compiler I used to translate the C source code (a text file) to a low level language (machine code) in order to create an executable (myprogram). Later changes to the source code require a new compilation step for the changes to take effect.

Because of this “edit-compile-run” cycle, compiled languages are not interactive: in the C language, there is no equivalent to Python’s command line interpreter. Compiling complex programs can take up to several hours in some extreme cases. This compilation time, however, is usually associated with faster execution times: as mentioned earlier, the compiler’s task is to optimize the program for memory consumption by source code analysis. Often, a compiled program is optimized for the machine architecture it is compiled on. Like interpreters, there can be different compilers for the same language. They differ in the optimization steps they undertake to make the program faster, and in their support of various hardware architectures.

Interpreters do not require compilation: they analyze the code at run time. The following code for example is syntactically correct:

def my_func(a, b):
    return a + b

but the execution of this code results in a TypeError when the variables have the wrong type:

my_func(1, '2')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 my_func(1, '2')

Cell In[7], line 2, in my_func(a, b)
      1 def my_func(a, b):
----> 2     return a + b

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The interpreter cannot detect these errors before runtime: they happen when the variables are finally added together, not when they are created.

Parenthesis I: Python bytecode

When importing a Python module, the CPython interpreter creates a directory called __pycache__. This directory contains bytecode files, which are your Python source code files translated to binary files. This is an optimization step which makes subsequent executions of the program run faster. While this conversion step is sometimes called “compilation”, it should not be mistaken with a C-program compilation: indeed, Python bytecode still needs an interpreter to run, while compiled executables can be run without a C interpreter.

Parenthesis II: static typing and compilation

Statically typed languages are often compiled, and dynamically typed languages are often interpreted. While this is a good rule of thumb, this is not always true and the vast landscape of programming languages contains many exceptions. This lecture is only a very short introduction to these concepts: you will have to refer to more advanced computer science lectures if you want to learn about these topics in more detail.

Here comes NumPy#

Let’s summarize the two previous sections:

  • Python is flexible, interactive and slow

  • C is less flexible, non-interactive and fast

This is a simplification, but not far from the truth.

Now, let’s add another obstacle to using Python for science: the built-in list data type in Python is mostly useless for arithmetics or vector computations. Indeed, to add two lists together element-wise (a behavior that you would expect as a scientist), you must write:

def add_lists(A, B):
    """Element-wise addition of two lists."""
    return [a + b for a, b in zip(A, B)]
add_lists([1, 2, 3], [4, 5, 6])
[5, 7, 9]

The NumPy equivalent is much more intuitive and straightforward:

import numpy as np

def add_arrays(A, B):
    return np.add(A, B)
add_arrays([1, 2, 3], [4, 5, 6])
array([5, 7, 9])

Let’s see which of the two functions runs faster:

n = 10
A = np.random.randn(n)
B = np.random.randn(n)
%timeit add_lists(A, B)
1.97 μs ± 46.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit add_arrays(A, B)
518 ns ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Numpy is approximately 4 times faster.

Exercise 13

Repeat the performance test with n=100 and n=10000. How does the performance scale with the size of the array? Now repeat the test but make the input arguments A and B lists instead of NumPy arrays. How is the performance comparison in this case? Why?

Why is NumPy so much faster than pure Python? One of the major reasons is vectorization, which is the process of applying mathematical operations to all elements of an array (“vector”) at once instead of looping through them like we would do in pure Python. “for loops” in Python are slow because for each addition, Python has to:

  • access the elements a and b in the lists A and B

  • check the type of both a and b

  • apply the + operator on the data they store

  • store the result.

NumPy skips the first two steps and does them only once before the actual operation. What does NumPy know about the addition operation that the pure Python version cannot infer?

  • the type of all numbers to add

  • the type of the output

  • the size of the output array (same as input)

NumPy can use this information to optimize the computation, but this is not possible without trade-offs. See the following for example:

add_lists([1, 'foo'], [3, 'bar'])  # works fine
[4, 'foobar']
add_arrays([1, 'foo'], [3, 'bar'])  # see what happens to adding 1 and 3
array(['13', 'foobar'], dtype='<U42')

\(\rightarrow\) NumPy can only be that fast because the input and output data types are uniform and known before the operation.

Internally, NumPy achieves vectorization by relying on a lower-level, statically typed and compiled language: C! The core of the numpy codebase is written in C/C++. The rest of the codebase offers an interface (a “layer”) between Python and the internal C code.

As a result, NumPy has to be compiled at installation. Most users do not notice this compilation step anymore (recent pip and conda installations are shipped with pre-compiled binaries), but installing NumPy used to require several minutes on my laptop when I started to learn Python myself.

Take home points#

  • The process of “type checking” may occur either at compile-time (statically typed language) or at runtime (dynamically typed language). These terms are not usually used in a strict sense.

  • Statically typed languages are often compiled, while dynamically typed languages are interpreted.

  • There is a trade-off between the flexibility of a language and its speed: static typing allows programs to be optimized at compilation time, thus allowing them to run faster. But writing code in a statically typed language is slower, especially for interactive data exploration (not really possible in fact).

  • When speed matters, Python allows to use compiled libraries under a Python interface. NumPy is using C under the hood to optimize its computations.

  • NumPy arrays use a continuous block of memory of homogenous data type. This allows for faster memory access and easy vectorization of mathematical operations.

Further reading#

I highly recommend to have a look at the first part of Jake Vanderplas’ blog post, Why Python is slow (up to the “hacking” part). It provides more details and a good visual illustration of the c = a + b example. The second part is a more involved read, but very interesting too!

Addendum 1: is Python really that slow?#

The short answer is: yes, Python is slower than a number of other languages. You will find many benchmarks online illustrating it.

Is it bad?

No. Jake Vanderplas (a well known contributor of the scientific Python community) writes:

As well, it comes down to this: dynamic typing makes Python easier to use than C. It’s extremely flexible and forgiving, this flexibility leads to efficient use of development time, and on those occasions that you really need the optimization of C or Fortran, Python offers easy hooks into compiled libraries. It’s why Python use within many scientific communities has been continually growing. With all that put together, Python ends up being an extremely efficient language for the overall task of doing science with code.

It’s the flexibility and readability of Python that makes it so popular. Python is the language of choice for major actors like instagram or spotify, and it has become the high-level interface to highly optimized machine learning libraries like TensorFlow or Torch.

For a scientist, writing code efficiently is much more important than writing efficient code. Or is it?

Addendum 2: support for type hints in Python#

Since Python version 3.5, the Python interpreter accepts what is called “type annotations”. Type annotations are fully optional and (currently) have no influence on performance. It looks like this:

def greeting(name: str) -> str:
    return 'Hello ' + name

This may look weird but runs just fine. Type annotations are currently mostly used by library developers who use it as a debugging and code cleaning tool, not by “casual” programmers. If you want to know more, you can have a look at MyPy.