Memory hierarchy and computing performance

I was reading Alistair Miles' Go faster Python, an amazing introduction to the subject of profiling and code optimization, and it reminded me that I have been meaning to write at length about some issues related to performance. For now I just want to add a small comment that I think complements Alistair's post.

It is about the importance of the memory hierarchy in code optimization.

You already know a bit about the memory hierarchy, I am sure: hard disks are much bigger than RAM, but they are also much slower. You probably also know that SSDs tend to be much faster than HDDs; indeed, for typical desktop chores the SSD vs HDD choice is probably the single most important factor for your productivity. You probably also know that your CPU has cache memory. For example, on an older server around here every cache level is markedly faster than RAM, with the L1 cache ahead by an order of magnitude or more.
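If you are curious about the cache sizes on your own machine, a minimal sketch like the one below reads them from sysfs (this assumes a Linux system; the set of cache levels varies from CPU to CPU):

import glob

# each index* directory describes one cache of CPU 0 (Linux sysfs layout)
for cache in sorted(glob.glob('/sys/devices/system/cpu/cpu0/cache/index*')):
    with open(cache + '/level') as f:
        level = f.read().strip()
    with open(cache + '/type') as f:
        kind = f.read().strip()
    with open(cache + '/size') as f:
        size = f.read().strip()
    print('L{} {}: {}'.format(level, kind, size))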

Now, in many CPU-intensive applications the biggest bottleneck is not CPU speed but the fact that RAM cannot feed data anywhere near as fast as the CPU can process it. It would help a lot if your algorithms could run at L1 cache speed instead of RAM speed, and you can indeed exploit this when you design them. Let's work out an example.

In our simple example we want to sum all the rows or all the columns of a square matrix, which is trivial with numpy.sum:

import numpy as np

little_array = np.array([[1, 2], [3, 4]])
print('column sum:', little_array.sum(axis=0))  # one value per column: [4 6]
print('row sum:', little_array.sum(axis=1))     # one value per row: [3 7]

Simple, right? And the cost of summing the columns should be the same as summing the rows, right? After all, the matrix is square...

Let's test that:

bigly_size = 2000
orange_array = np.random.rand(bigly_size, bigly_size)  # big, very big!

The timeit results for this are (row sum, axis=1, first):

%timeit -n10 orange_array.sum(axis=1)
10 loops, best of 3: 4.33 ms per loop
%timeit -n10 orange_array.sum(axis=0)
10 loops, best of 3: 2.65 ms per loop

So, two operations with apparently identical computational cost differ in performance by roughly 40%, with the column sum being the faster of the two. Let's study this a bit further and check how the difference behaves as a function of matrix size. We start by coding a function to do some basic timing:

import timeit
import statistics

def time_bigly_array(bigly_size):
    orange_array = np.random.rand(bigly_size, bigly_size)  # big, very big!
    # timeit.repeat returns `repeat` measurements, each the *total* time of
    # `number` executions; we keep the median to smooth out noise
    column_sum = timeit.repeat('orange_array.sum(axis=0)',
                               repeat=9, number=10,
                               globals={'orange_array': orange_array})
    row_sum = timeit.repeat('orange_array.sum(axis=1)',
                            repeat=9, number=10,
                            globals={'orange_array': orange_array})
    return orange_array.nbytes, statistics.median(column_sum), statistics.median(row_sum)
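As a quick sanity check you can call it for a single size. Keep in mind that timeit.repeat reports the total time of number executions, so the values below are for 10 calls each (the absolute numbers will of course depend on your machine):

nbytes, col_time, row_time = time_bigly_array(1000)
print(nbytes / 1024**2, 'MB', col_time, row_time)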

Let's now compute these timings for a range of matrix sizes:

col_times, row_times = [], []
sizes = [100, 200, 500, 1000, 2000, 5000, 10000, 15000, 20000, 25000, 30000]  # the largest size allocates a ~7 GB matrix
byte_sizes = []
for size in sizes:
    byte_size, col_time, row_time = time_bigly_array(size)
    col_times.append(col_time)
    row_times.append(row_time)
    byte_sizes.append(byte_size)

Finally, let's plot these times for the row and column sums, together with their ratio and the memory the matrix occupies.

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
ax[0].plot(sizes, col_times, 'o-', label='Column sum')
ax[0].plot(sizes, row_times, 'o-', label='Row sum')
ax[0].set_xlabel('Matrix size')
ax[0].legend(loc='upper center')
ax[0].set_ylabel('Time for 10 calls (s)')
ax[0].set_title('Time to compute a sum on a square matrix')


ax[1].plot(sizes, [col/row for row, col in zip(row_times, col_times)], 'bo-')
ax[1].set_xlabel('Matrix size')
ax[1].set_ylabel('Ratio', color='b')

ax[1].set_title('Time fraction (col/row)')
ax12 = ax[1].twinx()
ax12.set_yscale('log')
ax12.plot(sizes, [x/(1024**2)  for x in byte_sizes], 'g--')
ax12.set_ylabel('Memory occupied (MB)', color='g')

The result: for matrices above 1000x1000 the column sum comes out between roughly 25% and 40% faster than the row sum (right panel).

What is happening here? The memory representation of the matrix is, of course, linear: by default numpy stores it in C (row-major) order, so all the elements of the first row come first, followed by the elements of the second row, and so on. When you sum along axis=0 (column sums), numpy can sweep through the array in exactly that storage order, adding each contiguous row into a small accumulator vector that stays in the cache. When you sum along axis=1 (row sums), each row has to be collapsed into a single value, a reduction pattern the CPU cannot vectorize nearly as well, so more of the time goes into shuffling data between RAM, cache and registers instead of adding.
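You can see the two layouts directly by flattening a small matrix in each order:

m = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
print(m.flags['C_CONTIGUOUS'])    # True: numpy defaults to row-major storage
print(m.ravel(order='C'))         # [0 1 2 3 4 5] -> rows are contiguous in memory
print(m.ravel(order='F'))         # [0 3 1 4 2 5] -> how column-major storage would read it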

Interestingly, C and Fortran use opposite representations: Fortran stores column 1, column 2, ..., column n contiguously (column-major order), whereas C stores row 1, row 2, ..., row n (row-major order). You can tell numpy to choose either a Fortran or a C layout (check the order parameter of functions like np.array and np.zeros). Now remember that the CPU-intensive parts of numeric libraries are implemented in either Fortran or C (and Fortran is more common than many people believe), so a good choice of ordering might actually give you a nice speedup.
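A quick way to play with this (just a sketch, not a careful benchmark) is to make a Fortran-ordered copy of the same data and repeat the timings; you would expect the fast and slow axes to swap:

fortran_array = np.asfortranarray(orange_array)  # same values, column-major layout
%timeit -n10 fortran_array.sum(axis=0)
%timeit -n10 fortran_array.sum(axis=1)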

Note that for small matrices (below 1000x1000) there is little advantage; this is because the whole matrix fits in the cache (I used a machine with 8 MB of L3 cache), and a 1000x1000 matrix of float64 is about 7.6 MB according to numpy.
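You can check that figure directly:

print(np.zeros((1000, 1000)).nbytes / 1024**2)  # 1000 * 1000 * 8 bytes, about 7.63 MB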

Now, this is a simple example used to illustrate that you can squeeze a substantial amount of extra performance out of your code just by being aware of the cache/RAM hierarchy. Your mileage will vary with your algorithm and your hardware. Above all, this is not an attempt at best practices (not at all!), just a motivator. If you take a look at your own code, I am pretty sure you can do better.

If you want to learn more, I suggest having a look at bcolz.