Python Numba, Fortran, and CUDA in scientific computing

Motivation

It is well known that Python is slow and Fortran is fast, yet Python remains tempting because it is so easy to write and debug. With recently developed libraries such as Numba, however, it is now possible to write fast Python code. In addition, Numba supports CUDA, which lets us write GPU-based code easily. As its official website claims, Numba-compiled code can approach the speeds of C or Fortran. This raises two questions: how fast is Numba, really? And what are the pros and cons of Numba-based computing?

Searching for this topic turns up few comparisons. For example, Charles Jekel's blog benchmarks a simple vector-addition operation and shows a 2 to 3 times difference. As for GPU-based code, Lena Oden's paper compares the performance of Numba-CUDA and C-CUDA, reporting that Numba achieves 75% of the performance of C-CUDA for small problems and 86% for large problems.

There is little doubt that even with Numba's acceleration, Fortran is still faster. But how much faster is it compared with Numba? We cannot find the answer simply by looking at others' results, as this is an algorithm-dependent question. In the long run, one needs to know the pros and cons of each language, and of CPU-based versus GPU-based computing, for the specific algorithm one uses. Therefore, I spent some time implementing the same algorithm (VRMC, see the previous post) in both Python and Fortran, and tested its performance in both CPU-based and GPU-based scenarios.

Baseline

Using the same algorithm, we perform the same calculation of phonon transport in a 200 nm-thick silicon nanofilm over a 300 ps time scale.

But the benchmark is tricky: Fortran naturally supports both MPI and OpenMP, while Python does not, at least not seamlessly under Numba acceleration. For parallel computing, the only approach I can think of is to call the Numba-compiled function from a main function: the main function handles the parallelism with MPI, while the Numba-compiled function itself stays serial. For the Monte Carlo sampling, the only easy strategy is to divide the total number of samples by the number of MPI processes, so that each process draws the same number of samples. This is not ideal, as some processes may finish earlier than others. OpenMP in Fortran, by contrast, avoids this problem by using a shared variable to count the total number of samples.
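The static splitting described above can be sketched in plain Python. This is only an illustration of the partitioning logic, not the actual benchmark code: the function name is mine, and in a real run `rank` and `size` would come from mpi4py's `COMM_WORLD`, with each rank's share passed to the Numba-compiled kernel.

```python
def samples_for_rank(total_samples, rank, size):
    """Statically split `total_samples` among `size` MPI ranks.

    The first `total_samples % size` ranks take one extra sample so the
    per-rank counts sum exactly to the total. Each rank then works through
    its fixed share independently, which is exactly why a slow rank can
    leave the others idle at the end.
    """
    base, remainder = divmod(total_samples, size)
    return base + (1 if rank < remainder else 0)

# Example: splitting 10**8 samples over 24 ranks
counts = [samples_for_rank(10**8, r, 24) for r in range(24)]
assert sum(counts) == 10**8
```

The OpenMP version sidesteps this imbalance because every thread keeps pulling work from the shared counter until the total is reached, so no thread sits idle while another finishes its fixed quota.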

It is not fair to compare the performance of CPU Numba with GPU Numba either, since the philosophy of CUDA is somewhat similar to that of OpenMP. Therefore, a series of tests is performed to compare all the combinations of CPU and GPU, Python and Fortran:

  • MPI Python (CPU) vs. MPI Fortran (CPU)
  • MPI Fortran (CPU) vs. OpenMP Fortran (CPU)
  • OpenMP Fortran (CPU) vs. Numba CUDA Python (GPU) vs. Nvfortran (GPU)

Complicated as this is, the tests tell us several things: (1) how fast Numba is compared with Fortran, (2) how much time OpenMP saves compared with MPI in this VRMC case, (3) how fast GPU computing is, and (4) how fast Nvfortran is compared with Numba-CUDA. To connect all the tests, we use OpenMP Fortran as the baseline.

Results

For \(10^8\) samples, the results of CPU-based computing are shown as follows:

The GPU-based calculation is too fast to be shown in the same figure. We use \(10^9\) samples, rather than \(10^8\), to exclude the influence of reading data, launching the kernel, etc. We fix the threads per block at 128 and increase the number of blocks until the performance saturates. The results are:
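The per-thread workload implied by this launch configuration can be computed as a quick sanity check. This is a sketch with names of my own choosing, not the benchmark script; it only shows the arithmetic behind the block-count sweep.

```python
import math

THREADS_PER_BLOCK = 128  # fixed across all GPU runs

def samples_per_thread(total_samples, blocks):
    """Samples each GPU thread must draw for a given block count.

    Rounding up guarantees the grid covers all samples; saturation is
    reached once adding blocks no longer reduces the wall time.
    """
    total_threads = blocks * THREADS_PER_BLOCK
    return math.ceil(total_samples / total_threads)

# Sweeping the block count for 10**9 samples:
for blocks in (64, 256, 1024, 4096):
    print(blocks, samples_per_thread(10**9, blocks))
```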

Discussion

It can be concluded from the plots that:

  • The speed of Fortran is less than 1.5 times that of Numba for CPU-based computing.
  • OpenMP is faster for large-scale computing, but performs the same for small-scale computing. Of course, as the number of samples grows, the time saved by OpenMP may become more significant.

It is also obvious that GPU-based computing is far faster than CPU-based computing. We use the OpenMP Fortran results as the baseline because their slope \(\log(t)/\log(N)\approx -0.95\) is close to -1, meaning the parallelism is almost perfect. Using the fitted curve \(\log(t) = A + B\log(N)\), in which \(A = 6.69, B = -0.95\), we can estimate the number of CPU cores equivalent to the GPU. The results are:
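The equivalent-core estimate follows from inverting the fitted line: find the core count \(N\) at which the CPU curve would predict the measured GPU wall time. A minimal sketch, where only \(A\) and \(B\) come from the fit above and the timing value in the sanity check is constructed to lie exactly on the fitted line:

```python
import math

A, B = 6.69, -0.95  # fitted log(t) = A + B*log(N) for OpenMP Fortran

def equivalent_cores(t_gpu):
    """Invert the fit: the CPU core count whose predicted wall time
    equals the measured GPU wall time t_gpu."""
    return math.exp((math.log(t_gpu) - A) / B)

# Sanity check: a time lying exactly on the fitted line maps back to N
t = math.exp(A + B * math.log(578))
print(round(equivalent_cores(t)))  # → 578
```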

  • Nvfortran is roughly 2.54 times faster than Numba CUDA.
  • A single V100 GPU is almost equivalent to 578 CPU cores.

Please note that the above comparison is only valid for the specific VRMC algorithm, and neither the CPU nor the GPU we used is the latest model. Still, it gives at least a basic sense of the magnitudes of the speeds involved.