
Profiling Rust applications on Linux

What’s measured gets optimized. So if we want high-performance Rust code, we need to measure it. Some time ago I wrote about benchmarks, and how Rust’s test crate makes it easy to set up a simple benchmark.
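
As a quick refresher, such a benchmark is just a function with the #[bench] attribute. A minimal sketch (requires a nightly compiler; the body of the closure is just a stand-in for whatever you want to measure):

#![feature(test)]
extern crate test;

#[bench]
fn bench_summing(b: &mut test::Bencher) {
    // Bencher::iter runs the closure repeatedly and times it; returning the
    // result keeps the optimizer from removing the work entirely.
    b.iter(|| (0u64..1000).fold(0, |acc, i| acc + i));
}

Compile with rustc --test -O and run the resulting binary with --bench to see the numbers.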

Now I’m going to tackle the second type of performance measurement: profiling. Whenever we want to optimize, we had better profile our code first to see where the time is actually spent.

Rust has no gprof support, but on Linux there are a number of options for profiling code based on the DWARF debugging information in a binary (plus the supplied source). While the valgrind-based tools (for our purposes, callgrind) run the program on a virtual CPU, oprofile reads the hardware performance counters via the kernel to get the actual numbers.

Aside: JMH (Java Microbenchmark Harness), which is poised to become a part of JDK9, uses the same trick with the -prof perfasm option.

So I needed some code to test the profilers on. I took the nbody benchmark from teXitoi’s repository, and the C++ #8 program from the benchmarksgame page to compare against. The benchmarksgame reports about 2× the C++ runtime for the Rust program.
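
For reference, the Rust version models each body as a plain struct of f64 fields (the same fields show up in the profiler listings below):

struct Planet {
    // position
    x: f64, y: f64, z: f64,
    // velocity
    vx: f64, vy: f64, vz: f64,
    mass: f64,
}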

Compiling and running the C++ program yielded:

$ /usr/bin/g++ -g -c -pipe -O3 -fomit-frame-pointer -march=native -mfpmath=sse \
  -msse3 --std=c++11 nbody.gpp-8.c++ -o nbody.gpp-8.c++.o && /usr/bin/g++ \
  nbody.gpp-8.c++.o -o nbody_gpp-8 -fopenmp
$ time ./nbody_gpp-8 50000000
-0.169075164
-0.169059907

real    0m8.417s
user    0m8.412s
sys     0m0.000s

So my PC is a bit faster than the benchmark server. Nice!

Next, let’s have a look at the Rust version. I’m using the nightly channel and compiling with -O:

$ rustc -O nbody_rs.rs
$ time ./nbody_rs 50000000
-0.169075164
-0.169059907

real    0m12.651s
user    0m12.648s
sys     0m0.000s

This is already much better than the 2× reported on the benchmarksgame page: more like 1.5× the C++ runtime. Still, there is a difference, and profiling the Rust code should show us where it likely comes from.

Using valgrind

Valgrind can be installed with your system’s package manager. Otherwise, just download the source package; a ./configure && make && make install should do the trick.

Valgrind uses the DWARF debugging information, so we need to recompile with -g:

$ rustc -O -g nbody_rs.rs

Using callgrind to collect the profile is simple: just prefix the program invocation with valgrind --tool=callgrind. This collects statistics that callgrind_annotate can process later:

$ valgrind --tool=callgrind ./nbody_rs 50000000
==11135== Callgrind, a call-graph generating cache profiler
==11135== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==11135== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11135== Command: ./nbody_rs 50000000
==11135== 
==11135== For interactive control, run 'callgrind_control -h'.
-0.169075164
-0.169059907
==11135== 
==11135== Events    : Ir
==11135== Collected : 29800561414
==11135== 
==11135== I   refs:      29,800,561,414

$ callgrind_annotate --auto=yes callgrind.out.11135
[...]
            .              for bj in b_slice.iter_mut() {
1,000,000,000                  let dx = bi.x - bj.x;
1,000,000,000                  let dy = bi.y - bj.y;
1,000,000,000                  let dz = bi.z - bj.z;
            .  
4,000,000,000                  let d2 = dx * dx + dy * dy + dz * dz;
2,000,000,002                  let mag = dt / (d2 * d2.sqrt());
            .  
  500,000,000                  let massj_mag = bj.mass * mag;
2,500,000,000                  bi.vx -= dx * massj_mag;
2,500,000,000                  bi.vy -= dy * massj_mag;
2,000,000,000                  bi.vz -= dz * massj_mag;
            .  
  500,000,000                  let massi_mag = bi.mass * mag;
1,500,000,000                  bj.vx += dx * massi_mag;
1,500,000,000                  bj.vy += dy * massi_mag;
2,000,000,000                  bj.vz += dz * massi_mag;
            .              }
1,000,000,000              bi.x += dt * bi.vx;
1,000,000,000              bi.y += dt * bi.vy;
  750,000,000              bi.z += dt * bi.vz;
            .          }
[...]

At least the values are all sufficiently high to be meaningful, but the cost model (plain instruction reads) seems overly simplistic. I’ll discuss this on reddit, so if someone knows how to get more usable values out of this, drop me a line.
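
One option that might yield more differentiated numbers (I have not tried it on this example): callgrind can additionally simulate the cache hierarchy and the branch predictor, which adds cache-miss and branch-miss events to the plain instruction-read counts shown above:

$ valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./nbody_rs 50000000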

Since this did not show us too much (besides the predictable fact that the inner loop of our advance() function is the hottest code), let’s have a look at oprofile.

Using oprofile

To install oprofile, I simply typed:

$ sudo apt-get install linux-tools-generic oprofile

Your system may have similar options. If you cannot use a package manager, the oprofile manual has an installation guide that tells you how to install from source.

Now let’s have a look at where our nbody_rs binary spends its CPU time. First we call operf to collect the profile, then we can use opannotate to show us an annotated source listing.

Note that like callgrind_annotate, opannotate uses the DWARF debugging symbols and relies on the source code being available. Luckily we have already compiled with -g.

$ operf ./nbody_rs 50000000
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Profiler started
-0.169075164
-0.169059907

Profiling done.
$ opannotate --source
[... some inconsequential errors and source ...]
              :fn advance(bodies: &mut [Planet;N_BODIES], dt: f64, steps: i32) {
               :    for _ in (0..steps) {
               :        let mut b_slice: &mut [_] = bodies;
               :        loop {
               :            let bi = match shift_mut_ref(&mut b_slice) {
               :                Some(bi) => bi,
               :                None => break
               :            };
               :            for bj in b_slice.iter_mut() {
  4352  1.2693 :                let dx = bi.x - bj.x;
  1425  0.4156 :                let dy = bi.y - bj.y;
  3348  0.9765 :                let dz = bi.z - bj.z;
               :
 20010  5.8361 :                let d2 = dx * dx + dy * dy + dz * dz;
146801 42.8157 :                let mag = dt / (d2 * d2.sqrt());
               :
  3171  0.9248 :                let massj_mag = bj.mass * mag;
 53290 15.5425 :                bi.vx -= dx * massj_mag;
  7835  2.2851 :                bi.vy -= dy * massj_mag;
  5508  1.6065 :                bi.vz -= dz * massj_mag;
               :
   886  0.2584 :                let massi_mag = bi.mass * mag;
  6670  1.9454 :                bj.vx += dx * massi_mag;
  3720  1.0850 :                bj.vy += dy * massi_mag;
  5695  1.6610 :                bj.vz += dz * massi_mag;
               :            }
 36886 10.7581 :            bi.x += dt * bi.vx;
  7109  2.0734 :            bi.y += dt * bi.vy;
               :            bi.z += dt * bi.vz;
               :        }
               :    }

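The shift_mut_ref helper at the top of the loop is what allows holding a mutable reference to one body while iterating mutably over the remaining ones. From memory, it looks roughly like this in teXitoi’s code (treat the exact details as a sketch):

// Pops a mutable reference to the first element off the slice and shrinks
// the slice view to the remaining tail, so both can be borrowed mutably.
fn shift_mut_ref<'a, T>(r: &mut &'a mut [T]) -> Option<&'a mut T> {
    if r.is_empty() {
        return None;
    }
    // Move the slice out of the reference so we can split it by value.
    let tmp = std::mem::replace(r, &mut []);
    let (head, tail) = tmp.split_at_mut(1);
    *r = tail;
    Some(&mut head[0])
}
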
advance() is clearly the hottest function, and we can see that the line with the division and the square root takes more than 40% of the runtime. If we add --assembly to the opannotate call, we get the following:

[... more inconsequential stuff ...]
               :                let d2 = dx * dx + dy * dy + dz * dz;
   403  0.1175 :    58ec:       movapd %xmm3,%xmm4
               :    58f0:       mulsd  %xmm4,%xmm4
  5532  1.6135 :    58f4:       movapd %xmm1,%xmm5
   896  0.2613 :    58f8:       mulsd  %xmm5,%xmm5
  1992  0.5810 :    58fc:       addsd  %xmm4,%xmm5
  6752  1.9693 :    5900:       movapd %xmm2,%xmm4
  4435  1.2935 :    5904:       mulsd  %xmm4,%xmm4
               :    5908:       addsd  %xmm5,%xmm4
  9860  2.8758 :    590c:       xorps  %xmm5,%xmm5
               :    590f:       sqrtsd %xmm4,%xmm5
               :                let mag = dt / (d2 * d2.sqrt());
 54280 15.8312 :    5913:       mulsd  %xmm4,%xmm5
 14744  4.3002 :    5917:       movapd %xmm0,%xmm4
               :    591b:       divsd  %xmm5,%xmm4
 77777 22.6843 :    591f:       movsd  0x30(%rsi),%xmm5

Strangely, the sqrtsd is reported before its source line, but let’s not get hung up on that detail. Note also that the movsd instruction, which loads the next value from memory, is charged with most of the time; sampling profilers often attribute cost to an instruction shortly after the real culprit, in this case most likely the long-latency divsd right before it.

By the way, the C++ version cheats a bit: it uses single precision for the square root, via the rsqrtps instruction (reached through compiler intrinsics), which approximates an inverse square root, and then refines that approximation with two Newton-Raphson steps. Well, the printed result is the same, so I’ll let it slide. Here is the relevant section from the opannotate output:

               :        dsquared = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2];
               :  400b9b:       mulpd  %xmm2,%xmm2
  1624  0.7119 :  400b9f:       movhpd -0x18(%rax),%xmm1
               :  400ba4:       mulpd  %xmm1,%xmm1
  1054  0.4620 :  400ba8:       movhpd -0x10(%rax),%xmm0
   882  0.3866 :  400bad:       mulpd  %xmm0,%xmm0
  1159  0.5081 :  400bb1:       addpd  %xmm2,%xmm1
  3263  1.4304 :  400bb5:       addpd  %xmm1,%xmm0
               :        distance = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(dsquared)));
               :
               :        for (m=0; m<2; ++m)
               :          distance = distance * _mm_set1_pd(1.5)
               :            - ((_mm_set1_pd(0.5) * dsquared) * distance)
  8152  3.5736 :  400bb9:       movapd %xmm0,%xmm2
               :}
               :
               :extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
               :_mm_cvtpd_ps (__m128d __A)
               :{
               :  return (__m128)__builtin_ia32_cvtpd2ps ((__v2df) __A);
  1666  0.7303 :  400bbd:       cvtpd2ps %xmm0,%xmm3
               :}
               :
               :extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
               :_mm_rsqrt_ps (__m128 __A)
               :{
               :  return (__m128) __builtin_ia32_rsqrtps ((__v4sf)__A);
  2003  0.8781 :  400bc1:       rsqrtps %xmm3,%xmm3
  4126  1.8087 :  400bc4:       mulpd  %xmm7,%xmm2
               :}
               :
               :extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
               :_mm_cvtps_pd (__m128 __A)
               :{
               :  return (__m128d)__builtin_ia32_cvtps2pd ((__v4sf) __A);
  1464  0.6418 :  400bc8:       cvtps2pd %xmm3,%xmm3
  2036  0.8925 :  400bcb:       movapd %xmm3,%xmm8
               :
               :}
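
Translated into scalar Rust, that trick would look something like the following (my own sketch for illustration; neither benchmark program contains this exact function):

// Approximate 1/sqrt(d2): take a cheap single-precision starting guess (the
// role rsqrtps plays in the C++ version), then apply two Newton-Raphson
// steps, x' = x * (1.5 - 0.5 * d2 * x * x), to get near double precision.
fn approx_inv_sqrt(d2: f64) -> f64 {
    let mut x = (1.0 / (d2 as f32).sqrt()) as f64;
    for _ in 0..2 {
        x = x * (1.5 - 0.5 * d2 * x * x);
    }
    x
}

With x = approx_inv_sqrt(d2), the expression dt / (d2 * d2.sqrt()) becomes dt * x * x * x (since 1/(d2 * sqrt(d2)) is the cube of 1/sqrt(d2)), trading the slow divsd and sqrtsd for a handful of multiplications.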

There are other metrics to look for, and oprofile has the ophelp program that will show all performance counters you can collect. However, those are outside the scope of this post.
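
Just as a pointer: you would pick an event name from ophelp’s output (the available events differ between CPU models) and hand it to operf. The event name and count below are purely illustrative:

$ operf --events=CPU_CLK_UNHALTED:100000 ./nbody_rs 50000000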

For now you have a good way to CPU-profile your program with very little overhead and with the option of collecting other very useful statistics.

Bonus: there is another, newer way to profile applications on Linux: perf. It looks and works very much like oprofile (and in my test produced very similar results), but requires root access. On the plus side, you get an ncurses- or GTK-based viewer for the annotated source, which is quite nice.
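
If you want to try it, the basic workflow mirrors operf (assuming perf is installed, e.g. from your distribution’s linux-tools package): perf record writes its samples to a perf.data file, and perf report opens the interactive viewer on it.

$ sudo perf record ./nbody_rs 50000000
$ sudo perf report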

perf was written at a time when oprofile still used an interrupt-based collection scheme. Since then, oprofile has been updated to use the same performance-counter collection mechanism as perf, so these days there is little difference between the two.