Last week, I ran a small experiment using OProfile to profile two pieces of code. In this experiment I learned the basics of using OProfile; in particular, I used it to profile memory cache behaviour and its effect on performance, and saw how memory cache invalidation can affect the speed of a running program. The details follow.
The Experiment Configuration
I ran the experiment on a ThinkPad T400 running Ubuntu. The experiment needs a multi-core processor because it profiles the cost of memory cache invalidation between cores; on a single-core processor there is no such cost.
$ sudo lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise
$ cat /proc/cpuinfo |grep "model name"
model name : Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz
model name : Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz
Source Code
There are two programs, “no_alignment” and “alignment”, both compiled with GNU GCC. Each program clones a child that shares the parent's memory address space, and the parent and the child update different fields of the same global data. The difference is that the “alignment” program aligns the fields of shared_data to the cache line size, so the two fields are fetched on different cache lines. Therefore, when the parent and the child run on different cores, the “no_alignment” program incurs cache invalidation costs between the two cores, while the “alignment” program does not.
The following is the source code of the two programs. The only difference is the definition of struct shared_data.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>

// shared data
struct shared_data {
    unsigned int num_proc1;
    unsigned int num_proc2;
};

struct shared_data shared;

// define loops to run for a while
int loop_i = 100000, loop_j = 100000;

int child(void *);

#define STACK_SIZE (8 * 1024)

int main(void)
{
    pid_t pid;
    int i, j;

    /* Stack */
    char *stack = (char *)malloc(STACK_SIZE);
    if (!stack) {
        perror("malloc");
        exit(1);
    }

    printf("main: shared %p %p\n", &shared.num_proc1, &shared.num_proc2);

    /* clone a thread sharing memory space with the parent process */
    if ((pid = clone(child, stack + STACK_SIZE, CLONE_VM, NULL)) < 0) {
        perror("clone");
        exit(1);
    }

    for (i = 0; i < loop_i; i++) {
        for (j = 0; j < loop_j; j++) {
            shared.num_proc1++;
        }
    }

    return 0;
}

int child(void *arg)
{
    int i, j;

    printf("child: shared %p %p\n", &shared.num_proc1, &shared.num_proc2);

    for (i = 0; i < loop_i; i++) {
        for (j = 0; j < loop_j; j++) {
            shared.num_proc2++;
        }
    }

    return 0;
}
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>

// cache line size
// hardware dependent value
// it can be checked from /proc/cpuinfo
#define CACHE_LINE_SIZE 64

// shared data aligned with cache line size
struct shared_data {
    unsigned int __attribute__((aligned(CACHE_LINE_SIZE))) num_proc1;
    unsigned int __attribute__((aligned(CACHE_LINE_SIZE))) num_proc2;
};

struct shared_data shared;

// define loops to run for a while
int loop_i = 100000, loop_j = 100000;

int child(void *);

#define STACK_SIZE (8 * 1024)

int main(void)
{
    pid_t pid;
    int i, j;

    /* Stack */
    char *stack = (char *)malloc(STACK_SIZE);
    if (!stack) {
        perror("malloc");
        exit(1);
    }

    printf("main: shared %p %p\n", &shared.num_proc1, &shared.num_proc2);

    /* clone a thread sharing memory space with the parent process */
    if ((pid = clone(child, stack + STACK_SIZE, CLONE_VM, NULL)) < 0) {
        perror("clone");
        exit(1);
    }

    for (i = 0; i < loop_i; i++) {
        for (j = 0; j < loop_j; j++) {
            shared.num_proc1++;
        }
    }

    return 0;
}

int child(void *arg)
{
    int i, j;

    printf("child: shared %p %p\n", &shared.num_proc1, &shared.num_proc2);

    for (i = 0; i < loop_i; i++) {
        for (j = 0; j < loop_j; j++) {
            shared.num_proc2++;
        }
    }

    return 0;
}
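Both programs compile with plain GCC, no special flags needed. As a sketch, assuming the sources are saved as no_alignment.c and alignment.c (file names I picked here for illustration):

$ gcc -o no_alignment no_alignment.c
$ gcc -o alignment alignment.c
$ getconf LEVEL1_DCACHE_LINESIZE
64

The getconf query is one way to confirm the 64-byte cache line size assumed by CACHE_LINE_SIZE; the “clflush size” field in /proc/cpuinfo reports the same value.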
Testing
I was using OProfile 0.9.9, whose ‘operf’ program allows non-root users to profile an individual process with little setup. Different CPUs support different events. The supported events, their meanings, and the event format accepted by operf can be checked with ophelp and in the CPU’s hardware manual. In this experiment, the clock event is CPU_CLK_UNHALTED, and L2_LINES_IN counts the lines allocated in L2, i.e. the number of L2 cache misses. As examples, “CPU_CLK_UNHALTED:100000” tells operf to take a sample every 100000 unhalted clock cycles with the default unit mask (unhalted core cycles). “L2_LINES_IN:100000:0xf0” tells operf to take a sample every 100000 allocated L2 lines, with the unit mask 0xf0 (all cores, all prefetch types included).
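Putting the two event specifications together, a profiling run might look like the following sketch (the event names are CPU-specific, so confirm them with ophelp first):

$ operf -e CPU_CLK_UNHALTED:100000,L2_LINES_IN:100000:0xf0 ./no_alignment
$ operf -e CPU_CLK_UNHALTED:100000,L2_LINES_IN:100000:0xf0 ./alignment
$ opreport --symbols

opreport then summarizes the collected samples per binary and symbol, which makes it straightforward to compare the L2_LINES_IN counts of the two programs.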