Linux Static Tracepoints

Linux has had dynamic tracing functionality for a long time in the form of kprobes. Kprobes provides a kernel API for placing probes on kernel instructions; the probes can be used directly from a kernel module, or via systemtap, which provides a high level scripting language on top of them. Dynamic tracing has a number of advantages: it has zero overhead when disabled, and probes can be placed on almost any instruction in the kernel, not just where a kernel developer thought to put one.

All this flexibility does have some downsides. An executed kprobe has a significant overhead since it uses breakpoints and exception handlers (having said that, there are patches that avoid the breakpoint and instead branch directly to the handler). Another issue is probe placement: kprobes are easily placed at function entry and exit, but if you need to probe inside a function or look at local variables then you really need systemtap and a kernel compiled with CONFIG_DEBUG_INFO. A static tracepoint, on the other hand, can be placed anywhere in a function and can be passed any important local variables. Various static tracepoint patches have been available for Linux, but as of 2.6.32 a complete implementation is in mainline.

Adding a static tracepoint is very simple; an example can be found here. In this case I am adding to an existing trace group (irq), so I only need the tracepoint definitions and the tracepoints themselves. An explanation of the 5 parts of a tracepoint definition can be found in linux/samples/trace_events/trace-events-sample.h. For more complicated scenarios, refer to the files in linux/samples/trace_events/.
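As a rough illustration (this is a sketch, not the exact code from that patch; the field names and layout are assumptions), the tasklet_entry definition in the irq trace header might look something like this, with each piece commented:

/*
 * Sketch only -- illustrative, not the exact patch. Definitions for
 * the irq group live in include/trace/events/irq.h.
 */
TRACE_EVENT(tasklet_entry,

        /* The prototype of the tracepoint call */
        TP_PROTO(struct tasklet_struct *t),

        /* The argument names, matching the prototype */
        TP_ARGS(t),

        /* The fields saved into the trace buffer */
        TP_STRUCT__entry(
                __field(void *, func)
        ),

        /* How the arguments fill in those fields */
        TP_fast_assign(
                __entry->func = t->func;
        ),

        /* How an entry is formatted in the trace output */
        TP_printk("func=%pf", __entry->func)
);

The tracepoint itself is then a single call placed in the code being traced, along the lines of trace_tasklet_entry(t); just before the tasklet function runs.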

Using static tracepoints

There are only a few steps to make use of static tracepoints. First ensure that debugfs is mounted. Most distros mount it on /sys/kernel/debug:

# mount | grep debugfs

debugfs on /sys/kernel/debug type debugfs (rw)
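If it isn’t already mounted, it can be mounted by hand:

# mount -t debugfs none /sys/kernel/debug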

A list of available tracepoints can be found in tracing/available_events:

# cat /sys/kernel/debug/tracing/available_events

skb:skb_copy_datagram_iovec
skb:kfree_skb
block:block_rq_remap
block:block_remap
block:block_split
block:block_unplug_io
block:block_unplug_timer
...

Since we added our tracepoints to the irq group, we can find them in tracing/events/irq:

# ls /sys/kernel/debug/tracing/events/irq/

enable  irq_handler_entry  softirq_entry  tasklet_entry
filter  irq_handler_exit   softirq_exit   tasklet_exit

Enable the tasklet tracepoints:

# echo 1 >  /sys/kernel/debug/tracing/events/irq/tasklet_entry/enable
# echo 1 >  /sys/kernel/debug/tracing/events/irq/tasklet_exit/enable

And the output is available in the trace buffer:

# cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
          <idle>-0     [000]   327.349213: tasklet_entry: func=.rpavscsi_task
          <idle>-0     [000]   327.349217: tasklet_exit: func=.rpavscsi_task

When finished, we can disable the tracepoints. There are enable files at all levels of the hierarchy, so we can disable all tracepoints in one go:

# echo 0 > /sys/kernel/debug/tracing/events/enable

Using static tracepoints in kernel modules

Kernel modules can also make use of static tracepoints. A simple module that hooks the tasklet_entry tracepoint and printks the function name of the tasklet might look like this (I’ve called it tracepoint-example.c):

#include <linux/module.h>
#include <trace/events/irq.h>

/* Probe called every time the tasklet_entry tracepoint fires */
static void probe_tasklet_entry(struct tasklet_struct *t)
{
        printk("tasklet_entry %pf\n", t->func);
}

static int __init trace_init(void)
{
        /* Attach our probe to the tasklet_entry tracepoint */
        WARN_ON(register_trace_tasklet_entry(probe_tasklet_entry));
        return 0;
}

static void __exit trace_exit(void)
{
        unregister_trace_tasklet_entry(probe_tasklet_entry);
        /* Make sure no CPU is still inside the probe before unload */
        tracepoint_synchronize_unregister();
}

module_init(trace_init);
module_exit(trace_exit);
MODULE_LICENSE("GPL");

If you are wondering, %pf is a printk formatter trick to pretty print a function name so you don’t have to go searching for the address in System.map.

Here is a Makefile to go with it:

obj-m := tracepoint-example.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
        $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
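Build and load the module in the usual way; the printk output then shows up in dmesg each time a tasklet runs:

# make
# insmod tracepoint-example.ko
# dmesg | tail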

Using Performance Counters for Linux

The 2.6.31 Linux kernel will add a new performance counter subsystem called Performance Counters for Linux (or perfcounters for short). To use perfcounters, build a kernel with:

CONFIG_PERF_COUNTERS=y

You will need elfutils and optionally binutils (for C++ function demangling). On Debian or Ubuntu:

apt-get install libelf-dev binutils-dev

The tools must be built 64-bit on a 64-bit kernel. If you have a mixed 64-bit kernel/32-bit userspace (like some amd64 and ppc64 distros), then build a 64-bit version of elfutils. I usually don’t bother building the optional 64-bit binutils in this case and just put up with mangled C++ names (hint: feed them into c++filt to demangle them). Now build the perf tool from the kernel source tree:

# cd tools/perf
# make

Now we can use the tools to debug a performance issue I was seeing in 2.6.31. A simple page fault microbenchmark was showing scalability issues when running multiple copies at once. When looking into performance issues in the kernel, perf top is a good place to start. It gives a constantly updating kernel profile:

# perf top

perf top output
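(The pagefault program itself isn’t listed here. As a rough idea, and not necessarily the exact test I ran, a microbenchmark of this sort just maps, touches and unmaps anonymous memory in a loop:)

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define SIZE (128UL * 1024 * 1024)

int main(void)
{
        unsigned long i;
        char *p;

        while (1) {
                p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        exit(1);

                /* touch one byte every 4kB so each page has to be
                 * faulted in */
                for (i = 0; i < SIZE; i += 4096)
                        p[i] = 0;

                munmap(p, SIZE);
        }
        return 0;
}

Running multiple copies of something like this at once is what shows up the scalability issue.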

We are spending over 70% of total time in _spin_lock, so we definitely have an issue that warrants further investigation. To get a more detailed view we can use the perf record tool. The -g option records backtraces, which allow us to look at the call graphs responsible for the performance issue:

# perf record -g ./pagefault

You can either let the profiled application run to completion or, since this microbenchmark runs forever, just wait 10 seconds and hit ctrl-c. Two more perf record options you will find useful are -p, to profile a running process, and -a, to profile the entire system.
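For example, to attach to an already running copy of the benchmark, or to profile everything on the system (the pid here is just a placeholder):

# perf record -g -p <pid>
# perf record -g -a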

Now we have a perf.data output file. I like to start with a high level summary of the recorded data first:

# perf report -g none

perf report

The perf report tool gives us some important information that perf top does not: it shows the task associated with each function, and it also profiles userspace.

Now that we have confirmed our trace has captured the _spin_lock issue, we can look at the call graph data to see which path is causing the problem:

# perf report

perf report call graph output

At this point it’s clear that the problem is a spinlock in the memory cgroup code. In order to keep accurate memory usage statistics, the current code uses a global spinlock. One way we can fix this is to use percpu_counters, which Balbir has been working on here.
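Balbir’s patch isn’t reproduced here, but as a sketch of the general idea (the names below are made up, not the real memory cgroup code), a percpu_counter lets the common update path stay on per-cpu data instead of bouncing a global lock:

#include <linux/percpu_counter.h>

/* Sketch only -- mem_usage, charge_pages etc are hypothetical names */
static struct percpu_counter mem_usage;

static int usage_counter_init(void)
{
        /* 2.6.31-era API: percpu_counter_init(counter, initial value) */
        return percpu_counter_init(&mem_usage, 0);
}

static void charge_pages(long nr_pages)
{
        /* Updates accumulate in a per-cpu counter and are only folded
         * into the shared total in batches, so the fast path does not
         * bounce a global lock between CPUs */
        percpu_counter_add(&mem_usage, nr_pages);
}

static s64 read_usage_pages(void)
{
        /* Approximate read; percpu_counter_sum() gives an exact total
         * at higher cost */
        return percpu_counter_read(&mem_usage);
}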

Booting Linux faster with parallel probing

On machines with large numbers of SCSI adapters and disks a significant amount of time can be spent probing for disks. By default Linux probes disks serially, but there are options to parallelise this.

The two phases of disk probe that can be parallelised are:

  1. Adapter probe. This is where each adapter is probed, reset and allowed to settle. This can be parallelised with the “scsi_mod.scan=async” boot option.
  2. Disk probe. This is where each disk behind an adapter is probed. The disk is spun up if it isn’t already spinning. This can be parallelised with the “fastboot” boot option (both options are combined in the example below).
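Both options can simply be appended to the existing kernel command line:

scsi_mod.scan=async fastboot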

To highlight the importance of parallelising both parts of disk probe, I ran three tests and measured the time it took to get to userspace. I used a POWER5 system with 4 SCSI controllers and 13 disks as a benchmark system. The disks were not spinning when Linux was booted.

  • Serial adapter and disk probe: 88 seconds
  • Parallel adapter probe, serial disk probe: 67 seconds
  • Parallel adapter and disk probe: 15 seconds

On this system, full parallelisation is over 5 times faster.