Module 3: Principles of dynamic tracing

Visualization

Reading trace files is exhausting, so the most popular scenario for the post-processing is visualization. There are multiple standard ways to do that:

Use GNU Plot as shown here: System utilization graphing with Gnuplot. Tracing script directly generates commands which are passed to GNU Plot.
DTrace Chime plugin for the NetBeans
SystemTap GUI
Writing you own visualization script. For example, following examples were generated using Python library matplotlib.

Which types diagrams are mostly useful? Let's find out.

Linear diagram

The simplest one is a linear diagram. X axis in that diagram is the time, so it allows to see changes in system's behaviour over time. These diagrams may be combined together (but the time axis should be same on all plots), which allows to reveal correlations between characteristics, as shown on following image:

image:linear.png

These three characteristics are names of the probes from vminfo provider in DTrace: zfod stands for zero-filled on-demand which is page allocation, while pgpgin and pgpgout are events related to reading/writing pages to a backing store, such as disk swap partition. In this case, memeat process (which name is self-explanatory –- it allocates all available RAM) allocates plenty of memory, so number of zfod events is high, causing to some pages being read or written to a disk swap.

Tracing a scheduler

Now let's run following loop in a shell which periodically eats lots of CPU than sleeps for 5 seconds:

# while : 
    do
        for I in {0..4000}
        do  
            echo '1' > /dev/null
        done
        sleep 5
    done

And gather scheduler trace: each time it dispatches a new process we will trace process name, cpu and timestamp:

# dtrace -n '
    sched:::on-cpu { 
        printf("%d %d %s\n", timestamp, 
        cpu, (curthread == 
                curthread->t_cpu->cpu_idle_thread)
                ? "idle" 
                : execname ); }'

Histograms

In many cases of performance analysis we rely on average values, which are not very representative.

Information

Consider the following example: you apply to a job in some company which has 100 employees and average salary is about 30k roubles. This data can have many interpretations:

CEO salary Senior staff salary Junior staff salary

100k roubles 48k roubles 25k roubles

2 million roubles 23k roubles 7k roubles

Would you want to work there if salary is distributed according row? Doubtful. Like with employment, you cannot rely on average readings in performance analysis: average latency 10ms doesn't mean that all users are satisfied –- some of them may had to wait seconds for web-page to render.

If we calculate per-process difference between scheduler timestamps and build a logarithmic histogram plot, we'll see several requests which lasts for seconds:

image:density

Y axis is logarithmic and represents a number of observed intervals when CPU was busy for time period shown on X axis. If we normalize this characteristic, we will get probability density function.

Warning

Aggregations quantize()/hist_linear() and lquantize()/hist_log() might do the same, but in text terminal.

Heat maps

When two axes is not enough for your graph, you may also use a colour intensity of each pixel too. Let's see, how CPU usage is distributed across CPUs. To do so we need pick a step for the observation interval, say T=100ms, accumulate all intervals when non-idle thread were on that CPU, say t, than pixel's intensity will be 1.0 - t/T so 1.0 (white) will say that CPU was idle all the time, while 0.0 (black) will be evidence that CPU is very busy. For our example, we will see, that CPU 4 is periodically runs CPU-bound tasks:

image:heatmap

Gantt charts

Generally speaking, gantt charts help to understand state of the system across the timeline. For example they are helpful in planning projects: what job needs to be done by whom and when, so the jobs are placed on X axis while Y axis is a timeline, and color is used to distinguish teams responsible for jobs. In our case we may be interested in how load distributed across CPUs, and what's causing it, so we here are a gantt chart:

image:oncpu

We added process name to the longest bars, and it seems that bash process causing trouble. We could discover it before, adding tags on histogram.

CEO salary	Senior staff salary	Junior staff salary
100k roubles	48k roubles	25k roubles
2 million roubles	23k roubles	7k roubles