Exercise 4

Part 1

If you ever traced system calls with strace in Linux or truss in Solaris, you'll probably noticed that dynamic linker ld.so maps shared objects into memory before program is actually run:

# strace /bin/ls
...
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2107600, ...}) = 0
mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f28fc500000
mprotect(0x7f28fc6b6000, 2097152, PROT_NONE) = 0
mmap(0x7f28fc8b6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f28fc8b6000
mmap(0x7f28fc8bc000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f28fc8bc000
close(3)

Note that Linux uselib call may also being used for that.

Even if pages are already loaded into page cache, they might be mapped with different addresses or with different permissions (which is less likely for shared objects), so operating system need to create new memory segments for them. It also performs it lazily and only maps limited amount of pages. The rest are mapped on demand when minor fault occur. That is why they occur often when new processes are spawned.

We will use @count aggregation and vm.pagefault probe to count pagefaults in SystemTap for Linux. Path to file is represented by dentry structure $vma->vm_file->f_path->dentry –- we will use d_name function to access string representation of its name.

  Script file pfstat.stp

#!/usr/bin/stap

global pfs;

probe vm.pagefault {
    vm_file = "???";
    if($vma->vm_file != 0)
        vm_file = d_name($vma->vm_file->f_path->dentry);

    pfs[vm_file] <<< 1;
}

probe timer.s(1) {
    printf("%8s %s\n", "FAULTS", "VMFILE");
    foreach([vm_file] in pfs) {
        printf("%8u %s\n", @count(pfs[vm_file]), vm_file);
    }

    delete pfs;
}

Speaking of Solaris, we will need to intercept as_segat() calls. Segments that are related to files are handled by segvn segment driver, so we have to compare pointer to operations table to determine if segment is corresponding to that driver.

  Script file pfstat.d

#!/usr/sbin/dtrace -qCs

#define VNODE_NAME(vp)                                 \
    (vp) ? ((vp)->v_path)                              \
         ? stringof((vp)->v_path) : "???" : "[ anon ]"

#define IS_SEG_VN(s)  (((struct seg*) s)->s_ops == &`segvn_ops)

fbt::as_fault:entry {
    self->in_fault = 1;
}
fbt::as_fault:return {
    self->in_fault = 0;
}

fbt::as_segat:return
/self->in_fault && arg1 == 0/ {
    @faults["???"] = count();
}

fbt::as_segat:return
/self->in_fault && arg1 != 0 && IS_SEG_VN(arg1)/ {
    this->seg = (struct seg*) arg1;
    this->seg_vn = (segvn_data_t*) this->seg->s_data;
    
    @faults[VNODE_NAME(this->seg_vn->vp)] = count();
}

tick-1s {
    printf("%8s %s\n", "FAULTS", "VNODE");
    printa("%@8u %s\n", @faults);
    trunc(@faults);
}

When you run these scripts, you may see that most of faults are corresponding to libc library.

Part 2

As you can see from exercise text, you'll need to get allocator cache names, but they are not described in documentation to SLAB probes. To find them we will need to dive into Solaris and Linux kernel source. We seek for a name (which would be represented as a string), so we have to find allocator cache structure and fields of type char[] or char* in it.

Let's check prototype of kmem_cache_alloc() function which is mentioned in probe description. First argument of it is a cache structure which we seek for. As you can see from source, it is called kmem_cache_t:

void *
kmem_cache_alloc(kmem_cache_t *cp, int kmflag)
    (from usr/src/uts/common/os/kmem.c)

This type is a typedefed alias for a struct kmem_cache which is defined in usr/src/uts/common/sys/kmem_impl.h. When you check its definition, you may find cache_name field –- it is obvious that this field contains name of cache.

struct kmem_cache {
    [...]
    char        cache_name[KMEM_CACHE_NAMELEN + 1];
    [...]

Of course that is not the only way to find that field. If you know how to check cache statistics statically, you should know that is is done with ::kmastat command in mdb debugger. Looking at its source will reveal kmastat_cache() helper which prints statistics for a cache. Looking at its source may reveal accesses to cache_name field:

mdb_printf((dfp++)->fmt, cp->cache_name);
        (from usr/src/cmd/mdb/common/modules/genunix/genunix.c).

Considering our findings, we may define CACHE_NAME macro and use it in a DTrace script:

  Script file kmemstat.d

#!/usr/sbin/dtrace -qCs

#define CACHE_NAME(arg)     ((kmem_cache_t*) arg)->cache_name

fbt::kmem_cache_alloc:entry {
    @allocs[CACHE_NAME(arg0)] = count();
}

fbt::kmem_cache_free:entry {
    @frees[CACHE_NAME(arg0)] = count();
}

tick-1s {
    printf("%8s %8s %s\n", "ALLOCS", "FREES", "SLAB");
    printa("%@8u %@8u %s\n", @allocs, @frees);
    
    trunc(@allocs); trunc(@frees);
}

Seeking for a cache name in Linux is not that straightforward. First of all, we deal not with function boundary probes for SLAB caches, but with tapset aliases which do not provide pointers to a cache structures. Moreover, Linux has three implementations for kernel memory allocators: original SLAB, its improved version SLUB and a SLOB, compact allocator for embedded systems. SLOB is not generally used, so we will omit it.

vm.kmem_cache_alloc probe is defined in a vm tapset, so we will need to look into tapset directory /usr/share/systemtap/tapset/linux/ to find its definition. As you can see from it, it refers either tracepoint kmem_cache_alloc or kmem_cache_alloc() kernel function. Here are source code of it:

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
    void *ret = slab_alloc(cachep, flags, _RET_IP_);

    trace_kmem_cache_alloc(_RET_IP_, ret,
                   cachep->object_size, cachep->size, flags);

    return ret;
}
    (from mm/slab.c)

Note the trace_kmem_cache_alloc call which is actually represents invocation of the FTrace tracepoint. The same function is defined in SLUB allocator, but it uses s as a name of first functions argument, so we will have to use @choose_defined construct. Both allocators use same name for cache's structure: struct kmem_cache. The name is defined in that structure in files include/linux/slab_def.h and include/linux/slub_def.h:

struct kmem_cache {
    [...]
    const char *name;
    [...]

Like in Solaris, Linux shows cache statistics in /proc/slabinfo file. Diving into Kernel sources will reveal that cache_name functions is used to get cache's name and it accesses name field.

Considering all this, here are an implementation of SystemTap script:

  Script file kmemstat.stp

#!/usr/bin/stap

global allocs, frees;

probe kernel.function("kmem_cache_alloc"),
      kernel.function("kmem_cache_alloc_node") {
    cache = @choose_defined($s, $cachep);
    name = kernel_string(@cast(cache, "struct kmem_cache")->name);

    allocs[name] <<< 1;
}

probe kernel.function("kmem_cache_free")  {
    cache = @choose_defined($s, $cachep);
    name = kernel_string(@cast(cache, "struct kmem_cache")->name);

    frees[name] <<< 1;
}

probe timer.s(1) {
    printf("%8s %8s %s\n", "ALLOCS", "FREES", "SLAB");
    foreach([cache] in allocs) {
        printf("%8u %8u %s\n", @count(allocs[cache]), 
                @count(frees[cache]), cache);
    }
    
    delete allocs;
    delete frees;
}