bcc: execsnoop performance (unfinished) - CFANZ Programming Community

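The bcc programs used here consist of two parts: one written in Python and one written in C. The Python part loads the BPF program, operates on its maps, and processes the resulting data. The C part is compiled to BPF bytecode by the LLVM compiler and, once the BPF verifier has confirmed that it is safe, is loaded into the kernel and executed. Unfamiliar functions that appear in the Python or C code can be looked up in the two references below.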

Python functions: Python link

C functions: link

bcc installation: bcc_install

bcc program book： book url

https://www.ebpf.top/post/bpf_learn_path/
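The Python/C split described above can be sketched with a minimal program. This is only an illustrative sketch, not part of execsnoop: the names `BPF_SOURCE`, `hello`, and `run_hello` are made up for this example, and running it assumes bcc is installed and the script is run as root.

```python
# The C part: kept as a Python string and compiled to BPF bytecode by bcc (via LLVM).
# A raw string keeps the backslash in "\n" so the C compiler sees the escape sequence.
BPF_SOURCE = r"""
int hello(void *ctx) {
    bpf_trace_printk("execve called\n");
    return 0;
}
"""


def run_hello():
    """The Python part: load the program, attach it, and read its output."""
    # Imported lazily so the module can be inspected without bcc installed.
    from bcc import BPF

    b = BPF(text=BPF_SOURCE)                  # compile, verify, and load
    event = b.get_syscall_fnname("execve")    # resolve the kernel symbol name
    b.attach_kprobe(event=event, fn_name="hello")
    b.trace_print()                           # print trace lines until Ctrl-C
```

Calling `run_hello()` as root should print one trace line for every execve() on the system until it is interrupted.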

What is bcc

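The bcc open-source project: https://github.com/iovisor/bcc
The eBPF virtual machine uses assembly-like instructions, which are very difficult to program in directly. bcc provides a Python library, also named bcc, that simplifies eBPF application development.
bcc also collects a large number of ready-made eBPF programs that can be used as-is; the tool overview diagram below gives a sense of their coverage.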

[Figure: overview diagram of the bcc tools]

https://github.com/brendangregg/perf-tools/blob/master/execsnoop

The ftrace-based execsnoop from perf-tools is implemented as follows:

#!/bin/bash
#
# execsnoop - trace process exec() with arguments.
#             Written using Linux ftrace.
#
# This shows the execution of new processes, especially short-lived ones that
# can be missed by sampling tools such as top(1).
#
# USAGE: ./execsnoop [-hrt] [-a argc] [-d secs] [name]
#
# REQUIREMENTS: FTRACE and KPROBE CONFIG, sched:sched_process_fork tracepoint,
# and either the sys_execve, stub_execve or do_execve kernel function. You may
# already have these on recent kernels. And awk.
#
# This traces exec() from the fork()->exec() sequence, which means it won't
# catch new processes that only fork(). With the -r option, it will also catch
# processes that re-exec. It makes a best-effort attempt to retrieve the program
# arguments and PPID; if these are unavailable, 0 and "[?]" are printed
# respectively. There is also a limit to the number of arguments printed (by
# default, 8), which can be increased using -a.
#
# This implementation is designed to work on older kernel versions, and without
# kernel debuginfo. It works by dynamic tracing an execve kernel function to
# read the arguments from the %si register. The sys_execve function is tried
# first, then stub_execve and do_execve. The sched:sched_process_fork
# tracepoint is used to get the PPID. This program is a workaround that should be
# improved in the future when other kernel capabilities are made available. If
# you need a more reliable tool now, then consider other tracing alternatives
# (eg, SystemTap). This tool is really a proof of concept to see what ftrace can
# currently do.
#
# From perf-tools: https://github.com/brendangregg/perf-tools
#
# See the execsnoop(8) man page (in perf-tools) for more info.
#
# COPYRIGHT: Copyright (c) 2014 Brendan Gregg.
#
#  This program is free software; you can redistribute it and/or
#  modify it under the terms of the GNU General Public License
#  as published by the Free Software Foundation; either version 2
#  of the License, or (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#  along with this program; if not, write to the Free Software Foundation,
#  Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
#
#  (http://www.gnu.org/copyleft/gpl.html)
#
# 07-Jul-2014    Brendan Gregg    Created this.

### default variables
tracing=/sys/kernel/debug/tracing
flock=/var/tmp/.ftrace-lock; wroteflock=0
opt_duration=0; duration=; opt_name=0; name=; opt_time=0; opt_reexec=0
opt_argc=0; argc=8; max_argc=16; ftext=
trap ':' INT QUIT TERM PIPE HUP    # sends execution to end tracing section

function usage {
    cat <<-END >&2
    USAGE: execsnoop [-hrt] [-a argc] [-d secs] [name]
                     -d seconds      # trace duration, and use buffers
                     -a argc         # max args to show (default 8)
                     -r              # include re-execs
                     -t              # include time (seconds)
                     -h              # this usage message
                     name            # process name to match (REs allowed)
      eg,
           execsnoop                 # watch exec()s live (unbuffered)
           execsnoop -d 1            # trace 1 sec (buffered)
           execsnoop grep            # trace process names containing grep
           execsnoop 'udevd$'        # process names ending in "udevd"

    See the man page and example file for more info.
END
    exit
}

function warn {
    if ! eval "$@"; then
        echo >&2 "WARNING: command failed \"$@\""
    fi
}

function end {
    # disable tracing
    echo 2>/dev/null
    echo "Ending tracing..." 2>/dev/null
    cd $tracing
    warn "echo 0 > events/kprobes/$kname/enable"
    warn "echo 0 > events/sched/sched_process_fork/enable"
    warn "echo -:$kname >> kprobe_events"
    warn "echo > trace"
    (( wroteflock )) && warn "rm $flock"
}

function die {
    echo >&2 "$@"
    exit 1
}

function edie {
    # die with a quiet end()
    echo >&2 "$@"
    exec >/dev/null 2>&1
    end
    exit 1
}

### process options
while getopts a:d:hrt opt
do
    case $opt in
    a)    opt_argc=1; argc=$OPTARG ;;
    d)    opt_duration=1; duration=$OPTARG ;;
    r)    opt_reexec=1 ;;
    t)    opt_time=1 ;;
    h|?)    usage ;;
    esac
done
shift $(( $OPTIND - 1 ))
if (( $# )); then
    opt_name=1
    name=$1
    shift
fi
(( $# )) && usage

### option logic
(( opt_pid && opt_name )) && die "ERROR: use either -p or -n."
(( opt_pid )) && ftext=" issued by PID $pid"
(( opt_name )) && ftext=" issued by process name \"$name\""
(( opt_file )) && ftext="$ftext for filenames containing \"$file\""
(( opt_argc && argc > max_argc )) && die "ERROR: max -a argc is $max_argc."
if (( opt_duration )); then
    echo "Tracing exec()s$ftext for $duration seconds (buffered)..."
else
    echo "Tracing exec()s$ftext. Ctrl-C to end."
fi

### select awk
if (( opt_duration )); then
    [[ -x /usr/bin/mawk ]] && awk=mawk || awk=awk
else
    # workarounds for mawk/gawk fflush behavior
    if [[ -x /usr/bin/gawk ]]; then
        awk=gawk
    elif [[ -x /usr/bin/mawk ]]; then
        awk="mawk -W interactive"
    else
        awk=awk
    fi
fi

### check permissions
cd $tracing || die "ERROR: accessing tracing. Root user? Kernel has FTRACE?
    debugfs mounted? (mount -t debugfs debugfs /sys/kernel/debug)"

### ftrace lock
[[ -e $flock ]] && die "ERROR: ftrace may be in use by PID $(cat $flock) $flock"
echo $$ > $flock || die "ERROR: unable to write $flock."
wroteflock=1

### build probe
if [[ -x /usr/bin/getconf ]]; then
    bits=$(getconf LONG_BIT)
else
    bits=64
    [[ $(uname -m) == i* ]] && bits=32
fi
(( offset = bits / 8 ))
function makeprobe {
    func=$1
    kname=execsnoop_$func
    kprobe="p:$kname $func"
    i=0
    while (( i < argc + 1 )); do
        # p:kname do_execve +0(+0(%si)):string +0(+8(%si)):string ...
        kprobe="$kprobe +0(+$(( i * offset ))(%si)):string"
        (( i++ ))
    done
}
# try in this order: sys_execve, stub_execve, do_execve
makeprobe sys_execve

### setup and begin tracing
echo nop > current_tracer
if ! echo $kprobe >> kprobe_events 2>/dev/null; then
    makeprobe stub_execve
    if ! echo $kprobe >> kprobe_events 2>/dev/null; then
        makeprobe do_execve
        if ! echo $kprobe >> kprobe_events 2>/dev/null; then
            edie "ERROR: adding a kprobe for execve. Exiting."
        fi
    fi
fi
if ! echo 1 > events/kprobes/$kname/enable; then
    edie "ERROR: enabling kprobe for execve. Exiting."
fi
if ! echo 1 > events/sched/sched_process_fork/enable; then
    edie "ERROR: enabling sched:sched_process_fork tracepoint. Exiting."
fi
echo "Instrumenting $func"
(( opt_time )) && printf "%-16s " "TIMEs"
printf "%6s %6s %s\n" "PID" "PPID" "ARGS"

#
# Determine output format. It may be one of the following (newest first):
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
# To differentiate between them, the number of header fields is counted,
# and an offset set, to skip the extra column when needed.
#
offset=$($awk 'BEGIN { o = 0; }
    $1 == "#" && $2 ~ /TASK/ && NF == 6 { o = 1; }
    $2 ~ /TASK/ { print o; exit }' trace)

### print trace buffer
warn "echo > trace"
( if (( opt_duration )); then
    # wait then dump buffer
    sleep $duration
    cat -v trace
else
    # print buffer live
    cat -v trace_pipe
fi ) | $awk -v o=$offset -v opt_name=$opt_name -v name=$name \
    -v opt_duration=$opt_duration -v opt_time=$opt_time -v kname=$kname \
    -v opt_reexec=$opt_reexec '
    # common fields
    $1 != "#" {
        # task name can contain dashes
        comm = pid = $1
        sub(/-[0-9][0-9]*/, "", comm)
        sub(/.*-/, "", pid)
    }

    $1 != "#" && $(4+o) ~ /sched_process_fork/ {
        cpid=$0
        sub(/.* child_pid=/, "", cpid)
        sub(/ .*/, "", cpid)
        getppid[cpid] = pid
        delete seen[pid]
    }

    $1 != "#" && $(4+o) ~ kname {
        if (seen[pid])
            next
        if (opt_name && comm !~ name)
            next

        #
        # examples:
        # ... arg1="/bin/echo" arg2="1" arg3="2" arg4="3" ...
        # ... arg1="sleep" arg2="2" arg3=(fault) arg4="" ...
        # ... arg1="" arg2=(fault) arg3="" arg4="" ...
        # the last example is uncommon, and may be a race.
        #
        if ($0 ~ /arg1=""/) {
            args = comm " [?]"
        } else {
            args=$0
            sub(/ arg[0-9]*=\(fault\).*/, "", args)
            sub(/.*arg1="/, "", args)
            gsub(/" arg[0-9]*="/, " ", args)
            sub(/"$/, "", args)
            if ($0 !~ /\(fault\)/)
                args = args " [...]"
        }

        if (opt_time) {
            time = $(3+o); sub(":", "", time)
            printf "%-16s ", time
        }
        printf "%6s %6d %s\n", pid, getppid[pid], args
        if (!opt_duration)
            fflush()
        if (!opt_reexec) {
            seen[pid] = 1
            delete getppid[pid]
        }
    }

    $0 ~ /LOST.*EVENT[S]/ { print "WARNING: " $0 > "/dev/stderr" }
'

### end tracing
end
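
The way the script's makeprobe function builds the kprobe definition can be easier to follow outside of shell. The following is a rough Python sketch of the same string construction; `make_probe` and its parameters are names chosen for this illustration, and it only builds the string rather than writing it to kprobe_events.

```python
def make_probe(func, argc=8, bits=64):
    """Build a kprobe_events definition the way the shell makeprobe() does."""
    offset = bits // 8              # pointer size in bytes (8 on 64-bit)
    kname = "execsnoop_" + func     # name of the dynamic probe
    fields = ["p:" + kname, func]
    # One fetch argument per argv slot, argc + 1 in total as in the shell loop.
    # Each one dereferences argv[i] via the %si register and reads a string.
    for i in range(argc + 1):
        fields.append("+0(+%d(%%si)):string" % (i * offset))
    return kname, " ".join(fields)


kname, kprobe = make_probe("do_execve", argc=1)
print(kprobe)
# p:execsnoop_do_execve do_execve +0(+0(%si)):string +0(+8(%si)):string
```

The shell script appends this string to kprobe_events, trying sys_execve first and falling back to stub_execve and then do_execve if the write fails.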

The Python version, which depends on bcc and BPF, is as follows:


#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# execsnoop Trace new processes via exec() syscalls.
#           For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: execsnoop [-h] [-T] [-t] [-x] [-q] [-n NAME] [-l LINE]
#                  [--max-args MAX_ARGS]
#
# This currently will print up to a maximum of 19 arguments, plus the process
# name, so 20 fields in total (MAXARG).
#
# This won't catch all new processes: an application may fork() but not exec().
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 07-Feb-2016   Brendan Gregg   Created this.

from __future__ import print_function
from bcc import BPF
from bcc.utils import ArgString, printb
import bcc.utils as utils
import argparse
import re
import time
import pwd
from collections import defaultdict
from time import strftime


def parse_uid(user):
    try:
        result = int(user)
    except ValueError:
        try:
            user_info = pwd.getpwnam(user)
        except KeyError:
            raise argparse.ArgumentTypeError(
                "{0!r} is not valid UID or user entry".format(user))
        else:
            return user_info.pw_uid
    else:
        # Maybe validate if UID < 0 ?
        return result


# arguments
examples = """examples:
    ./execsnoop           # trace all exec() syscalls
    ./execsnoop -x        # include failed exec()s
    ./execsnoop -T        # include time (HH:MM:SS)
    ./execsnoop -U        # include UID
    ./execsnoop -u 1000   # only trace UID 1000
    ./execsnoop -u user   # get user UID and trace only them
    ./execsnoop -t        # include timestamps
    ./execsnoop -q        # add "quotemarks" around arguments
    ./execsnoop -n main   # only print command lines containing "main"
    ./execsnoop -l tpkg   # only print command where arguments contains "tpkg"
    ./execsnoop --cgroupmap ./mappath  # only trace cgroups in this BPF map
"""
parser = argparse.ArgumentParser(
    description="Trace exec() syscalls",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-T", "--time", action="store_true",
    help="include time column on output (HH:MM:SS)")
parser.add_argument("-t", "--timestamp", action="store_true",
    help="include timestamp on output")
parser.add_argument("-x", "--fails", action="store_true",
    help="include failed exec()s")
parser.add_argument("--cgroupmap",
    help="trace cgroups in this BPF map only")
parser.add_argument("-u", "--uid", type=parse_uid, metavar='USER',
    help="trace this UID only")
parser.add_argument("-q", "--quote", action="store_true",
    help="Add quotemarks (\") around arguments."
    )
parser.add_argument("-n", "--name",
    type=ArgString,
    help="only print commands matching this name (regex), any arg")
parser.add_argument("-l", "--line",
    type=ArgString,
    help="only print commands where arg contains this line (regex)")
parser.add_argument("-U", "--print-uid", action="store_true",
    help="print UID column")
parser.add_argument("--max-args", default="20",
    help="maximum number of arguments parsed and displayed, defaults to 20")
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
args = parser.parse_args()

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/fs.h>

#define ARGSIZE  128

enum event_type {
    EVENT_ARG,
    EVENT_RET,
};

struct data_t {
    u32 pid;  // PID as in the userspace term (i.e. task->tgid in kernel)
    u32 ppid; // Parent PID as in the userspace term (i.e task->real_parent->tgid in kernel)
    u32 uid;
    char comm[TASK_COMM_LEN];
    enum event_type type;
    char argv[ARGSIZE];
    int retval;
};

#if CGROUPSET
BPF_TABLE_PINNED("hash", u64, u64, cgroupset, 1024, "CGROUPPATH");
#endif
BPF_PERF_OUTPUT(events);

static int __submit_arg(struct pt_regs *ctx, void *ptr, struct data_t *data)
{
    bpf_probe_read(data->argv, sizeof(data->argv), ptr);
    events.perf_submit(ctx, data, sizeof(struct data_t));
    return 1;
}

static int submit_arg(struct pt_regs *ctx, void *ptr, struct data_t *data)
{
    const char *argp = NULL;
    bpf_probe_read(&argp, sizeof(argp), ptr);
    if (argp) {
        return __submit_arg(ctx, (void *)(argp), data);
    }
    return 0;
}

int syscall__execve(struct pt_regs *ctx,
    const char __user *filename,
    const char __user *const __user *__argv,
    const char __user *const __user *__envp)
{

    u32 uid = bpf_get_current_uid_gid() & 0xffffffff;

    UID_FILTER

#if CGROUPSET
    u64 cgroupid = bpf_get_current_cgroup_id();
    if (cgroupset.lookup(&cgroupid) == NULL) {
      return 0;
    }
#endif

    // create data here and pass to submit_arg to save stack space (#555)
    struct data_t data = {};
    struct task_struct *task;

    data.pid = bpf_get_current_pid_tgid() >> 32;

    task = (struct task_struct *)bpf_get_current_task();
    // Some kernels, like Ubuntu 4.13.0-generic, return 0
    // as the real_parent->tgid.
    // We use the get_ppid function as a fallback in those cases. (#1883)
    data.ppid = task->real_parent->tgid;

    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    data.type = EVENT_ARG;

    __submit_arg(ctx, (void *)filename, &data);

    // skip first arg, as we submitted filename
    #pragma unroll
    for (int i = 1; i < MAXARG; i++) {
        if (submit_arg(ctx, (void *)&__argv[i], &data) == 0)
             goto out;
    }

    // handle truncated argument list
    char ellipsis[] = "...";
    __submit_arg(ctx, (void *)ellipsis, &data);
out:
    return 0;
}

int do_ret_sys_execve(struct pt_regs *ctx)
{
#if CGROUPSET
    u64 cgroupid = bpf_get_current_cgroup_id();
    if (cgroupset.lookup(&cgroupid) == NULL) {
      return 0;
    }
#endif

    struct data_t data = {};
    struct task_struct *task;

    u32 uid = bpf_get_current_uid_gid() & 0xffffffff;
    UID_FILTER

    data.pid = bpf_get_current_pid_tgid() >> 32;
    data.uid = uid;

    task = (struct task_struct *)bpf_get_current_task();
    // Some kernels, like Ubuntu 4.13.0-generic, return 0
    // as the real_parent->tgid.
    // We use the get_ppid function as a fallback in those cases. (#1883)
    data.ppid = task->real_parent->tgid;

    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    data.type = EVENT_RET;
    data.retval = PT_REGS_RC(ctx);
    events.perf_submit(ctx, &data, sizeof(data));

    return 0;
}
"""

bpf_text = bpf_text.replace("MAXARG", args.max_args)

if args.uid is not None:    # "is not None" so that -u 0 (root) also filters
    bpf_text = bpf_text.replace('UID_FILTER',
        'if (uid != %s) { return 0; }' % args.uid)
else:
    bpf_text = bpf_text.replace('UID_FILTER', '')
if args.cgroupmap:
    bpf_text = bpf_text.replace('CGROUPSET', '1')
    bpf_text = bpf_text.replace('CGROUPPATH', args.cgroupmap)
else:
    bpf_text = bpf_text.replace('CGROUPSET', '0')
if args.ebpf:
    print(bpf_text)
    exit()

# initialize BPF
b = BPF(text=bpf_text)
execve_fnname = b.get_syscall_fnname("execve")
b.attach_kprobe(event=execve_fnname, fn_name="syscall__execve")
b.attach_kretprobe(event=execve_fnname, fn_name="do_ret_sys_execve")

# header
if args.time:
    print("%-9s" % ("TIME"), end="")
if args.timestamp:
    print("%-8s" % ("TIME(s)"), end="")
if args.print_uid:
    print("%-6s" % ("UID"), end="")
print("%-16s %-6s %-6s %3s %s" % ("PCOMM", "PID", "PPID", "RET", "ARGS"))

class EventType(object):
    EVENT_ARG = 0
    EVENT_RET = 1

start_ts = time.time()
argv = defaultdict(list)

# This is best-effort PPID matching. Short-lived processes may exit
# before we get a chance to read the PPID.
# This is a fallback for when fetching the PPID from task->real_parent->tgid
# returns 0, which happens in some kernel versions.
def get_ppid(pid):
    try:
        with open("/proc/%d/status" % pid) as status:
            for line in status:
                if line.startswith("PPid:"):
                    return int(line.split()[1])
    except IOError:
        pass
    return 0

# process event
def print_event(cpu, data, size):
    event = b["events"].event(data)
    skip = False

    if event.type == EventType.EVENT_ARG:
        argv[event.pid].append(event.argv)
    elif event.type == EventType.EVENT_RET:
        if event.retval != 0 and not args.fails:
            skip = True
        if args.name and not re.search(bytes(args.name), event.comm):
            skip = True
        if args.line and not re.search(bytes(args.line),
                                       b' '.join(argv[event.pid])):
            skip = True
        if args.quote:
            argv[event.pid] = [
                b"\"" + arg.replace(b"\"", b"\\\"") + b"\""
                for arg in argv[event.pid]
            ]

        if not skip:
            if args.time:
                printb(b"%-9s" % strftime("%H:%M:%S").encode('ascii'), nl="")
            if args.timestamp:
                printb(b"%-8.3f" % (time.time() - start_ts), nl="")
            if args.print_uid:
                printb(b"%-6d" % event.uid, nl="")
            ppid = event.ppid if event.ppid > 0 else get_ppid(event.pid)
            ppid = b"%d" % ppid if ppid > 0 else b"?"
            argv_text = b' '.join(argv[event.pid]).replace(b'\n', b'\\n')
            printb(b"%-16s %-6d %-6s %3d %s" % (event.comm, event.pid,
                   ppid, event.retval, argv_text))
        try:
            del(argv[event.pid])
        except Exception:
            pass


# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
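
The print_event callback above relies on a simple pattern: each argument arrives as its own EVENT_ARG event and is buffered per PID, and the EVENT_RET event from the kretprobe flushes the buffered command line. The following is a simplified sketch of that pattern without bcc; the `Event` tuple and the `assemble` function are hypothetical stand-ins for the perf buffer events, not part of the tool.

```python
from collections import defaultdict, namedtuple

EVENT_ARG, EVENT_RET = 0, 1

# Hypothetical stand-in for the struct data_t events delivered by the perf buffer.
Event = namedtuple("Event", "type pid comm argv retval")


def assemble(events, include_failed=False):
    """Group per-argument events by PID and emit one row per finished exec."""
    pending = defaultdict(list)     # pid -> arguments seen so far
    rows = []
    for ev in events:
        if ev.type == EVENT_ARG:
            pending[ev.pid].append(ev.argv)
        elif ev.type == EVENT_RET:
            # Always drop the buffered arguments, even when the row is skipped.
            args = pending.pop(ev.pid, [])
            if ev.retval == 0 or include_failed:
                rows.append((ev.pid, ev.retval, b" ".join(args)))
    return rows


events = [
    Event(EVENT_ARG, 1, b"sh", b"/bin/ls", 0),
    Event(EVENT_ARG, 2, b"sh", b"/bin/nope", 0),
    Event(EVENT_ARG, 1, b"sh", b"-l", 0),
    Event(EVENT_RET, 2, b"sh", b"", -2),
    Event(EVENT_RET, 1, b"ls", b"", 0),
]
print(assemble(events))
```

Because events from different processes can interleave, keying the buffer by PID keeps each command line intact; passing `include_failed=True` corresponds to the tool's -x option.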