Introduction to seccomp: BPF linux syscall filter

May 29, 2014   #containers  #linux  #seccomp  #security 

Seccomp is a basic yet efficient way to filter the syscalls issued by a program. It is especially useful when running untrusted third-party programs. It was first introduced in Linux 2.6.12 as an essential building block of the “cpushare” program. The idea behind this project was to allow anyone with the proper agent installed to rent out CPU cycles to third parties without compromising their own security.

The initial implementation, also known as “mode 1 seccomp”, only allowed the 'read', 'write', '_exit' and 'sigreturn' syscalls, making it only possible to read from and write to already opened files and to exit. It is also trivial to get started with:

#include <stdio.h>         /* printf */
#include <sys/prctl.h>     /* prctl */
#include <linux/seccomp.h> /* seccomp's constants */
#include <unistd.h>        /* dup2: just for test */

int main() {
  printf("step 1: unrestricted\n");

  // Enable filtering
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  printf("step 2: only 'read', 'write', '_exit' and 'sigreturn' syscalls\n");
  
  // Redirect stderr to stdout
  dup2(1, 2);
  printf("step 3: !! YOU SHOULD NOT SEE ME !!\n");

  // Success (well, not so in this case...)
  return 0; 
}

Build, run, test:

gcc 01-nothing.c -o 01-nothing && ./01-nothing; echo "Status: $?"

Output:

step 1: unrestricted
step 2: only 'read', 'write', '_exit' and 'sigreturn' syscalls
Killed
Status: 137        <------ 128+9 ==> SIGKILL

See the return status? Whenever a forbidden syscall is issued (here, dup2), the program is immediately killed with SIGKILL, so step 3 is never printed.
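The 128+9 arithmetic in the status line is the shell's convention for a process terminated by a signal. As a minimal sketch (the helper name shell_status_for_signal is made up for this illustration and is not part of any seccomp API), the same number can be recomputed from the raw wait status:

```c
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that dies from `sig`, then return the
 * status a POSIX shell would show in $? for it. Returns -1 on error. */
static int shell_status_for_signal(int sig) {
  pid_t pid = fork();
  if (pid < 0) return -1;

  if (pid == 0) {
    /* Restore the default disposition in case the signal was ignored
     * (this call simply fails for SIGKILL, which cannot be changed). */
    signal(sig, SIG_DFL);
    raise(sig);
    _exit(0); /* only reached if the signal did not terminate us */
  }

  int status;
  if (waitpid(pid, &status, 0) < 0) return -1;

  if (WIFSIGNALED(status)) return 128 + WTERMSIG(status); /* killed by a signal */
  if (WIFEXITED(status)) return WEXITSTATUS(status);      /* normal exit */
  return -1;
}
```

Calling shell_status_for_signal(SIGKILL) should give 137, the value seen in the output above.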

While this is really cool, it is also somewhat over-restrictive, which is why it saw so little adoption. Linus Torvalds even suggested axing it from the kernel!

Fortunately, since Linux 3.5, it is also possible to define advanced custom filters based on BPF (Berkeley Packet Filter). These filters can inspect any of the syscall arguments, but only their raw values. In other words, a filter cannot dereference a pointer. For example, one could write a rule that forbids any call to 'dup2' targeting 'stderr' (fd=2), but could not restrict 'open' to a given set of files, nor restrict 'bind' to a specific interface or port number.

Once installed, each syscall is sent to the filter which tells what action to take:

  • SECCOMP_RET_KILL: Immediate kill with SIGSYS
  • SECCOMP_RET_TRAP: Send a catchable SIGSYS, giving a chance to emulate the syscall
  • SECCOMP_RET_ERRNO: Force errno value
  • SECCOMP_RET_TRACE: Yield decision to ptracer or set errno to -ENOSYS
  • SECCOMP_RET_ALLOW: Allow
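To make the value-only restriction and the return actions concrete, here is a rough sketch written against the raw kernel interface rather than a helper library. Several things are assumptions of this illustration: the function names install_filter and dup3_filter_demo are made up, it filters 'dup3' instead of 'dup2' because 'dup3' exists on every architecture, it assumes a little-endian machine, and it skips the architecture check that a production filter should perform first.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>        /* offsetof */
#include <linux/filter.h>  /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h> /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/prctl.h>
#include <sys/syscall.h>   /* SYS_dup3 */
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: make dup3(*, 2, *) fail with EPERM, allow the rest.
 * Simplification: no check of seccomp_data.arch. */
static int install_filter(void) {
  struct sock_filter insns[] = {
    /* [0] load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not dup3? jump 3 instructions ahead, to ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_dup3, 0, 3),
    /* [2] load the low 32 bits of args[1], the target fd (little-endian) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
    /* [3] target is not fd 2? jump 1 instruction ahead, to ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 0, 1),
    /* [4] refuse: the syscall is skipped and fails with errno = EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [5] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };
  struct sock_fprog prog = {
    .len = sizeof(insns) / sizeof(insns[0]),
    .filter = insns,
  };
  /* Unprivileged processes must set no_new_privs before loading a filter. */
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) return -1;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical demo: returns 0 when the filter behaves as described. */
static int dup3_filter_demo(void) {
  pid_t pid = fork();
  if (pid < 0) return -1;

  if (pid == 0) {
    if (install_filter() != 0) _exit(2);
    /* Target fd is 2: the filter answers EPERM. */
    if (syscall(SYS_dup3, STDOUT_FILENO, 2, 0) != -1 || errno != EPERM) _exit(3);
    /* Same syscall, different value in args[1]: allowed. */
    if (syscall(SYS_dup3, STDOUT_FILENO, 42, 0) != 42) _exit(4);
    _exit(0);
  }

  int status;
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything the filter can look at is in struct seccomp_data: the syscall number, the architecture, the instruction pointer and six 64-bit argument values. There is no instruction that follows one of those values as a pointer.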

Enough words. Let’s allow the program to redirect its stderr to stdout but nothing else. Writing BPF directly is cumbersome and far beyond the scope of this post, so we’ll use the libseccomp helper to make the code easier to write… and read. Error checking is stripped for brevity.

Grab the library:

sudo apt-get install libseccomp-dev

Write the code:

#include <stdio.h>   /* printf */
#include <unistd.h>  /* dup2: just for test */
#include <seccomp.h> /* libseccomp */

int main() {
  printf("step 1: unrestricted\n");

  // Init the filter
  scmp_filter_ctx ctx;
  ctx = seccomp_init(SCMP_ACT_KILL); // default action: kill

  // setup basic whitelist
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
  
  // setup our rule
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(dup2), 2, 
                        SCMP_A0(SCMP_CMP_EQ, 1),
                        SCMP_A1(SCMP_CMP_EQ, 2));

  // build and load the filter
  seccomp_load(ctx);
  printf("step 2: only 'write' and dup2(1, 2) syscalls\n");
  
  // Redirect stderr to stdout
  dup2(1, 2);
  printf("step 3: stderr redirected to stdout\n");

  // Duplicate stderr to arbitrary fd
  dup2(2, 42);
  printf("step 4: !! YOU SHOULD NOT SEE ME !!\n");

  // Success (well, not so in this case...)
  return 0; 
}

Build, run, test:

gcc 02-bpf-only-dup-sudo.c -o 02-bpf-only-dup-sudo -lseccomp && sudo ./02-bpf-only-dup-sudo; echo "Status: $?"

Output:

step 1: unrestricted
step 2: only 'write' and dup2(1, 2) syscalls
step 3: stderr redirected to stdout
Bad system call
Status: 159        <------ 128+31 ==> SIGSYS

Just as expected.

As you probably noticed, we ran the previous example as root, which somewhat limits the security benefit of syscall filtering since we actually have MORE privileges than before…

This is where it really gets interesting: filters are inherited by child processes, so one could in theory apply syscall filters to 'sudo', defeat some of its security measures and gain root on the machine. To prevent this, a process installing a filter must either have 'CAP_SYS_ADMIN' (read: be root) or explicitly agree never to gain any more privileges. In the latter case, for example, the 'setuid' bit of 'sudo' would not be honored.

This can easily be achieved by adding this snippet before installing the filter:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); /* the unused arguments must be 0 */

Another security note: remember the SECCOMP_RET_TRACE filter return value? It instructs the kernel to notify the ptracer, if any, and let it make the final decision. Hence the “secured” program could be run under a malicious ptracer, possibly defeating the security measures. This is why another prctl is highly recommended to forbid any attempt to attach a ptracer:

prctl(PR_SET_DUMPABLE, 0);
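Since both calls can fail, it may be worth checking them and reading the values back. The following is only a sketch: harden_process and harden_in_child are names invented for this example, and the fork is there purely so that the calling process is left untouched.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: apply both settings and verify them.
 * Returns 0 on success, -1 on failure. */
static int harden_process(void) {
  /* Irreversible, and inherited across fork and execve. */
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
    perror("PR_SET_NO_NEW_PRIVS");
    return -1;
  }
  /* Mark the process as non-dumpable. */
  if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) != 0) {
    perror("PR_SET_DUMPABLE");
    return -1;
  }
  /* Read both flags back: expect 1 (no new privs) and 0 (not dumpable). */
  if (prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) != 1) return -1;
  if (prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) != 0) return -1;
  return 0;
}

/* Hypothetical demo: run the helper in a child and report its result
 * (0 means both settings were applied and verified). */
static int harden_in_child(void) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) _exit(harden_process() == 0 ? 0 : 1);

  int status;
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a real program, harden_process() would be called directly, before the filter is loaded.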

Putting it all together we get:

#include <stdio.h>     /* printf */
#include <unistd.h>    /* dup2: just for test */
#include <seccomp.h>   /* libseccomp */
#include <sys/prctl.h> /* prctl */

int main() {
  printf("step 1: unrestricted\n");

  // ensure none of our children will ever be granted more priv
  // (via setuid, capabilities, ...)
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
  // ensure no escape is possible via ptrace
  prctl(PR_SET_DUMPABLE, 0);

  // Init the filter
  scmp_filter_ctx ctx;
  ctx = seccomp_init(SCMP_ACT_KILL); // default action: kill

  // setup basic whitelist
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
  
  // setup our rule
  seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(dup2), 2, 
                        SCMP_A0(SCMP_CMP_EQ, 1),
                        SCMP_A1(SCMP_CMP_EQ, 2));

  // build and load the filter
  seccomp_load(ctx);
  printf("step 2: only 'write' and dup2(1, 2) syscalls\n");
  
  // Redirect stderr to stdout
  dup2(1, 2);
  printf("step 3: stderr redirected to stdout\n");

  // Duplicate stderr to arbitrary fd
  dup2(2, 42);
  printf("step 4: !! YOU SHOULD NOT SEE ME !!\n");

  // Success (well, not so in this case...)
  return 0;
}

Build, run, test:

gcc 03-bpf-only-dup.c -o 03-bpf-only-dup -lseccomp && ./03-bpf-only-dup; echo "Status: $?"

Output:

step 1: unrestricted
step 2: only 'write' and dup2(1, 2) syscalls
step 3: stderr redirected to stdout
Bad system call
Status: 159        <------ 128+31 ==> SIGSYS

There we are: no more “sudo” to run it :)

Seccomp is an extremely powerful tool when dealing with untrusted programs on Linux (who said “shared hosting environment”?), and we have only scratched its surface. Please keep in mind that seccomp is only a tool and should be used in combination with other Linux security building blocks such as namespaces and capabilities to unleash its full power.

Example applications:

  • prevent “virtual priv esc” -> clone && unshare CLONE_NEWUSER
  • prevent std{in,out,err} escape -> block close, dup2
  • restrict read/write to std{in,out,err}
  • change limits (rlimits)
  • … -> see man 2 syscalls for more ideas 😉

What you still can’t do:

  • filter based on filename: no pointer dereference
  • filter based on port/ip: same reason
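A small illustration of why these two are out of reach, assuming nothing beyond standard C (the helper names as_syscall_arg and same_name_different_arg are invented for this example): for a path or a sockaddr, the value a filter receives is the address of the buffer, so the same filename can show up as two unrelated numbers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the 64-bit value a filter would find in the argument
 * slot for a pointer argument is simply the address of the buffer. */
static uint64_t as_syscall_arg(const char *path) {
  return (uint64_t)(uintptr_t)path;
}

/* Returns 1 when two buffers hold the same filename yet would reach the
 * filter as two different argument values. */
static int same_name_different_arg(void) {
  char a[] = "/etc/passwd";
  char b[] = "/etc/passwd";
  return strcmp(a, b) == 0 && as_syscall_arg(a) != as_syscall_arg(b);
}
```

A rule comparing that argument against a constant would therefore match an address, not a filename.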

Going further: