Introduction to Linux namespaces - Part 4: NS (FS)

Jan 12, 2014   #linux  #namespace 

Following the previous post on the PID namespace, we will now have a look at an amazing one: the mount namespace, which gives a process its own isolated mount table. If you haven’t done so already, I encourage you to read the first post of this series for an introduction to the Linux namespace isolation mechanism.

[EDIT 2014-01-08] A Chinese translation of this post is available here

In the previous post we “chrooted” the PID namespace and got a new “1” process. But even with this namespace activated, tools like “top” were still not isolated because they rely on the “/proc” virtual filesystem, which is still shared (identical) between namespaces. In this post, let me introduce the namespace that will solve this: “NS”. This is historically the first Linux namespace, hence the name.

Activating it is only a matter of adding “CLONE_NEWNS” to the “clone” call. It requires no additional setup. It may also be freely combined with other namespaces.

Once activated, any (un)mount operation from the child will only affect the child, and vice versa.
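One caveat worth checking on your own machine: this only holds when the mounts involved use private propagation. On systems where “/” is mounted as shared (a common default under systemd, as far as I know), a mount done in the child can propagate back to the parent. The sketch below is not part of the original example, and the helper name “mountinfo_line_is_shared” is hypothetical. It inspects one line of “/proc/self/mountinfo” and reports whether its optional fields carry a “shared:” tag.

```c
#include <string.h>

// Hypothetical helper (not from the original post).
// Returns 1 if a "/proc/self/mountinfo" line describes a mount with shared
// propagation, 0 if it does not, -1 if the line is malformed.
// Line layout, as documented in proc(5):
//   ID PARENT MAJ:MIN ROOT MOUNTPOINT OPTIONS [optional fields] - FSTYPE SOURCE SUPEROPTS
int mountinfo_line_is_shared(const char *line)
{
  // the optional fields end at the " - " separator
  const char *sep = strstr(line, " - ");
  if (sep == NULL)
    return -1;

  // skip the 6 mandatory fields that precede the optional ones
  const char *p = line;
  for (int i = 0; i < 6; i++) {
    p = strchr(p, ' ');
    if (p == NULL || p > sep)
      return -1;
    p++;
  }

  // scan the optional fields, which span [p, sep)
  while (p < sep) {
    if (strncmp(p, "shared:", 7) == 0)
      return 1;
    const char *next = memchr(p, ' ', (size_t)(sep - p));
    if (next == NULL)
      break;
    p = next + 1;
  }
  return 0;
}
```

Feed it each line read with fgets() from “/proc/self/mountinfo” and look at the one whose mount point is “/”. If it reports 1, calling mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) in the child before any other mount is the usual way to stop the propagation.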

Let’s start experimenting. In the previous example, just activate the NS:

int child_pid = clone(child_main, child_stack+STACK_SIZE, 
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);

Now, if we run it, we can finally fix the issue from the previous post on PID:

jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall ns.c -o ns && sudo ./ns
 - [14472] Hello ?
 - [    1] World !
root@In Namespace:~/blog# mount -t proc proc /proc
root@In Namespace:~/blog# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.0  0.0  23620  4680 pts/4    S    00:07   0:00 /bin/bash
root        79  0.0  0.0  18492  1328 pts/4    R+   00:07   0:00 ps aux
root@In Namespace:~/blog# exit

Tadaaa ! “/proc” is now working as expected from the container, without breaking the parent.

Let’s automate it to finalize the previous post’s example:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

// sync primitive
int checkpoint[2];

static char child_stack[STACK_SIZE];
char* const child_args[] = {
  "/bin/bash",
  NULL
};

int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);

  // setup hostname
  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);

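  // keep our mounts out of the parent: "/" may be mounted as shared,
  // in which case mount events would propagate back to it
  if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
    perror("make / private");

  // remount "/proc" to get accurate "top" && "ps" output
  if (mount("proc", "/proc", "proc", 0, NULL) == -1)
    perror("mount /proc");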

  // wait...
  read(checkpoint[0], &c, 1);

  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
  if (child_pid == -1) {
    // typically EPERM when not run as root
    perror("clone");
    return 1;
  }

  // further init here (nothing yet)

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

If you run this snippet, you should get exactly the same behavior as the previous test, without manually remounting “/proc” or messing with the parent’s real “/proc”. Neat, isn’t it?
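If you would like to double-check that parent and child really sit in different mount namespaces, one option is to compare their namespace links under “/proc”. This is a minimal sketch, assuming a kernel that exposes “/proc/self/ns/mnt” (Linux 3.8 or later, as far as I know). The helper name “mnt_ns_name” is hypothetical and not part of the original example.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

// Hypothetical helper (not from the original post).
// Copies the caller's mount namespace identifier, something like
// "mnt:[4026531840]", into buf. Returns 0 on success, -1 on error.
int mnt_ns_name(char *buf, size_t len)
{
  if (len == 0)
    return -1;

  // "/proc/self" always resolves to the calling process, even when "/proc"
  // still belongs to the parent's PID namespace
  ssize_t n = readlink("/proc/self/ns/mnt", buf, len - 1);
  if (n == -1)
    return -1;

  // readlink() does not NUL-terminate the result
  buf[n] = '\0';
  return 0;
}
```

Call it once in “main” and once in “child_main”, and print both results: two different numbers between the brackets mean two different mount tables.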

To leverage the power of this technique you could now prepare and enter a chroot to further enhance the isolation. The steps involved would be to prepare a root filesystem with “debootstrap”, remount some essential filesystems like “/tmp”, “/dev/shm”, “/proc”, optionally all or part of “/dev” and “/sys”, and then “chdir” + “chroot”. I’ll leave it as an exercise for the reader.
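As a starting point for that exercise, here is one way the “chdir” + “chroot” and remount steps could look. It is a sketch under assumptions, not a finished solution. The helper name “enter_rootfs” is hypothetical, the root filesystem is assumed to have been prepared beforehand (for example with “debootstrap”) and to already contain “/proc”, “/tmp” and “/dev/shm”, and the optional “/dev” and “/sys” parts are left out.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical helper (not from the original post), meant to be called
// from child_main() before execv(), inside the new mount namespace.
// Returns 0 on success, -1 on the first failing step.
int enter_rootfs(const char *rootfs)
{
  // "chdir" + "chroot": move into the prepared tree, then make it "/"
  if (chdir(rootfs) == -1) { perror("chdir rootfs"); return -1; }
  if (chroot(".") == -1)   { perror("chroot");       return -1; }
  if (chdir("/") == -1)    { perror("chdir /");      return -1; }

  // remount the essential filesystems, now relative to the new root
  if (mount("proc", "/proc", "proc", 0, NULL) == -1) {
    perror("mount /proc");
    return -1;
  }
  if (mount("tmpfs", "/tmp", "tmpfs", 0, NULL) == -1) {
    perror("mount /tmp");
    return -1;
  }
  if (mount("tmpfs", "/dev/shm", "tmpfs", 0, NULL) == -1) {
    perror("mount /dev/shm");
    return -1;
  }
  return 0;
}
```

Because the mounts happen after the “chroot”, they land inside the new tree, and because the child lives in its own mount namespace they should stay invisible to the parent as long as propagation is private.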

That’s all for the “NS” namespace. In the next article we’ll explore an incredibly powerful namespace: “NET”. It’s so powerful that it’s used as the foundation of the “CORE” lightweight network simulator. Thanks for reading!