Following the previous post on the PID namespace (process ID isolation), we will now have a look at an amazing one: the isolated mount table. If you haven’t done so already, I encourage you to read the first post of this series for an introduction to the Linux namespace isolation mechanism.
[EDIT 2014-01-08] A Chinese translation of this post is available here
In the previous post we “chrooted” the PID namespace and got a new process “1”. But even with this namespace activated, tools like “top” were still not isolated, because they rely on the “/proc” virtual filesystem, which is still shared (identical) between namespaces. In this post, let me introduce the namespace that solves this: “NS”. It is historically the first Linux namespace, hence the name.
Activating it is only a matter of adding “CLONE_NEWNS” to the “clone” call. It requires no additional setup. It may also be freely combined with other namespaces.
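Once activated, any (un)mount operation made by the child only affects the child, and vice versa. One caveat: this holds when mounts use private propagation. On systems where “/” is a shared mount (the default under systemd), mount events keep propagating between namespaces until the child marks its mounts as private.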
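Whether mount events cross the namespace boundary depends on the propagation type of each mount, which the kernel reports in “/proc/self/mountinfo” as an optional “shared:N” tag. Here is a minimal sketch, assuming a Linux system. The helper names “mountinfo_line_is_shared” and “make_mounts_private” are mine for illustration and not part of any library. The first one tells whether a mountinfo line describes a shared mount, the second one marks the whole mount tree as private and is meant to be called as root from inside the new namespace.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical helper: returns 1 if a line read from /proc/self/mountinfo
 * describes a shared mount, 0 otherwise. The "shared:N" tag is an optional
 * field located before the " - " separator of the line. */
int mountinfo_line_is_shared(const char *line)
{
    const char *sep = strstr(line, " - ");
    const char *tag = strstr(line, " shared:");

    return sep != NULL && tag != NULL && tag < sep;
}

/* Hypothetical helper: recursively mark every mount visible from the
 * current namespace as private, so that later (un)mounts stay local.
 * Returns 0 on success, -1 on failure (errno is set by mount). */
int make_mounts_private(void)
{
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount MS_PRIVATE");
        return -1;
    }
    return 0;
}
```

In the child, calling “make_mounts_private()” before any other mount should be enough to keep later mounts out of the parent namespace.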
Let’s start experimenting. In the previous example, just activate the NS:
```c
int child_pid = clone(child_main, child_stack+STACK_SIZE,
    CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
```
Now, if we run it, we can finally fix the issue from the previous post on PID:
```
jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall ns.c -o ns && sudo ./ns
 - [14472] Hello ?
 - [    1] World !
root@In Namespace:~/blog# mount -t proc proc /proc
root@In Namespace:~/blog# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.0  0.0  23620  4680 pts/4    S    00:07   0:00 /bin/bash
root        79  0.0  0.0  18492  1328 pts/4    R+   00:07   0:00 ps aux
root@In Namespace:~/blog# exit
```
Tadaaa ! “/proc” is now working as expected from the container, without breaking the parent.
Let’s automate it to finalize the previous post’s example:
```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

// sync primitive
int checkpoint[2];

static char child_stack[STACK_SIZE];
char* const child_args[] = {
  "/bin/bash",
  NULL
};

int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);

  // setup hostname
  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);

  // keep our mounts from propagating to the parent namespace; needed where
  // "/" is a shared mount (e.g. under systemd), harmless otherwise
  if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
    perror("mount MS_PRIVATE");

  // remount "/proc" to get accurate "top" && "ps" output
  if (mount("proc", "/proc", "proc", 0, NULL) == -1)
    perror("mount /proc");

  // wait...
  read(checkpoint[0], &c, 1);

  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
  if (child_pid == -1) {
    perror("clone");
    return 1;
  }

  // further init here (nothing yet)

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}
```
If you run this snippet, you should get exactly the same behavior as in the previous test, without manually remounting “/proc” and without messing with the parent’s real “/proc”. Neat, isn’t it ?
To leverage the power of this technique, you could now prepare and enter a chroot to further enhance the isolation. The steps involved would be to prepare a “debootstrap”, remount some essential filesystems like “/tmp”, “/dev/shm” and “/proc”, optionally all or part of “/dev” and “/sys”, and then “chdir” + “chroot”. I’ll leave it as an exercise for the reader.
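For readers who want a starting point, the steps above could be sketched as follows. This is only a rough outline under a few assumptions: “rootfs” points to an already prepared tree (for example one produced by “debootstrap”), the caller is root inside the new mount namespace, and the helper names “mount_under” and “enter_rootfs” are hypothetical ones chosen for this example. “/dev” and “/sys” are left out.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: mount a fresh `fstype` instance on <rootfs><subdir>.
 * Returns 0 on success, -1 on failure. */
static int mount_under(const char *rootfs, const char *subdir,
                       const char *fstype)
{
    char target[PATH_MAX];
    int n = snprintf(target, sizeof(target), "%s%s", rootfs, subdir);

    if (n < 0 || (size_t)n >= sizeof(target))
        return -1; /* path does not fit in the buffer */

    if (mount(fstype, target, fstype, 0, NULL) == -1) {
        perror(target);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: mount the essential filesystems inside `rootfs`,
 * then "chdir" + "chroot" into it. Returns 0 on success, -1 on failure. */
int enter_rootfs(const char *rootfs)
{
    if (mount_under(rootfs, "/proc", "proc") == -1 ||
        mount_under(rootfs, "/tmp", "tmpfs") == -1 ||
        mount_under(rootfs, "/dev/shm", "tmpfs") == -1)
        return -1;

    if (chdir(rootfs) == -1 || chroot(".") == -1 || chdir("/") == -1) {
        perror("chroot");
        return -1;
    }
    return 0;
}
```

In “child_main”, a call to “enter_rootfs()” with the path of your own tree would take the place of the plain “/proc” remount, just before “execv”.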
That’s all for the “NS” namespace. In the next article, we’ll explore an incredibly powerful namespace: “NET”. It’s so powerful that it’s used as the foundation of the “CORE” lightweight network simulator. Thanks for reading !