Following the previous post on the IPC namespace (inter-process communication isolation), I would now like to introduce my personal favorite (as a sysadmin): PID namespaces. If you haven’t done so already, I encourage you to read the first post of this series for an introduction to the Linux namespace isolation mechanism.
[EDIT 2014-01-08] A Chinese translation of this post is available here
Yes, that’s it: with this namespace it is possible to restart PID numbering and get your own “1” process. This could be seen as a “chroot” in the process identifier tree. It’s extremely handy when you need to deal with PIDs in day-to-day work and are stuck with 4-digit numbers…
Activating it is only a matter of adding “CLONE_NEWPID” to the “clone” call. It requires no additional setup. It may also be freely combined with other namespaces.
Once activated, the result of getpid() from the child process will invariably be “1”.
But, WAIT! I now have two “1” processes, right ? What about process management ?
Well, actually, this *really* is much like a “chroot”. That is to say, a change of viewpoint.
- Host: all processes are visible, global PIDs (init=1, …, child=xxx, ….)
- Container: only the child + its descendants are visible, local PIDs (child=1, …)
Here is an illustration:
```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

// sync primitive
int checkpoint[2];

static char child_stack[STACK_SIZE];
char* const child_args[] = {
  "/bin/bash",
  NULL
};

int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);
  // wait until the parent closes its end of the pipe
  read(checkpoint[0], &c, 1);

  // inside the new PID namespace, this prints 1
  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);
  execv(child_args[0], child_args);
  // only reached if execv failed
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

  // the stack grows downward, hence the pointer to its top
  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | SIGCHLD, NULL);
  if (child_pid == -1) {
    // typically EPERM when not run as root
    perror("clone");
    return 1;
  }

  // further init here (nothing yet)

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}
```
And an example run:
```
jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall main-3-pid.c -o ns && sudo ./ns
 - [ 7823] Hello ?
 - [    1] World !
root@In Namespace:~/blog# echo "=> My PID: $$"
=> My PID: 1
root@In Namespace:~/blog# exit
```
As expected, even though the parent process has a PID of “7823”, the child’s PID is “1”. If you are playful, you could try to “kill -KILL 7823” the parent process from the child. It would do exactly… nothing:
```
jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall main-3-pid.c -o ns && sudo ./ns
 - [ 7823] Hello ?
 - [    1] World !
root@In Namespace:~/blog# kill -KILL 7823
bash: kill: (7823) - No such process
root@In Namespace:~/blog# exit
```
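The “No such process” answer can also be observed from C. Here is a minimal sketch, assuming the usual trick of sending signal 0, which only performs the existence and permission checks without delivering anything. The helper name `pid_visible` is made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), used in the usage note below */

/* Hypothetical helper (not part of the program above): returns 1 if `pid`
 * names a process in the caller's PID namespace, 0 if it does not,
 * and -1 on any other error. */
int pid_visible(pid_t pid)
{
  assert(pid > 0);         /* 0 and negative values address process groups */

  if (kill(pid, 0) == 0)   /* signal 0: checks only, nothing is delivered */
    return 1;
  if (errno == EPERM)      /* the process exists but we may not signal it */
    return 1;
  if (errno == ESRCH)      /* no such PID from our point of view */
    return 0;
  return -1;
}
```

`pid_visible(getpid())` is always 1. Called from the child with the parent's host PID (7823 in the run above), it should return 0, matching what bash reported.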
The isolation is working as expected. And, as written earlier, this behaves much like a “chroot”: running “top” or “ps exf” from the parent namespace will show the child process with its real, un-mapped PID. This is an essential feature for process control like “kill”, “cgroups”, … and various policies.
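On recent kernels (4.1 and later, as far as I know), the host can read both numbers at once: “/proc/<pid>/status” has an “NSpid:” line listing the PID of the process in each nested namespace, outermost first. The sketch below only parses such a line. The function name `parse_nspid` and the sample values in the usage note are assumptions for illustration, not output captured from the run above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse an "NSpid:" line taken from /proc/<pid>/status.
 * Stores up to `max` PIDs in `out`, outermost namespace first, and returns
 * how many were stored, or -1 if `line` is not an NSpid line. */
int parse_nspid(const char *line, long *out, int max)
{
  assert(line != NULL && out != NULL);

  if (strncmp(line, "NSpid:", 6) != 0)
    return -1;

  const char *p = line + 6;
  int n = 0;
  while (n < max) {
    char *end;
    long v = strtol(p, &end, 10);  /* skips the separating tabs itself */
    if (end == p)                  /* no digits left on the line */
      break;
    out[n++] = v;
    p = end;
  }
  return n;
}
```

For a made-up line such as `"NSpid:\t7824\t1\n"`, the function stores 7824 (the host view) and then 1 (the view from inside the namespace) and returns 2.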
Wait! Speaking of “top” and “ps exf”, I just ran them from the child and saw exactly the same as from the parent. You lied to me about isolation !
Well, not at all. This is because these tools get their information from the virtual “/proc” filesystem, which is not (yet) isolated. Fixing this is the purpose of the next article.
In the meantime, an easy workaround could be:
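To make the “/proc” dependency concrete, here is a rough sketch of how a ps-like tool might enumerate processes: every all-digit directory name in a procfs is one PID. Real tools do much more, and both function names are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* A procfs entry describes a process when its name is made of digits only. */
int is_pid_name(const char *name)
{
  assert(name != NULL);

  if (*name == '\0')
    return 0;
  for (; *name != '\0'; name++)
    if (!isdigit((unsigned char)*name))
      return 0;
  return 1;
}

/* Count the processes listed by the procfs mounted at `procdir`.
 * Returns -1 if the directory cannot be opened. */
int count_listed_processes(const char *procdir)
{
  DIR *d = opendir(procdir);
  if (d == NULL)
    return -1;

  int count = 0;
  struct dirent *e;
  while ((e = readdir(d)) != NULL)
    if (is_pid_name(e->d_name))
      count++;

  closedir(d);
  return count;
}
```

Since the child still sees the host's “/proc”, `count_listed_processes("/proc")` returns the same number on both sides. The answer depends on which procfs is mounted at that path, not on who is asking.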
```
# from child
root@In Namespace:~/blog# mkdir -p proc
root@In Namespace:~/blog# mount -t proc proc proc
root@In Namespace:~/blog# ls proc
1          dma          key-users      net           sysvipc
80         dri          kmsg           pagetypeinfo  timer_list
acpi       driver       kpagecount     partitions    timer_stats
asound     execdomains  kpageflags     sched_debug   tty
buddyinfo  fb           latency_stats  schedstat     uptime
bus        filesystems  loadavg        scsi          version
cgroups    fs           locks          self          version_signature
cmdline    interrupts   mdstat         slabinfo      vmallocinfo
consoles   iomem        meminfo        softirqs      vmstat
cpuinfo    ioports      misc           stat          zoneinfo
crypto     irq          modules        swaps
devices    kallsyms     mounts         sys
diskstats  kcore        mtrr           sysrq-trigger
```
Everything seems reasonable again. As expected, you get PID “1” for /bin/bash itself and “80” corresponds to the running “/bin/ls proc” command. Much nicer to read than usual /proc, isn’t it ? That’s why I love it.
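If you attempt to run this mount command directly on “/proc” from inside the namespace, it will seem to work in the child but BREAK your main namespace: the mount table is still shared with the host, so the host’s “/proc” ends up describing the child’s PID namespace instead of its own. Example: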
```
jean-tiare@jeantiare-Ubuntu:~/blog$ ps aux
Error, do this: mount -t proc proc /proc
```
That’s all for PID namespace. With the next article, we’ll be able to re-mount /proc itself and hence fix “top” and any similar tools without breaking the parent namespace. Thanks for reading !