Introduction to Linux namespaces - Part 3: PID

Jan 5, 2014   #linux  #namespace 

Following the previous post on the IPC namespace (Inter Process Communication isolation), I would now like to introduce my personal favorite (as a sysadmin): PID namespaces. If you haven’t done so already, I encourage you to read the first post of this series for an introduction to the Linux namespace isolation mechanism.

[EDIT 2014-01-08] A Chinese translation of this post is available here

Yes, that’s it: with this namespace it is possible to restart PID numbering and get your own “1” process. This can be seen as a “chroot” in the process identifier tree. It’s extremely handy when you need to deal with PIDs in day-to-day work and are stuck with 4-digit numbers…

Activating it is only a matter of adding “CLONE_NEWPID” to the “clone” call. It requires no additional setup. It may also be freely combined with other namespaces.

Once activated, getpid() called from the child process will invariably return “1”.

But, WAIT! I now have two “1” processes, right ? What about process management ?

Well, actually, this *really* is much like a “chroot”. That is to say, a change of viewpoint.

  • Host: all processes are visible, global PIDs (init=1, …, child=xxx, ….)
  • Container: only the child + its descendants are visible, local PIDs (child=1, …)

Here is an illustration:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

// sync primitive
int checkpoint[2];

static char child_stack[STACK_SIZE];
char* const child_args[] = {
  "/bin/bash",
  NULL
};

int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);
  // wait...
  read(checkpoint[0], &c, 1);

  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);
  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

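  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | SIGCHLD, NULL);
  if (child_pid == -1) {
    // typically EPERM: creating these namespaces requires root
    perror("clone");
    return 1;
  }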

  // further init here (nothing yet)

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

And an example run:

jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall main-3-pid.c -o ns && sudo ./ns
 - [ 7823] Hello ?
 - [    1] World !
root@In Namespace:~/blog# echo "=> My PID: $$"
=> My PID: 1
root@In Namespace:~/blog# exit

As expected, even though the parent process has a PID of “7823”, the child’s PID is “1”. If you are playful, you could try to “kill -KILL 7823” the parent process from the child. It would do exactly… nothing:

jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall main-3-pid.c -o ns && sudo ./ns
 - [ 7823] Hello ?
 - [    1] World !
root@In Namespace:~/blog# kill -KILL 7823
bash: kill: (7823) - No such process
root@In Namespace:~/blog# exit

The isolation is working as expected. And, as written earlier, this behaves much like a “chroot”: a “top” or “ps exf” run from the parent’s namespace will show the child process with its real, un-mapped PID. This is an essential feature for process control like “kill”, “cgroups”, … and various policies.

Wait! Speaking of “top” and “ps exf”, I just ran them from the child and saw exactly the same as from the parent. You lied to me about isolation !

Well, not at all. This is because these tools get their information from the virtual “/proc” filesystem, which is not (yet) isolated. This is the purpose of the next article.

In the meantime, an easy workaround could be:

# from child
root@In Namespace:~/blog# mkdir -p proc
root@In Namespace:~/blog# mount -t proc proc proc
root@In Namespace:~/blog# ls proc
1          dma          key-users      net            sysvipc
80         dri          kmsg           pagetypeinfo   timer_list
acpi       driver       kpagecount     partitions     timer_stats
asound     execdomains  kpageflags     sched_debug    tty
buddyinfo  fb           latency_stats  schedstat      uptime
bus        filesystems  loadavg        scsi           version
cgroups    fs           locks          self           version_signature
cmdline    interrupts   mdstat         slabinfo       vmallocinfo
consoles   iomem        meminfo        softirqs       vmstat
cpuinfo    ioports      misc           stat           zoneinfo
crypto     irq          modules        swaps
devices    kallsyms     mounts         sys
diskstats  kcore        mtrr           sysrq-trigger

Everything seems reasonable again. As expected, you get PID “1” for /bin/bash itself and “80” corresponds to the running “/bin/ls proc” command. Much nicer to read than usual /proc, isn’t it ? That’s why I love it.

If you attempt to run the same “mount” directly on “/proc” from the namespace, it will seem to work in the child but BREAK your main namespace: mounts are still shared with the host at this point, so the host’s “/proc” is covered too. Example:

jean-tiare@jeantiare-Ubuntu:~/blog$ ps aux
Error, do this: mount -t proc proc /proc

That’s all for the PID namespace. With the next article, we’ll be able to re-mount /proc itself and hence fix “top” and any similar tools without breaking the parent namespace. Thanks for reading !