Introduction to Linux namespaces – Part 5: NET

Jan 19, 2014   #linux  #namespace 

Following the previous post on the PID namespace (Restart process numbering to “1”), would you like to go further and get even closer to full-featured VMs ? Great ! The last two posts of this series focus precisely on this: isolating network interfaces with the “NET” namespace (yes, really) and user/group identifiers for even more transparency. If you haven’t done so already, I encourage you to read the first post of this series for an introduction to the Linux namespace isolation mechanism.

[EDIT 2014-01-08] A Chinese translation of this post is available here

For once we won’t start with the addition of the “CLONE_NEWNET” flag to the “clone” syscall. I’ll keep it for later. For now, IMHO, the best way to get started with this namespace is the incredibly mighty “iproute2” net-admin Swiss Army knife. If you don’t have it (yet) I highly encourage you to install it. Nonetheless, if you don’t want to or can’t, you may as well skip the explanation part and go straight to the full code sample.

First, let’s see what network interfaces we have at the moment:

ip link list

Which outputs something like:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT qlen 1000
    link/ether **:**:**:**:**:** brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DORMANT qlen 1000
    link/ether **:**:**:**:**:** brd ff:ff:ff:ff:ff:ff
# ...

Nothing unexpected here. I have a working loopback, UP (Yeah, ‘UNKNOWN’ means ‘UP’…), and I am connected to my wireless network. A couple of extra connections have been omitted for this article.

Now, let’s create a network namespace and run the same from inside:

# create a network namespace called "demo"
ip netns add demo
# exec "ip link list" inside the namespace
ip netns exec demo ip link list

Output is now:

1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Huh, not only is there just a loopback, but it is also “DOWN”. Even more interesting, it is fully isolated from the main loopback. That is to say, any application inside the namespace binding on “the” loopback would only be able to communicate with applications inside the same namespace. Exactly the same level of isolation as with the IPC namespace. Neat, isn’t it ?

Right, but how do I communicate with the interwebz now ?

There are multiple solutions. The easiest and most common one is to create a point-to-point tunnel between your “Host” and “Guest” systems. Once again, the Linux kernel provides multiple alternatives. I recommend using “veth” interfaces, as these are the best integrated in the ecosystem, especially with iproute2. This is also an extremely well tested piece of code, as it is used by LXC and actually comes from the OpenVZ project. Another alternative could be the “etun” driver. It is conceptually the same under another name, but I’m not aware of any project using it.

Both “veth” and “etun” create a pair of virtual interfaces, linked to each other, in the current namespace. You can then pick one and move it into the target namespace to get a communication channel. You could think of them as entangled particles if it makes it easier to understand ;).

The next step is to give them an IP, set them up and ping ! Here is an example bash session doing just that:

# Create a "demo" namespace
ip netns add demo

# create a "veth" pair
ip link add veth0 type veth peer name veth1

# and move one to the namespace
ip link set veth1 netns demo

# configure the interfaces (up + IP)
ip netns exec demo ip link set lo up
ip netns exec demo ip link set veth1 up
ip netns exec demo ip addr add 169.254.1.2/30 dev veth1
ip link set veth0 up
ip addr add 169.254.1.1/30 dev veth0

That’s it ! Nothing scary.

If you need to get Internet access from the “guest” system using the “veth” technique, you could set up masquerading, commonly known as “NAT”. In the same way, to make a webserver listening on port 80 inside the namespace appear to listen directly on the main interface, one could use “DNAT”, commonly known as “port forwarding”. I’ll leave the details up to the reader.

Here is a basic example to quickly get started:

# make sure ip forwarding is enabled
echo 1 > /proc/sys/net/ipv4/ip_forward
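# enable Internet access for the namespace, assuming you ran the previous example
# ("-i" is rejected in the POSTROUTING chain, so match on the source subnet instead)
iptables -t nat -A POSTROUTING -s 169.254.1.0/30 -j MASQUERADE
# the namespace also needs a default route through the host end of the pair
ip netns exec demo ip route add default via 169.254.1.1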
# Forward main ":80" to guest ":80"
iptables -t nat -A PREROUTING -d <your main ip>/32 -p tcp --dport 80 -j  DNAT --to-destination  169.254.1.2:80

Now let’s put it all together and finally append the CLONE_NEWNET flag to the clone syscall. For the sake of simplicity we’ll stick with direct calls to “ip” using the system() library function.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>

#define STACK_SIZE (1024 * 1024)

// sync primitive
int checkpoint[2];

static char child_stack[STACK_SIZE];
char* const child_args[] = {
  "/bin/bash",
  NULL
};

int child_main(void* arg)
{
  char c;

  // init sync primitive
  close(checkpoint[1]);

  // setup hostname
  printf(" - [%5d] World !\n", getpid());
  sethostname("In Namespace", 12);

  // remount "/proc" to get accurate "top" && "ps" output
  mount("proc", "/proc", "proc", 0, NULL);

  // wait for network setup in parent
  read(checkpoint[0], &c, 1);

  // setup network
  system("ip link set lo up");
  system("ip link set veth1 up");
  system("ip addr add 169.254.1.2/30 dev veth1");

  execv(child_args[0], child_args);
  printf("Ooops\n");
  return 1;
}

int main()
{
  // init sync primitive
  pipe(checkpoint);

  printf(" - [%5d] Hello ?\n", getpid());

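  int child_pid = clone(child_main, child_stack+STACK_SIZE,
      CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD, NULL);
  if (child_pid == -1) {
    // most likely EPERM: creating these namespaces requires root
    perror("clone");
    return 1;
  }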

  // further init: create a veth pair
  char* cmd;
  asprintf(&cmd, "ip link set veth1 netns %d", child_pid);
  system("ip link add veth0 type veth peer name veth1");
  system(cmd);
  system("ip link set veth0 up");
  system("ip addr add 169.254.1.1/30 dev veth0");
  free(cmd);

  // signal "done"
  close(checkpoint[1]);

  waitpid(child_pid, NULL, 0);
  return 0;
}

Let’s give it a test run !

jean-tiare@jeantiare-Ubuntu:~/blog$ gcc -Wall main.c -o ns && sudo ./ns
 - [22094] Hello ?
 - [    1] World !
root@In Namespace:~/blog$ # run a super-powerful server, fully isolated
root@In Namespace:~/blog$ nc -l 4242
Hi !
Bye...
root@In Namespace:~/blog$ exit
jean-tiare@jeantiare-Ubuntu:~/blog$ # done !

This is what you would have seen if, from another terminal, you had run:

jean-tiare@jeantiare-Ubuntu:~$ nc 169.254.1.2 4242
Hi !
Bye...
jean-tiare@jeantiare-Ubuntu:~$

To go further on the path to network virtualization, you could have a look at the other interface types available in the Linux kernel: macvlan, vlan, vxlan, …

If you feel that running a bunch of system() calls in a production system is a dirty hack (and it is !), you could have a look at the rtnetlink kernel communication interface. This is the barely documented API used by iproute2 under the hood.

That’s all for the “NET” namespace. It’s so powerful that it’s used as the foundation of the “CORE” lightweight network simulator. In the next article we’ll explore the last and trickiest namespace: “USER”. Thanks for reading !