Some weeks ago, I did an internal presentation on Docker. During the presentation, one of the ops asked a seemingly trivial question: is there anything like a “diet program for Docker images”?
You can find a couple of pretty decent, common-sense powered approaches on the web, like removing well-known cache folders and temporary files, uninstalling all superfluous packages, and flattening layers if not the full image. There are also the `-slim` variants of the official language images.
But, thinking about it, do we really need a full, consistent base Linux install? Which files do we really need in a given image? I found a radical and pretty efficient approach with a Go binary: statically built, with almost no external dependencies. Resulting image: 6.13MB.
Wow! Is there any chance of doing something comparable and deterministic with any arbitrary application?
It turns out there could be. The idea is simple: we could profile the image at run time, one way or another, to determine which files are ever accessed/opened/…, then remove all the remaining files. Hmm, sounds promising. Let’s PoC it.
Target definition:

- Start image: Ubuntu (~200MB)
- Application that MUST run: `/bin/ls`
- Goal: build the smallest possible image

`/bin/ls` is a good target: it is simple enough for a PoC, with no nasty behavior, but still not trivial since it uses dynamic linking.
Now that we have a target, let’s pick a tool. As this is a proof of concept, using dynamite where a hole puncher would be enough IS an option, as long as it does the job.
The base idea is to record all file accesses, be it a `stat` or an `open`. There are a couple of good candidates to help with the task. We could use inotify, but it is a pain to set up and watches need to be attached to every single file, which potentially means a *lot* of watches. We could use `LD_PRELOAD`, but 1/ it’s no fun to use, 2/ it won’t catch direct syscalls, and 3/ it won’t work with statically linked programs (who said Go’s?). A solution that would work even for statically linked programs would be to use `ptrace` to trace all syscalls in real time; it is also a pain to set up, but it would be a reliable and flexible option. A lesser known Linux syscall is `fanotify`. As the title suggests, this is the one we’ll go with[1].
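To make the `LD_PRELOAD` trade-off concrete, here is a minimal sketch of what such a shim could look like (the file name `trace-open.c` and the build line are mine, for illustration only): it intercepts glibc’s `open()` and logs the path, but a direct `syscall(SYS_open, …)` or a statically linked binary would sail right past it.

```c
// trace-open.c -- hypothetical LD_PRELOAD shim, sketch only.
// Build: gcc -shared -fPIC trace-open.c -o trace-open.so -ldl
// Use:   LD_PRELOAD=$PWD/trace-open.so ls
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/types.h>

typedef int (*open_fn)(const char *, int, ...);

int open(const char *path, int flags, ...) {
    // Look up the real open() from the next object in the link chain
    open_fn real_open = (open_fn)dlsym(RTLD_NEXT, "open");

    // Log the access; a real profiler would write to a file or socket
    fprintf(stderr, "open: %s\n", path);

    // Forward the optional mode argument only when O_CREAT requires it
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list args;
        va_start(args, flags);
        mode = (mode_t)va_arg(args, int);
        va_end(args);
    }
    return real_open(path, flags, mode);
}
```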
The `fanotify` syscall was originally implemented as a “decent” mechanism for anti-virus vendors to intercept file access events, potentially on a whole mountpoint at once. Sounds familiar? While it may be used to deny file accesses, it may also simply report file access events in a non-blocking fashion, potentially dropping[2] events if the kernel queue overflows. In this last case, a special message is generated to notify the user-land listener about the loss. This is exactly what I needed: non-intrusive, a whole mountpoint at once, simple setup (well, provided that you find the documentation, no comment…). The overflow part may seem anecdotal, but it has its importance, as I learned afterwards.
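As an aside, to make the “deny file accesses” part concrete: in `FAN_CLASS_CONTENT` mode, a `FAN_OPEN_PERM` event pauses the opening process until the listener answers. A rough sketch of that blocking flavor (my own illustration, verdict logic stubbed, error checks stripped):

```c
#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void) {
    // Content-inspection mode: events wait for our verdict
    int fan = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
    fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_MOUNT, FAN_OPEN_PERM, -1, "/");

    char buf[4096];
    ssize_t buflen;
    while ((buflen = read(fan, buf, sizeof(buf))) > 0) {
        struct fanotify_event_metadata *metadata = (struct fanotify_event_metadata*)buf;
        while (FAN_EVENT_OK(metadata, buflen)) {
            if (metadata->mask & FAN_OPEN_PERM) {
                // A real anti-virus would scan metadata->fd before answering
                struct fanotify_response resp = { .fd = metadata->fd, .response = FAN_ALLOW };
                write(fan, &resp, sizeof(resp));
            }
            close(metadata->fd);
            metadata = FAN_EVENT_NEXT(metadata, buflen);
        }
    }
}
```

But back to our PoC: we only need the non-blocking notification mode.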
Using it is fairly simple:

1/ Init fanotify in `FAN_CLASS_NOTIF`ication mode using the `fanotify_init` syscall:
```c
// Open ``fan`` fd for fanotify notifications. Messages will embed a
// filedescriptor on the accessed file. Expect it to be read-only.
fan = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);
```
2/ Subscribe to `FAN_ACCESS` and `FAN_OPEN` events on the “/” mountpoint (`FAN_MARK_MOUNT`) using the `fanotify_mark` syscall:
```c
// Watch open/access events on root mountpoint
fanotify_mark(
    fan,
    FAN_MARK_ADD | FAN_MARK_MOUNT, // Add mountpoint mark to fan
    FAN_ACCESS | FAN_OPEN,         // Report open and access events, non blocking
    -1, "/"                        // Watch root mountpoint (-1 is ignored for FAN_MARK_MOUNT type calls)
);
```
3/ `read` pending event messages from the filedescriptor returned by `fanotify_init` and iterate using `FAN_EVENT_NEXT`:
```c
// Read pending events from ``fan`` into ``buf``
buflen = read(fan, buf, sizeof(buf));

// Position cursor on first message
metadata = (struct fanotify_event_metadata*)&buf;

// Loop until we reach the last event
while(FAN_EVENT_OK(metadata, buflen)) {
    // Do something interesting with the notification.
    // ``metadata->fd`` will contain a valid, RO fd to the accessed file.

    // Close the opened fd, otherwise we'll quickly exhaust the fd pool.
    close(metadata->fd);

    // Move to next event in buffer
    metadata = FAN_EVENT_NEXT(metadata, buflen);
}
```
Putting it all together, we’ll print the full name of every accessed file and add queue overflow detection. This should be plain enough for our purpose (error checks stripped for the sake of this illustration):
```c
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(int argc, char** argv) {
    int fan;
    char buf[4096];
    char fdpath[32];
    char path[PATH_MAX + 1];
    ssize_t buflen, linklen;
    struct fanotify_event_metadata *metadata;

    // Init fanotify structure
    fan = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

    // Watch open/access events on root mountpoint
    fanotify_mark(
        fan,
        FAN_MARK_ADD | FAN_MARK_MOUNT,
        FAN_ACCESS | FAN_OPEN,
        -1, "/"
    );

    while(1) {
        buflen = read(fan, buf, sizeof(buf));
        metadata = (struct fanotify_event_metadata*)&buf;
        while(FAN_EVENT_OK(metadata, buflen)) {
            if (metadata->mask & FAN_Q_OVERFLOW) {
                printf("Queue overflow!\n");
                // Skip to the next event; an overflow event carries no fd
                metadata = FAN_EVENT_NEXT(metadata, buflen);
                continue;
            }

            // Resolve the path, using the automatically opened fd
            sprintf(fdpath, "/proc/self/fd/%d", metadata->fd);
            linklen = readlink(fdpath, path, sizeof(path) - 1);
            path[linklen] = '\0';
            printf("%s\n", path);

            close(metadata->fd);
            metadata = FAN_EVENT_NEXT(metadata, buflen);
        }
    }
}
```
To build it, use:

```bash
gcc main.c --static -o fanotify-profiler
```
We basically now have a tool to report any file access on the active ‘/’ mountpoint in real time. Good.
What now? Let’s create an Ubuntu container, start the recorder and run `/bin/ls`. `fanotify` requires the `CAP_SYS_ADMIN` capability. This is basically the “catch-all” root capability. Still better than running in `--privileged` mode, though.
```bash
# Run image
docker run --name profiler_ls \
    --volume $PWD:/src \
    --cap-add SYS_ADMIN \
    -it ubuntu /src/fanotify-profiler

# Run the command to profile, from another shell
docker exec -it profiler_ls ls

# Interrupt running image using
docker kill profiler_ls # You know, the "dynamite"
```
This should produce an output like:
```
/etc/passwd
/etc/group
/etc/passwd
/etc/group
/bin/ls
/bin/ls
/bin/ls
/lib/x86_64-linux-gnu/ld-2.19.so
/lib/x86_64-linux-gnu/ld-2.19.so
/etc/ld.so.cache
/lib/x86_64-linux-gnu/libselinux.so.1
/lib/x86_64-linux-gnu/libacl.so.1.1.0
/lib/x86_64-linux-gnu/libc-2.19.so
/lib/x86_64-linux-gnu/libc-2.19.so
/lib/x86_64-linux-gnu/libpcre.so.3.13.1
/lib/x86_64-linux-gnu/libdl-2.19.so
/lib/x86_64-linux-gnu/libdl-2.19.so
/lib/x86_64-linux-gnu/libattr.so.1.1.0
```
Awesome! It worked. We now know for sure what `/bin/ls` ultimately needs to run.
So we’ll just copy-paste-import all this into a `FROM scratch` Docker image and we’ll be done. Easy. Well, not so. But let’s do it to see for ourselves.
```bash
# Export base docker image
mkdir ubuntu_base
docker export profiler_ls | sudo tar -x -C ubuntu_base

# Create new image
mkdir ubuntu_lean

# Get the linker (trust me)
sudo mkdir -p ubuntu_lean/lib64
sudo cp -a ubuntu_base/lib64/ld-linux-x86-64.so.2 ubuntu_lean/lib64/

# Copy the files
sudo mkdir -p ubuntu_lean/etc
sudo mkdir -p ubuntu_lean/bin
sudo mkdir -p ubuntu_lean/lib/x86_64-linux-gnu/

sudo cp -a ubuntu_base/bin/ls ubuntu_lean/bin/ls
sudo cp -a ubuntu_base/etc/group ubuntu_lean/etc/group
sudo cp -a ubuntu_base/etc/passwd ubuntu_lean/etc/passwd
sudo cp -a ubuntu_base/etc/ld.so.cache ubuntu_lean/etc/ld.so.cache
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/ld-2.19.so ubuntu_lean/lib/x86_64-linux-gnu/ld-2.19.so
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libselinux.so.1 ubuntu_lean/lib/x86_64-linux-gnu/libselinux.so.1
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libacl.so.1.1.0 ubuntu_lean/lib/x86_64-linux-gnu/libacl.so.1.1.0
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libc-2.19.so ubuntu_lean/lib/x86_64-linux-gnu/libc-2.19.so
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libpcre.so.3.13.1 ubuntu_lean/lib/x86_64-linux-gnu/libpcre.so.3.13.1
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libdl-2.19.so ubuntu_lean/lib/x86_64-linux-gnu/libdl-2.19.so
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libattr.so.1.1.0 ubuntu_lean/lib/x86_64-linux-gnu/libattr.so.1.1.0

# Import it back to Docker
cd ubuntu_lean
sudo tar -c . | docker import - ubuntu_lean
```
Run the resulting image:
```bash
docker run --rm -it ubuntu_lean /bin/ls
```
And, Tadaaaaa:
```
# If you did not trust me with the linker (as it was already loaded when the
# profiler started, it does not show in the output)
no such file or directory
FATA[0000] Error response from daemon: Cannot start container f318adb174a9e381500431370a245275196a2948828919205524edc107626d78: no such file or directory

# Otherwise
/bin/ls: error while loading shared libraries: libacl.so.1: cannot open shared object file: No such file or directory
```
Well, not so… What went wrong? Remember when I said this syscall was primarily designed with anti-virus in mind? The real-time part of an anti-virus is supposed to detect that a file is being accessed, run some checks and take a decision. What matters here is the actual, real content of the file. In particular, filesystem races MUST be avoided at all costs. This is the reason why `fanotify` yields filedescriptors instead of access paths. Determining the underlying physical file is done by probing `/proc/self/fd/[fd]`. It does not tell you through which symlink the file was accessed, only which file it is.
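A tiny experiment makes this visible (the file names `data.real` and `data.link` are made up for the illustration, and error checks are stripped like everywhere else in this post): open a file through a symlink, then resolve the fd exactly like the profiler does. The symlink is nowhere to be seen.

```c
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Create a file and a symlink pointing at it
    close(open("data.real", O_WRONLY | O_CREAT, 0644));
    symlink("data.real", "data.link");

    // Access the file *through the symlink*
    int fd = open("data.link", O_RDONLY);

    // Resolve the fd the same way the profiler does
    char fdpath[32], path[PATH_MAX + 1];
    sprintf(fdpath, "/proc/self/fd/%d", fd);
    ssize_t len = readlink(fdpath, path, sizeof(path) - 1);
    path[len] = '\0';

    // Prints ".../data.real": the "data.link" step is lost
    printf("%s\n", path);
    close(fd);
    return 0;
}
```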
To make this work, we need to find all links to the reported files and install them in the filtered image as well. A `find` command like this one will do the job:
```bash
# Find all files referring to a given one
find -L -samefile "./lib/x86_64-linux-gnu/libacl.so.1.1.0" 2>/dev/null

# If you want to exclude the target itself from the results
find -L -samefile "./lib/x86_64-linux-gnu/libacl.so.1.1.0" -a ! -path "./lib/x86_64-linux-gnu/libacl.so.1.1.0" 2>/dev/null
```
This can easily be automated with a loop like:
```bash
for f in $(cd ubuntu_lean; find)
do
    (
        cd ubuntu_base
        find -L -samefile "$f" -a ! -path "$f"
    ) 2>/dev/null
done
```
This produces the list of missing symlinks, all of them libraries:
```
./lib/x86_64-linux-gnu/libc.so.6
./lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
./lib/x86_64-linux-gnu/libattr.so.1
./lib/x86_64-linux-gnu/libdl.so.2
./lib/x86_64-linux-gnu/libpcre.so.3
./lib/x86_64-linux-gnu/libacl.so.1
```
Let’s copy them from the source image too and re-create the destination image. (Yes, we could also have created them on the fly.)
```bash
# Copy the links
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libc.so.6 ubuntu_lean/lib/x86_64-linux-gnu/libc.so.6
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 ubuntu_lean/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libdl.so.2 ubuntu_lean/lib/x86_64-linux-gnu/libdl.so.2
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libpcre.so.3 ubuntu_lean/lib/x86_64-linux-gnu/libpcre.so.3
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libacl.so.1 ubuntu_lean/lib/x86_64-linux-gnu/libacl.so.1
sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libattr.so.1 ubuntu_lean/lib/x86_64-linux-gnu/libattr.so.1

# Import it back to Docker
cd ubuntu_lean
docker rmi -f ubuntu_lean; sudo tar -c . | docker import - ubuntu_lean
```
Warning: this method is limited. For example, it won’t return links to links, nor absolute links. The latter would require at least a chroot, or running the command in the source container itself, provided that `find` or an equivalent is present.
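For the “links to links” case, a minimal chain walker could look like the sketch below (my own illustration, not part of the original tooling): starting from a path, it prints each intermediate symlink until it reaches a regular file. Note how an absolute target gets resolved against the host’s “/”, which is exactly why an exported image tree would need a chroot.

```c
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

// Print every symlink in the chain leading from ``start`` to a real file.
// Sketch only: no loop detection, error checks stripped.
void walk_links(const char *start) {
    char cur[PATH_MAX + 1], target[PATH_MAX + 1], dir[PATH_MAX + 1];
    ssize_t len;
    snprintf(cur, sizeof(cur), "%s", start);

    // readlink() fails on anything but a symlink, ending the loop
    while ((len = readlink(cur, target, sizeof(target) - 1)) != -1) {
        target[len] = '\0';
        printf("%s -> %s\n", cur, target);

        if (target[0] == '/') {
            // Absolute target: resolved against the *host* root here
            snprintf(cur, sizeof(cur), "%s", target);
        } else {
            // Relative target: resolved against the link's directory.
            // dirname() may modify its argument, hence the copy.
            snprintf(dir, sizeof(dir), "%s", cur);
            snprintf(cur, sizeof(cur), "%s/%s", dirname(dir), target);
        }
    }
}

int main(int argc, char **argv) {
    if (argc > 1) walk_links(argv[1]);
    return 0;
}
```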
Run the resulting image:
```bash
docker run --rm -it ubuntu_lean /bin/ls
```
And, Tadaaaaa:
```
bin  dev  etc  lib  lib64  proc  sys
```
It works!™
Time is over, let’s measure:

- ubuntu: 209MB
- ubuntu_lean: 2.5MB
The resulting Docker image is 83.5 times smaller[3]. That’s a 98.8% reduction. Looks good to me; I’ll accept it, if you agree.
Last Thought
Like all profiling-based methods, it will only tell you what is actually done/used in a specific scenario. For example, try to run `/bin/ls -l` in the resulting image and see for yourself. (Spoiler: it does not work. Well, it does, but not as expected.)
The profiling technique itself is not without flaws. It does not detect how a file was opened, only which file it is. This is a problem for symlinks, especially cross-filesystem ones (read: cross-volume). With `fanotify`, we’ll completely miss the original symlink and break the application.
If I were to build a production shrinker, I would probably go for a `ptrace`-based method.
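For a taste of what that could look like, here is a bare-bones sketch (my own, x86_64-only, error checks stripped; it only watches `openat(2)`, where a real tool would also handle `open`, `execve`, `stat` and friends): fork, trace the child, and print the path argument of each `openat` exactly as the program passed it, symlinks included.

```c
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

// Copy a NUL-terminated string out of the tracee's memory, word by word
static void read_string(pid_t pid, unsigned long addr, char *out, size_t max) {
    size_t i = 0;
    while (i + sizeof(long) < max) {
        long word = ptrace(PTRACE_PEEKDATA, pid, addr + i, 0);
        memcpy(out + i, &word, sizeof(long));
        if (memchr(&word, 0, sizeof(long)))
            return;
        i += sizeof(long);
    }
    out[i] = '\0';
}

int main(int argc, char **argv) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: ask to be traced, then run the target command
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execvp(argv[1], argv + 1);
        return 1;
    }

    int status, entering = 1;
    char path[4096];
    waitpid(pid, &status, 0); // Wait for the post-execvp stop

    while (1) {
        // Resume until the next syscall entry or exit
        ptrace(PTRACE_SYSCALL, pid, 0, 0);
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, 0, &regs);

        // x86_64 ABI: syscall number in orig_rax, openat's path in rsi
        if (entering && regs.orig_rax == SYS_openat) {
            read_string(pid, regs.rsi, path, sizeof(path));
            printf("openat: %s\n", path);
        }
        entering = !entering;
    }
    return 0;
}
```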
Footnotes
1. Let’s face the truth: what I really wanted was to experiment with this syscall. Docker images are more of a (good) pretext.
2. Actually, one could use `FAN_UNLIMITED_QUEUE` when calling `fanotify_init` to remove this limitation, provided that the calling process has at least the `CAP_SYS_ADMIN` capability.
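   For instance (sketch, the rest of the profiler staying as above):

   ```c
   // Ask the kernel not to cap the event queue (requires CAP_SYS_ADMIN)
   fan = fanotify_init(FAN_CLASS_NOTIF | FAN_UNLIMITED_QUEUE, O_RDONLY);
   ```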
3. That’s also 2.4 times smaller than the 6.13MB image I mentioned at the beginning of this post. But the comparison is not fair.