Python and Docker: Using Minimalist Distributions

Recently I've been working on a Dockerized version of the automated regression suite for one of our company's products. I see several potential benefits from using containers over a manually configured box:

  1. No more dependency/configuration conflicts. Docker containers go a step beyond Python's virtualenv in isolating the entire OS, so everything is tailor-made for the task at hand and never changes.
  2. Never configure another test environment. If you've got Docker and access to the test environments, you can run my suites out of the box. That intricately configured Linux box with all our stuff on it could explode right before a hotfix goes out and it would have zero impact on my ability to run regression. Our stuff is all in repositories, floating above our heads in the ether, Wonkavision style. That's, uh, pretty awesome.
  3. I can build a self-sustaining development environment. Modern IDEs like PyCharm can interface with Docker containers serving Python via Remote Interpreter functionality. This means I can maintain a single image - in one place - with all the desired dependencies baked in, so engineers can awake to a fresh, up-to-date container hot off our registry every morning. Imagine how quick environment setup would be for new hires! "Uh, install Docker."

I have yet to try out #3, but I'll cover it in a later discussion. Today we're going to talk about my experience containerizing the test suite, and the obsession that followed.

Size Matters

When I left Docker Tutorial University with my "I can install Docker" degree, my first instinct was to grab for the safety beer - a Debian or Ubuntu or something - and get to cloning my repository. This was cool for about 5 minutes, until I realized my Docker images looked something like this:

 

Smarter people than me have already realized that cramming a full OS - made up almost entirely of things your logic will never use - onto an image is a gross misuse of resources. Over the past few years some minimalist Linux builds have arrived.

Alpine

Alpine is a super stripped-down build of Linux, clocking in around 117MB with Python minus all compile-time dependencies (see Nick Janetakis's post for the how-to). It uses the apk package manager, which really just amounts to:

apk add mydependency
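In a Dockerfile, that one-liner might end up looking something like the sketch below. The image tag, the /suite directory, requirements.txt, and the pytest entry point are all assumptions for illustration, and bash is just standing in for mydependency.

```dockerfile
# Sketch only: tag, paths, and file names below are assumptions.
FROM python:3.12-alpine

WORKDIR /suite

# --no-cache makes apk fetch the package index on the fly
# instead of storing a copy of it in the image.
RUN apk add --no-cache bash

# Hypothetical requirements file for the test suite.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the suite itself last so code changes don't invalidate the layers above.
COPY . .

# Assumes pytest is listed in requirements.txt.
CMD ["python", "-m", "pytest"]
```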

Docker has officially adopted Alpine over Ubuntu as its distribution of choice, but this may be due more to the close professional relationship between Docker and the Alpine developer Natanael Copa than any particular deficiencies on the part of the more ubiquitous distributions. I'll leave the religious wars to the zealots.

Debian Jessie (Slim)

Debian is a completely open-source distribution of Linux, and the "slim" variant strips many non-essential packages. Including Python and stripped of compile-time dependencies, it sits at about 150MB (Dave Beckett has a great how-to if you want to make that number a reality).
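For comparison, here is a rough sketch of the same idea on a Debian slim base. Note the assumptions: python:3.12-slim is built on a newer Debian release than Jessie, curl is only a stand-in for whatever system package you need, and requirements.txt is a hypothetical file.

```dockerfile
# Sketch only: tag and file names below are assumptions.
FROM python:3.12-slim

WORKDIR /suite

# Install without "recommended" extras, then drop the apt package lists
# in the same RUN so they never land in a layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical requirements file for the test suite.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```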

While these distributions are rail-thin, installing the dependencies for your project tends to fatten them up at an alarming rate. Because Docker images are built in layers, developers have been coming up with all sorts of sneaky ways to slim down their images, like boxers on weigh-in day.
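One common weigh-in trick, sketched here for Alpine, is to install the compile-time packages, build, and remove them again inside a single RUN. The .build-deps name is an arbitrary label, and requirements.txt and the image tag are assumptions.

```dockerfile
# Sketch only: tag and file names below are assumptions.
FROM python:3.12-alpine

COPY requirements.txt .

# Everything happens in one RUN, so the compilers exist only while this
# layer is being built. Deleting them in a later RUN would not shrink the
# image, because the earlier layer would still contain them.
RUN apk add --no-cache --virtual .build-deps gcc musl-dev \
    && pip install --no-cache-dir -r requirements.txt \
    && apk del .build-deps
```

Anything your packages need at runtime (shared libraries, for example) has to be installed outside the .build-deps group, or it disappears along with the compilers.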

Dependency Hell

Tiny distributions come with a cost. You are essentially stripping yourself of all maps, compasses, and GPS, so if you're not comfortable in Linux-land, you will get lost in a hurry. Your dependencies may have their own dependencies, and you are responsible for building everything from the ground up. Here are a few notes to help avoid some of the biggest pitfalls. Note: Docker pitfalls are outside this discussion, but can be found all over the web.

unable to execute 'gcc': No such file or directory

This was one of the gimmes. Make sure you install gcc before attempting to install something that needs it to build.

Problems unzipping files

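The version of unzip that Alpine and Jessie (Slim) come with is not the fully-featured unzip, and cannot handle things like wildcards and multiple files. Make sure you install the full unzip package from the package manager to solve this problem (on Alpine, apk add --no-cache unzip, where --no-cache is an apk option that keeps the package index out of the image).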
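As a rough sketch on Alpine, with a hypothetical archives/ directory of zip files in the build context and /opt/tools as an assumed destination:

```dockerfile
# Sketch only: tag and paths below are assumptions.
FROM python:3.12-alpine

# Replaces the stripped-down unzip with the full package.
RUN apk add --no-cache unzip

# Hypothetical directory of zip files in the build context.
COPY archives/ /tmp/archives/

# The quotes keep the shell from expanding the wildcard,
# so unzip itself matches every archive in the directory.
RUN unzip -o '/tmp/archives/*.zip' -d /opt/tools
```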

Pycryptodome and cryptography fail to install

I use a library called cx_Oracle to speak with Oracle databases via Python, and boy was this a fun time! I probably lost a whole weekend to this little "learning opportunity". Ubuntu with libffi-dev and libssl-dev got me where I needed to be, but Alpine and Jessie were obstinately refusing to install it.

openssl/opensslv.h: No such file or directory (Alpine) 

This is rooted in the dependency openssl-dev (libssl-dev on some distributions). Add this to the packages you are installing. Note: make sure you install it after the compile-time dependencies are removed, or it will look like it never installed!

limits.h and stdio.h: No such file or directory (Alpine)

I was able to put the world back in order by using apk to add musl-dev and libffi-dev.
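Pulling the fixes above together for Alpine, a sketch might look like this. The mapping in the comments follows the errors as I hit them; the image tag is an assumption, and on newer images pip may find prebuilt wheels and never need the compiler at all.

```dockerfile
# Sketch only: the tag is an assumption.
FROM python:3.12-alpine

# gcc                   -> unable to execute 'gcc': No such file or directory
# openssl-dev           -> openssl/opensslv.h: No such file or directory
# musl-dev, libffi-dev  -> limits.h and stdio.h: No such file or directory
RUN apk add --no-cache gcc openssl-dev musl-dev libffi-dev

RUN pip install --no-cache-dir pycryptodome cryptography
```

On Ubuntu the equivalent packages were libffi-dev and libssl-dev, as mentioned earlier.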

Once everything is up and running, you theoretically never need to touch the Dockerfile again. I'd say the minimal footprint, isolated environment, and out-of-the-box functionality for my test suite were well worth the time it took me to get it off the ground.