Robert Važan

Caching cargo in containers

Following my success with caching apt-get in containers, I wanted to do the same with cargo and rustup. Specifically, I aimed to create a localhost HTTP cache that containers can use instead of the official servers. Both cargo and rustup resist caching, so the path to success wasn't straightforward, but in the end I was able to cache both effectively.

Why not volumes?

The usual advice you get for caching cargo and rustup downloads locally is to just share volumes across all containers. That however completely kills isolation. As far as I know, there is no hashing or signing in cargo or rustup at the moment, so a compromised container can infect all other containers via the shared cache. Accidental cache damage will also spread to other containers. So while volume sharing is indeed simple and effective, the loss of security and isolation makes it rather unappealing.
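
For reference, the shared-volume setup I am advising against is a one-liner (my-rust-image is a placeholder):

# Every container mounts the same registry cache and can tamper with it.
podman run -it --rm -v cargo-registry:/root/.cargo/registry my-rust-image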

Standard mirroring software

Both Panamax and ROMT mirror cargo crates as well as rustup binaries. The trouble is that these tools expect you to mirror everything. They cannot fetch required files lazily. That limits their application to a few massive companies plus official public mirrors. They are useless for local caching. ROMT can technically be configured to mirror only a subset of crates, but that is laborious to set up and you have to redo it every time your subset changes.

Initial resistance

Cargo cannot simply be fronted by a reverse HTTP proxy. Older versions of cargo used git for the index, which is not exactly your typical CDN content. Cargo seems to still use git for private registries, but the public one has switched to what they call a sparse index, which is just a set of cacheable files. That did not quite solve the caching problem though. There are two separate subdomains: index.crates.io for the index and static.crates.io for crates. The index subdomain has the URL of the crate subdomain hardcoded in its /config.json like this:

{
  "dl": "https://static.crates.io/crates",
  "api": "https://crates.io"
}

Caching cargo crates locally under one port therefore requires three URL matchers: one for config.json, one for the index, and one for the crates. A working configuration is shown later in this article, but I found it rather surprising that I cannot just front the whole thing with a generic HTTP proxy.

Things are even worse with rustup. Rustup lets you specify a mirror via the RUSTUP_DIST_SERVER and RUSTUP_UPDATE_ROOT environment variables, but the default setup script from rustup.rs insists on HTTPS access to the mirror, which is of course very impractical for a localhost cache. Both Panamax and ROMT therefore ask you to bypass sh.rustup.rs and download the platform-specific rustup-init binary directly. That however looks ugly, and it unnecessarily ties your Containerfile to one platform.
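
For illustration, the bypass looks roughly like this in a Containerfile (the mirror URL is a placeholder; note the hardcoded architecture):

# Download rustup-init directly from the mirror; tied to x86_64 Linux.
RUN curl -sSf http://my-mirror:8080/rustup/dist/x86_64-unknown-linux-gnu/rustup-init \
        -o rustup-init && \
    chmod +x rustup-init && \
    ./rustup-init -y && \
    rm rustup-init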

I find this resistance to caching perplexing. Maybe the Rust project has lots of free mirrors with plentiful bandwidth? Even if bandwidth is not a problem, I still want cargo to be fast and reliable. I also want resiliency to network disruptions.

Cache rustup

Before describing cargo caching, let's briefly cover the simpler rustup case. I solved this by just installing Ubuntu packages for rust and cargo, which come from the previously configured apt-get cache. Ubuntu packages have the downside of being several months behind the current version. As I am not using cutting-edge features myself, the only nuisance I encounter with this setup is that I have to set an upper version bound on crates that eagerly require the latest Rust without incrementing their major or minor version.
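
For example, when a crate starts requiring the latest Rust in a patch release, I add a bound like this in Cargo.toml (crate name and versions are made up):

[dependencies]
# Newer patch releases require a more recent Rust than Ubuntu ships.
some-crate = ">=0.5.3, <0.5.8"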

Another option is to execute the rustup.rs script early in the Containerfile to maximize the probability that its layer will be shared among related container images. Rust is rarely upgraded in a running container, so if you can keep a consistent preamble in the Containerfile of all your container images, the layer cache will be sufficient. This is a good option if you absolutely must have the latest version of rust and cargo.
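
A minimal sketch of such a preamble, assuming an Ubuntu base image:

FROM docker.io/library/ubuntu:24.04

# Keep this preamble identical across images, so that the layer cache is shared.
RUN apt-get update && \
    apt-get install -y curl ca-certificates && \
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:$PATH"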

The third and final option is to point a simple reverse HTTP proxy at static.rust-lang.org and include this code in your Containerfile:

# Override these at build time to point at your reverse proxy.
ARG RUSTUP_DIST_SERVER=https://static.rust-lang.org
ARG RUSTUP_UPDATE_ROOT=https://static.rust-lang.org/rustup

# Strip the HTTPS-only restriction from curl calls inside the script
# and install non-interactively.
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sed "s/--proto '=https' //g" | sh -s -- -y

Notice how we use sed to strip the HTTPS-only requirement from the curl parameters inside the script. You can then point RUSTUP_DIST_SERVER and RUSTUP_UPDATE_ROOT to your reverse proxy at build time. I didn't test this third option though, so you might have to tweak it a bit.
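
If you do try it, the reverse proxy itself can be a trimmed-down variant of the cargo cache configuration shown later in this article, for example with the rustup cache listening on port 3264:

server {
    listen 3264;

    # Requires the same resolver and proxy settings as the cargo cache below.
    location / {
        proxy_ssl_name static.rust-lang.org;
        proxy_set_header Host static.rust-lang.org;
        set $upstream https://static.rust-lang.org;
        proxy_pass $upstream;
    }
}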

With rustup caching solved, we can now look at caching cargo downloads.

Choose HTTP cache server

It's surprisingly hard to find a good HTTP cache server these days. Squid in accelerator mode is difficult to configure. Varnish does not have built-in support for HTTPS backends. Caddy has an experimental cache module that is poorly documented and that does not seem to persist the cache index, as far as I can tell. I have therefore settled on nginx, which seems to be the only reasonable option for our needs. I am not particularly happy about software with a Russian background and lots of vulnerable C code, but the risks are minimal in a single-purpose containerized cache.

Configure nginx

Let's configure nginx then. The configuration is a bit long, because it covers the three routes mentioned above (config.json, index, and crates) plus nginx boilerplate. It creates a lazy cargo cache listening on port 3263. Revalidation must be enabled for cargo to notice new versions of crates. I went the extra mile to add hit/miss logging for all requests, so that you can verify the cache is working.

# Disable default access log to avoid duplicate logging.
access_log off;

# Cache for crate downloads.
proxy_cache_path /var/cache/nginx/crates levels=1:2 keys_zone=crates_cache:10m max_size=10g inactive=400d use_temp_path=off;
# Cache for the package index.
proxy_cache_path /var/cache/nginx/index levels=1:2 keys_zone=index_cache:10m max_size=10g inactive=400d use_temp_path=off;

# Use a resolver for runtime DNS lookups, as upstream IPs can change.
# Force IPv4-only to avoid IPv6 connection attempts on hosts without IPv6.
resolver 1.1.1.1 9.9.9.9 ipv6=off valid=300s;

# Map cache status to a printable string (defaults to "-").
map $upstream_cache_status $cache_status {
    default "-";
    HIT "HIT";
    MISS "MISS";
    BYPASS "BYPASS";
    EXPIRED "EXPIRED";
    STALE "STALE";
    UPDATING "UPDATING";
    REVALIDATED "REVALIDATED";
}

# Map upstream status to a printable string (defaults to "-").
map $upstream_status $upstream_code {
    default "-";
    ~^[0-9]+$ $upstream_status;
}

# Access log format that includes cache status and upstream status.
log_format cargo_cache '$status $upstream_code $cache_status "$request" $body_bytes_sent';

# Send logs to container stdout/stderr so Podman/Journald captures them.
error_log /dev/stderr warn;

server {
    listen 3263;

    # Enable our custom access log only within this server block.
    access_log /dev/stdout cargo_cache;

    # Common proxy settings.
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_ssl_server_name on;
    proxy_ssl_verify on;
    proxy_ssl_trusted_certificate /etc/ssl/certs/ca-certificates.crt;
    add_header X-Proxy-Cache $upstream_cache_status always;

    # Enable conditional revalidation using If-Modified-Since and ETag.
    proxy_cache_revalidate on;
    # Cache 200 responses. Other responses will not be cached.
    proxy_cache_valid 200 400d;

    # Serve a custom config.json that points cargo to this cache for downloads.
    location = /config.json {
        default_type application/json;
        return 200 '{"dl": "http://127.0.0.1:3263/crates", "api": "https://crates.io"}\n';
    }

    # Proxy crate downloads to static.crates.io.
    location /crates/ {
        proxy_ssl_name static.crates.io;
        proxy_set_header Host static.crates.io;
        proxy_cache crates_cache;
        # Use a variable to trigger the resolver at runtime.
        set $upstream https://static.crates.io;
        proxy_pass $upstream;
    }

    # Proxy index requests to index.crates.io.
    location / {
        proxy_ssl_name index.crates.io;
        proxy_set_header Host index.crates.io;
        proxy_cache index_cache;
        # Use a variable to trigger the resolver at runtime.
        set $upstream https://index.crates.io;
        proxy_pass $upstream;
    }
}

This configuration is added to the standard nginx image:

# Customize standard nginx image.
FROM docker.io/library/nginx:stable

# Copy runtime configuration.
COPY nginx.conf /etc/nginx/conf.d/default.conf

# The port the cache listens on.
EXPOSE 3263

We have to build the image before we can use it:

podman build -t localhost/cargo-cache .

Configure systemd

We will use podman's systemd integration (quadlet) to run the image under systemd. Notice that there are two volumes, one for the crate cache and one for the index cache. We don't want to create a volume for the whole /var/cache/nginx directory, because the standard nginx image puts other things there that we don't want to persist.

[Unit]
Description=Nginx-based cache for crates.io
After=network-online.target
Wants=network-online.target

[Container]
Image=localhost/cargo-cache
ContainerName=cargo-cache
LogDriver=journald
PublishPort=127.0.0.1:3263:3263
Volume=cargo-cache-crates:/var/cache/nginx/crates:Z
Volume=cargo-cache-index:/var/cache/nginx/index:Z

[Service]
Restart=always

[Install]
WantedBy=default.target

Save the above service configuration as ~/.config/containers/systemd/cargo-cache.container and start it using the commands below. I prefer running everything in rootless podman containers under an unprivileged user. To make the container start even before the user logs in, we will enable lingering for the user.

systemctl --user daemon-reload
systemctl --user restart cargo-cache
sudo loginctl enable-linger $USER
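
You can check that the cache is up by fetching the synthesized config.json:

curl -s http://127.0.0.1:3263/config.json
{"dl": "http://127.0.0.1:3263/crates", "api": "https://crates.io"}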

Configure the application container

We will prepare the application Containerfile for caching by including a short cargo cache setup:

ARG CARGO_MIRROR=""
# If a mirror is specified at build time, redirect the crates-io registry to it.
RUN if [ -n "$CARGO_MIRROR" ]; then \
    mkdir -p ~/.cargo && \
    echo '[source.crates-io]' > ~/.cargo/config.toml && \
    echo "registry = 'sparse+$CARGO_MIRROR/'" >> ~/.cargo/config.toml; \
    fi
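
With CARGO_MIRROR pointing at the localhost cache, the generated ~/.cargo/config.toml will contain:

[source.crates-io]
registry = 'sparse+http://127.0.0.1:3263/'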

Note that this supports full mirrors as well as lazy caches. If the CARGO_MIRROR parameter is not specified during the build, the container will download crates from the official cargo servers. This way, the Containerfile can be placed in a public repository without breaking builds for people who do not have the cache set up.

To enable caching, we will point CARGO_MIRROR to our localhost cache. In addition to the parameter, we also have to forward the cache's port into the container, during both build and execution:

podman build \
    -t localhost/cargo-cache-test \
    --build-arg CARGO_MIRROR=http://127.0.0.1:3263 \
    --network=pasta:-T,3263 .
podman run -it --rm \
    --network=pasta:-T,3263 \
    localhost/cargo-cache-test

If you are also using my apt-get cache, then the complete setup looks like this:

podman build \
    -t localhost/cargo-cache-test \
    --build-arg APT_PROXY=http://127.0.0.1:3142 \
    --build-arg CARGO_MIRROR=http://127.0.0.1:3263 \
    --network=pasta:-T,3142,-T,3263 .
podman run -it --rm \
    --network=pasta:-T,3142,-T,3263 \
    localhost/cargo-cache-test
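
Inside the test container, you can then exercise the cache with a small test project. The home crate here is just an example dependency:

cargo new hello && cd hello
cargo add home
cargo build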

You can now use cargo during image build and during container execution. Cargo downloads will be rerouted through nginx, which will cache them. You should see logs like this when you run journalctl --user -u cargo-cache:

200 - - "GET /config.json HTTP/1.1" 67
200 - MISS "GET /3/p/png HTTP/1.1" 4104
200 200 MISS "GET /crates/home/0.5.11/download HTTP/1.1" 9926
...
200 304 REVALIDATED "GET /3/p/png HTTP/1.1" 4104
304 - HIT "GET /ho/me/home HTTP/1.1" 0
200 - HIT "GET /crates/home/0.5.11/download HTTP/1.1" 9926

Header-only revalidation is the best you can get for the index while still making new versions visible to cargo. A plain hit (without revalidation) is what you should see for immutable crates.