A Tale of Two /tmp Bugs

Part 1: Even maildirs need their personal space

I did a migration of my mail server over the last weekend and I ran into a strange error when I brought things back online. My system was receiving mail fine, but it was being deferred in the postfix queue with the message:

Command output: /usr/bin/maildrop: Unable to open mailbox.

This was very odd, because the bulk of the files copied over to my new mail server VM were in fact mailboxes; I had double-verified the copy before I shut down the old VM. Yet all my mails were being queued due to this error.

It was late at night when I finished the VM copy and everything else was working fine, so I left it until morning. When I started to work on it again the next day, I noticed something odd: not all of the users on the system were being affected - my wife's email and my dedicated account for my phone were working fine. So it was time to put on my Mark Watney pants and start digging into the technical detail.

A lot of the hits I found when doing my initial web searches pointed to various permissions problems on the mailboxes themselves, which I had already ruled out. One post even suggested:

Then the mailbox does not exist or has the wrong permissions. The simplest solution is to delete the mailbox and create it again.

Um, not gonna happen; some of my mail folders have gigabytes of emails. My interest in digging through forums with advice like that quickly faded. At this point, I spent several hours reading over the postfix maildrop howto to confirm my setup was correct, and fiddling with chroot and suid settings in /etc/postfix/main.cf and /etc/postfix/master.cf, even though I knew they had been working prior to the move. I ended up reverting basically all of the changes I made during that period.
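For the record, the setup that howto describes boils down to a pipe(8) transport entry in master.cf plus one main.cf setting; roughly this shape (the delivery user and flags vary by installation, so treat it as a sketch rather than my exact config):

```
# /etc/postfix/master.cf -- hand local deliveries to maildrop
maildrop  unix  -       n       n       -       -       pipe
  flags=DRhu user=vmail argv=/usr/bin/maildrop -d ${recipient}

# /etc/postfix/main.cf -- maildrop can only deliver to one recipient at a time
maildrop_destination_recipient_limit = 1
```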

As the troubleshooting progressed, I noticed another strange data point: some of my emails were being delivered. I couldn't see any obvious pattern with which ones were failing and which were succeeding, but I knew there had to be one.

It was time to pull out a bigger gun: strace. For those not familiar with it, strace shows all of the system calls (entrypoints into the kernel) that a process makes as it runs. I waited until the mail server wasn't receiving anything, then ran strace on the running postfix master process:

strace -f -p $(pidof master) |& tee postfix-flush.log
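strace on a busy master produces a torrent of interleaved output, so it helps to boil the log down to just the failing syscalls. A small helper along these lines (a sketch; the function name is mine, and it assumes a log captured as above):

```shell
# Summarise failing syscalls from an strace log, most frequent first.
# Matches lines like:  stat("/some/path", ...) = -1 ENOENT (...)
summarise_failures() {
        grep -oE '[a-z_]+\(.*= -1 E[A-Z]+' "$1" |
                sed -E 's/\(.*= -1 /: /' |
                sort | uniq -c | sort -rn
}
```

Running `summarise_failures postfix-flush.log` prints each failing syscall/errno pair with a count, which makes a repeated ENOENT jump out immediately.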

Because there were quite a few mails backed up, I ended up with output from a number of different child processes interspersed, but eventually I came upon my smoking gun:

[pid 21070] stat("/home/spam/Maildir/tmp", 0x7fff88ea8280) = -1 ENOENT (No such file or directory)
[pid 21070] stat("/home/spam/Maildir", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0

The maildrop process was looking for a directory called tmp inside the user's Maildir, and it wasn't there. And then I remembered: when I copied the files across from my old VM to my new VM, I excluded the directory /tmp. But because I was doing this from within the VM's file system mounted from the host, I used relative directory names. So rsync dutifully ignored exactly what I told it to: every file and directory with the name tmp.
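The difference is easy to reproduce: an exclude pattern without a leading slash matches that name at any depth, while an anchored pattern matches only at the root of the transfer. A throwaway demonstration (paths are scratch directories, not my real layout):

```shell
# Build a miniature tree with a root /tmp and a Maildir/tmp inside it
src=$(mktemp -d); rel=$(mktemp -d); anch=$(mktemp -d)
mkdir -p "$src/home/user/Maildir/cur" "$src/home/user/Maildir/tmp" "$src/tmp"

# Relative pattern: excludes EVERY path component named "tmp"
rsync -a --exclude tmp "$src/" "$rel/"

# Anchored pattern: excludes only the transfer root's /tmp
rsync -a --exclude /tmp "$src/" "$anch/"

ls "$rel/home/user/Maildir"    # no tmp: this mailbox is now broken
ls "$anch/home/user/Maildir"   # tmp survives
```

The anchored form is what I should have used; the relative form is what quietly broke every Maildir.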

A quick check of the old VM's file system confirmed that every user Maildir had a tmp folder, and its absence on the new VM was causing maildrop to consider every Maildir missing. Unfortunately, maildirmake is non-idempotent, so I couldn't just re-run it on every Maildir. Instead, a short bash script took care of things:

cd /home
find */Maildir -type d -name cur | \
        sed -e 's/\/cur$/\/tmp/' | \
        while read -r dir; do
                mkdir "$dir"
                # mkdir runs as root; the tmp dir must belong to the user
                chown "${dir%%/*}" "$dir"
        done
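Once that had run, I wanted to be sure nothing was missed: every folder with a cur directory should have a tmp sibling. A quick checker along these lines (the function name is mine; point it at wherever your maildirs live):

```shell
# Report any maildir folder that has cur but is still missing tmp
check_maildirs() {
        find "$1" -type d -name cur | while read -r d; do
                [ -d "${d%/cur}/tmp" ] || echo "missing: ${d%/cur}/tmp"
        done
}
```

`check_maildirs /home` printing nothing means every mailbox is whole again.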

(At this point I did go looking for the maildrop web site to see if I could submit a patch which would make the error message less generic, but it is hosted at SourceForge and I couldn't get a git clone to work immediately, so I gave up. I probably should come back to this, but I fear that the error is probably so vague because it is issued at the end of a large block of code which could have multiple reasons for failing, and restructuring 20-year-old C++ code is not a thing that brings me great joy. But I really should come back to it. I've just added it to my personal todo backlog. Honest.)

The astute reader might be asking at this point: if your rsync copy excluded all directories named tmp, why was this bug affecting only some of your users and some of your mailboxes? It turns out that while maildrop refuses to consider the mailbox as even existing if the tmp subdirectory is missing, the dovecot IMAP server knows how to detect this and knows that it is safe to take the simple corrective action of creating the directory. So every mailbox which had been accessed by a user since the migration had an appropriate tmp directory in place.

Mail delivery sorted; on to great victory!

Part 2: DevOps' dirty little Docker secret

Little did I know that another equally silly bug in /tmp handling would bite me only a few days later, when I was migrating the last VM away from my Xen VM server. This one was my internal file server, which runs Jellyfin, a community fork of the Emby media server.

Spoiler: DevOps' dirty little Docker secret is that containers are just Linux processes as a service, and whatever is ugly in the Linux you put in is ugly in the result you get out.

But then we would, okay, I’m going to get this application that is in a container from development. Cool. It’s—don’t look inside of it, it’s just going to make you sad, but take these containers and put them into production and you can manage them regardless of what that application is actually doing. It felt like it wasn’t so much breaking down a wall, as it was giving a mechanism to hurl things over that wall. Is that just because I worked in terrible places with bad culture? If so, I don’t know that I’m very alone in that, but that’s what it felt like.

Corey Quinn

I run Jellyfin in a Docker container on my file server. This container gets read-only access to my actual media files so that I know it's not going to modify or delete anything, and Jellyfin handles any necessary media conversion in its writable cache.
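Concretely, the read-only arrangement is nothing more than a bind mount with the :ro flag; the invocation looks something like this (the paths here are hypothetical stand-ins, not my actual setup):

```
# Media mounted read-only; config and transcoding cache stay writable
docker run -d --name jellyfin \
        -v /srv/media:/media:ro \
        -v /srv/jellyfin/config:/config \
        -v /srv/jellyfin/cache:/cache \
        jellyfin/jellyfin
```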

This worked well for me until I pulled the latest Docker image down and found that it was perpetually stopping and restarting, with the rather perplexing error message:

Failed to create CoreCLR, HRESULT: 0x80004005

I found a Jellyfin bug report explaining exactly this behaviour, but the resolution was a little unsatisfying:

This seemed to be an issue with the container, not Jellyfin itself. Closing. Thanks for your insight...

GitHub user Alcatraz077 in a comment

The most recent comment suggested reverting to a previous version, which is not a viable long-term solution. Digging into the container itself, I found that I could start it with bash as my entrypoint:

docker run --rm -ti --entrypoint /bin/bash jellyfin/jellyfin

But running Jellyfin itself kept giving the same error. Running apt update so that I could add a couple of helpful packages gave the first clue:

root@8f5667a9fd17:/# apt update
Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Err:1 http://deb.debian.org/debian bullseye InRelease
  Couldn't create temporary file /tmp/apt.conf.1w476m for passing config to apt-key
...

That didn't seem right, and sure enough, /tmp was just plain missing:

root@8f5667a9fd17:/# ls -la /tmp
ls: cannot access '/tmp': No such file or directory

I created /tmp, and Jellyfin kicked into life. So then it was just a matter of creating my own Dockerfile:

FROM jellyfin/jellyfin:latest
RUN mkdir /tmp && chmod 1777 /tmp

And building that for my Docker start scripts to use:

docker build -t jellyfin:local .

I'll update that bug shortly with an explanation and link here.