Adventures in collating text in Linux, Docker and Postgres
It’s interesting how bugs can lead to a deeper understanding of how computer systems work. It’s also amazing how nuanced some technical topics can be. For instance, one would think that sorting text should have the same behaviour in any English-like locale, right? Wrong. Join me on a dive down a rabbit hole that I had at work a while ago.
All good problems start with a failing test
A weird problem turned up in one of our projects: the test suite was passing on my local dev box and on our Jenkins instance, but not within a Docker container running on our GitLab-CI system1.
The failing test checked that a list was in alphabetical order. In particular, we expected the list to be in this order:2
'Fast Ice', 'FYI (First Year Ice)', 'MYI (Multi Year Ice)'
However debugging the output from the Docker container on GitLab-CI gave this ordering:
'FYI (First Year Ice)', 'Fast Ice', 'MYI (Multi Year Ice)'
After first having thought “Err, what??”, somewhere at the back of my brain I had another thought: “Hrm, I’ve seen something like this before”. I find this happens quite often now. Perhaps this is one of the few advantages to getting older: there’s much more experience to draw from when debugging.
Many ideas to try to get the sorting sorted
So, I was fairly sure that the problem looked like a collation issue (i.e. the sorting order of text and something that is often related to the locale).
The first thing I checked was the LC_ALL
setting, which can be used to
override all locale settings. This wasn’t the problem, because it wasn’t
set anywhere. An interesting part of working out if LC_ALL
was a problem
was finding out that one shouldn’t use
LC_ALL
.
How about the LANG
setting (i.e. the default locale)? On my dev box I set
it to en_NZ.UTF-8
(because I can); on our production systems we usually
use en_US.UTF-8
because, well, that’s the most common English variant on
computer systems these days. Checking the output of the locale
command
didn’t show up anything that should be causing problems.
Setting LANG=C
in the environment also didn’t change anything, which I
thought odd, because the C
(POSIX) sorting order is slightly different
from the en_*.UTF-8
sorting orders. Hrm.
Ok, what’s the next thing to try? Maybe the problem was UTF8? I mean, it’s really unlikely, but still possible, right? Also no change; using a non-UTF-8 locale didn’t fix the failing test. Oh well.
Maybe the environment was being reset in each command in the GitLab-CI
config? I.e. what about setting LC_COLLATE
or LANG
just for the test
command itself? Wrong again.
Next idea: was it a discrepancy between the dependencies installed in the container and those used on my local system (e.g. a version was different somewhere and hence an upstream bug was causing the problem)? Nope, that wasn’t it either.
Was there something in the Docker container runtime environment that was a bit dodgy? No, everything looked good.
What about the Python version? Did the version being used in GitLab-CI not match what was being used in the other environments? Unfortunately, everything matched.
Maybe the Debian release wasn’t right? For instance, I use Buster on my local development system, maybe I was using Stretch on the GitLab-CI system (you never know)? Hrm, another dead end.
Perhaps the Postgres version was too new? Could it be the case that some behaviour had changed in the collation? I mean, this was also really unlikely, but stuff like this happens and it’s worthwhile trying to remove all potential sources of error. After all:
When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth.
Unfortunately, this idea was also a dead end.
And then a penny dropped. What if the issue isn’t at the application or
shell level? What if setting environment variables for the locale wasn’t a
sufficiently fundamental change (somehow not happening “low enough down” in
the system)? What if Postgres is using a different collation to that
specified in the shell (e.g. via LANG
)?
I tried installing Postgres in the Docker container with the LANG
variable
prefixed. I.e.
$ LANG=en_US.UTF-8 apt install -y postgresql
I will have gotten this idea from somewhere, however I can’t find the link anymore. This didn’t help things either, so the problem must be “further down” somehow.
Maybe Postgres really is where the problem is hiding?
Another penny that had been circling my head for about an hour or so dropped. Does Postgres define its own collation?
The answer: sort of. It seems to use the value from the operating system at
the time of installation. In other words, setting the LANG
variable at
installation time (as tried in the previous section) was on the right track,
but not quite there yet.
A bit more digging around online (mostly StackOverflow) produced the “handy command of the day”:
$ sudo -u postgres psql -c 'select datname, datcollate from pg_database;'
This shows the collation locale settings for each database in the Postgres installation. For example:
datname | datcollate
----------------------+-------------
postgres | en_NZ.UTF-8
template1 | en_NZ.UTF-8
template0 | en_NZ.UTF-8
On my local box this was en_NZ.UTF-8
(I’m a stubborn Kiwi living in
Germany, which makes life interesting at times: I find all kinds of corner
cases with locales), but in the Docker container it was C
, which as far as
I can tell is also the “POSIX” collation.
This is what my brain had been trying to tell me (from roughly 2 hours ago):
one system is using C
(a.k.a.
POSIX)
collation (which lists capital letters first3) and the
other system was using English collation (lower case letters first). In
other words: the Postgres installation on the GitLab-CI system was using the
POSIX collation convention because as far as it was concerned, the LANG
environment was C
, however on all other systems I’d been testing on (my
local development box and our Jenkins server) the LANG
environment was
some variation on en_*.UTF-8
. Ok, I think I’ve started to understand
the problem. We’re getting somewhere!
(Insert here a long diversion with me reading the specs for the LC_*
variables, how
collation works in glibc and in
unicode. Yes, I even read the
CLDR DTDs
for unicode. Hey, I’m a geek!)
Aside: note that it’s possible to set the datcollate
table entries
directly in Postgres and hence change the collation settings dynamically.
Fortunately doing that didn’t solve the problem for me, as that would have
taken me down a very evil path.
Building a solution to the problem
Now, how to set that up properly and repeatably within a GitLab-CI Docker container?
After much wailing and gnashing of teeth, the solution I came up with was:
- install the locales package (
apt-get install -y locales
) - set an English locale to use (
echo 'en_US.UTF-8 UTF-8' >> /etc/locale.gen
);en_US
is more standard thanen_NZ
; I don’t want to confuse my colleagues too much - generate the locale just defined with the
locale-gen
command - lastly, update the locale at the system level (
update-locale LANG=en_US.UTF-8
) before installing Postgres viaapt-get install -y postgresql
4
In code (as part of the GitLab-CI configuration), this ended up looking like:
default:
image: python:3.7-buster
stages:
- test
tests:
stage: test
script:
- apt-get update && apt-get -y install virtualenv libpython3-dev gettext libjpeg-dev libgraphviz-dev locales
- echo 'en_US.UTF-8 UTF-8' >> /etc/locale.gen
- locale-gen
- update-locale LANG=en_US.UTF-8
- apt-get -y install sudo postgresql
... more setup ...
- make test
An interesting issue popped up with this solution because one should be able
to set the locale at the system level via the localectl
command, however
this command isn’t available in Python Docker images (such as
python:3.7-buster
) because
systemd isn’t used. This is why I
had to use update-locale
above.
The Debian and Arch Linux docs were very helpful in working out these details; they form an excellent resource when trying to solve such low-level problems.
Now that the value in the datcollate
column is set in Postgres how we want
(namely to en_US.UTF-8
) we find that the tests now pass as expected,
because we now have the sorting order that we expected to have in the
beginning. Cool!
The interesting question now is: what was I trying to do before I started all of this???
I aim to please!
I hope that this post was helpful or useful! If you have any comments, questions or feedback, feel free to ping me on Mastodon or drop me a line via email.
-
Why have Jenkins and GitLab? Well, I was in the process of migrating our CI infrastructure from Jenkins to GitLab-CI at the time, hence I was using both systems. ↩
-
For those who might be wondering: fast ice isn’t ice that’s going really fast; it’s actually ice that’s stuck fast, i.e. not moving. ↩
-
Some sources (e.g. the Postgres collation docs) refer to this ordering as the “Traditional C” or “POSIX” collation sequence. The Open Group and IEEE specifications imply the ordering where capitals precede lower case letters. This is also known as the ASCIIbetical order because it follows the “alphabetical” ordering used in the ASCII character set. ↩
-
Note that even though the desired locale is defined, this doesn’t mean that setting it globally in the environment or locally for the test command (or even installation command) has the desired effect:
datcollate
stays set to “C”. Therefore it’s necessary to set the locale at the system level viaupdate-locale
before thepostgresql
package is installed. ↩
Support
If you liked this post and want to see more like this, please buy me a coffee!