Unnecessary upload evasion with lftp mirrors
I’ve been using lftp’s reverse mirror feature for years to upload files to this blog. I’d never worked out how to avoid repeated file uploads. Until now.
A bit of background
Jekyll creates and populates a directory called _site when building the production version of a static site. To upload the files in the _site directory to my hosting provider, I’ve been using an lftp script like this:[1]
open sftp://<username>:not-a-password@ftp.<domain-name>
mirror -v --delete --reverse _site/ /public_html/
The open command opens a connection[2] to the SFTP server that my hosting service provides for my domain and logs me in. Since the protocol is SFTP, the connection uses SSH and I can use an SSH key for authentication. This is great because then I don’t need to use a password.
Note that the not-a-password component is an important placeholder: it’s there because a password is still expected before the @ symbol even though authentication uses a public key. The authentication mechanism ignores it. I chose this placeholder value to remind me that it isn’t a password and that I shouldn’t ever put one here.
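If ssh doesn’t pick up the right key by itself, lftp’s sftp:connect-program setting lets you spell out the ssh command used for sftp:// connections. Here’s a minimal sketch of how that could look. The helper name print_login, the example user and host, and the key path are all my own illustrative assumptions and not part of my actual script:

```shell
#!/bin/sh
# Print the login part of an lftp script for key-based SFTP.
# Arguments (all illustrative): username, host, path to the private key.
print_login() {
    user=$1 host=$2 key=$3
    # sftp:connect-program is the ssh command lftp runs for sftp:// URLs.
    # "ssh -a -x" is lftp's documented default; -i selects the key file
    # (the key path is an assumption, adjust it to your own setup).
    printf 'set sftp:connect-program "ssh -a -x -i %s"\n' "$key"
    # The password slot stays a placeholder, never a real password.
    printf 'open sftp://%s:not-a-password@%s\n' "$user" "$host"
}
```

Calling print_login alice ftp.example.org "$HOME/.ssh/id_ed25519" prints the two lines that would go at the top of the lftp script.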
The mirror command normally downloads files from an upstream (usually remote) source to a local system. In other words, the usual mirror process pulls files from upstream. The --reverse option swaps the direction of the mirror mechanism, so files are instead uploaded (i.e. pushed) to the upstream system.
In the case I describe here, I push all files from within the _site/ directory to the /public_html/ directory on my hosting service.
When reading the documentation you will see terms like “source” and “target”. When mirroring to a local system, the “source” is the remote system and the “target” is the local system. With the --reverse option, these are swapped: the local system is now the “source” and the upstream system is the “target”.
The --delete option removes any files from the target system which are not present in the source. In our case, the target is the /public_html/ directory tree. This ensures that if I delete or rename a file, it isn’t still floating around on the production system, which might confuse someone in the future.
The -v option turns on the first level of verbosity so that I can get feedback about what’s happening when mirroring the files to the upstream system.
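Because --delete removes things upstream, it can be reassuring to preview a run first. The mirror command has a --dry-run option (the same as --just-print) which prints the commands it would run instead of running them. Here’s a small sketch of how one might toggle it. The function name mirror_line and its "preview" argument are just illustrative:

```shell
#!/bin/sh
# Build the mirror line for the deploy script.
# $1 = "preview" to add --dry-run; anything else (or nothing) for a real run.
mirror_line() {
    opts="-v --delete --reverse"
    if [ "${1:-}" = "preview" ]; then
        # --dry-run only prints what mirror would do; nothing is changed.
        opts="--dry-run $opts"
    fi
    printf 'mirror %s _site/ /public_html/\n' "$opts"
}
```

Running mirror_line preview gives a mirror line that can go into a copy of the deploy script to check what would be uploaded or deleted before doing it for real.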
The problem
So what’s the issue? Well, each time I mirror the site to production, the lftp script re-transfers all my files. In particular, lftp removes each file from upstream before uploading it again. This happens even if the files haven’t changed. Here’s what I mean:
$ lftp -f deploy_site.lftp
Removing old file `feed.xml'
Transferring file `feed.xml'
Removing old file `index.html'
Transferring file `index.html'
Removing old file `sitemap.xml'
Transferring file `sitemap.xml'
Removing old file `about/index.html'
Transferring file `about/index.html'
Removing old file `add-favicon-to-mm-jekyll-site/index.html'
Transferring file `add-favicon-to-mm-jekyll-site/index.html'
<snip>
Not only is this annoying, but it’s a waste of network resources and time.
I’d tried to get lftp to only upload changed files in the past, but never seemed to find the right incantation. Until today. Today, I finally found the information I needed to make this work.
The solution
If you read the lftp man page, you’ll find the --only-newer option in the mirror section. Adding this option to the mirror command mentioned earlier, we get:
mirror -v --only-newer --delete --reverse _site/ /public_html/
Using this command you’ll find that it still transfers all files upstream. Gah! Why doesn’t this work?
Today I managed to stumble upon why this is so. An answer to the Stack Overflow question “Why lftp mirror --only-newer does not transfer "only newer" file?” mentions a subtlety noted on Matthieu Bouthours’ blog and seemingly not mentioned anywhere else:
When uploading, it is not possible to set the date/time on the files uploaded, that’s why --ignore-time is needed.
Therefore, as mentioned in the Stack Overflow answer:
[I]f you use the flag combination --only-newer and --ignore-time you can achieve decent backup properties, in such a way that all files that differ in size are replaced. Of course it doesn’t help if you really need to rely on time-synchronization but if it is just to perform a regular backup of data, it’ll do the job.
Updating the mirror command like so:
mirror -v --only-newer --ignore-time --delete --reverse _site/ /public_html/
fixes the issue and only uploads new or newly changed files, which is the desired behaviour. Yay!
Implementing this change in my build scripts reduced build and deployment times from 5.5 minutes to 2 minutes. That’s more than halved the time! Brilliant!
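To make the “all files that differ in size are replaced” rule concrete, here’s a rough model of it using two local directories. This is only my sketch of the decision as described in the answer quoted above, not lftp’s actual code. The function name plan_by_size is made up, and the second directory merely stands in for the remote copy:

```shell
#!/bin/sh
# Model of the size-only upload rule on two local directories.
# $1 = source directory (think _site), $2 = stand-in for the remote copy.
plan_by_size() {
    src=$1 dst=$2
    # List the source files relative to $src, in a stable order.
    ( cd "$src" && find . -type f | sort ) | while IFS= read -r f; do
        if [ ! -f "$dst/$f" ]; then
            echo "upload ${f#./}"    # new file: not present in the target
        elif [ "$(wc -c < "$src/$f")" -ne "$(wc -c < "$dst/$f")" ]; then
            echo "upload ${f#./}"    # sizes differ: treated as changed
        fi                           # same size: skipped, times are ignored
    done
}
```

Pointing this at _site and a local copy of what’s upstream lists the files that would count as new or changed under this rule. Everything else is left alone, which is where the time savings come from.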
A word of caution
There’s a caveat here, though. If a file is changed and just so happens to be of the same size as its counterpart upstream, it won’t be transferred. One needs to bear this in mind.
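One way to keep an eye on this is to compare the new build against a copy of the previous one before deploying. The sketch below lists files whose content changed while their size stayed the same, which are exactly the ones a size-only comparison would miss. It assumes I keep the previous build around in some directory (say, a hypothetical _site.prev), and the function name same_size_changes is my own invention:

```shell
#!/bin/sh
# List files that changed between two builds without changing size.
# $1 = previous build (hypothetical copy, e.g. _site.prev), $2 = new build.
same_size_changes() {
    old=$1 new=$2
    ( cd "$new" && find . -type f | sort ) | while IFS= read -r f; do
        # Brand new files are uploaded anyway, so skip them here.
        [ -f "$old/$f" ] || continue
        # Files whose size differs are picked up by the mirror, skip them too.
        [ "$(wc -c < "$old/$f")" -eq "$(wc -c < "$new/$f")" ] || continue
        # cmp -s is silent and returns non-zero when the contents differ.
        cmp -s "$old/$f" "$new/$f" || echo "${f#./}"
    done
}
```

Any file this prints would need to be uploaded some other way, for instance by falling back to the plain mirror command without --only-newer and --ignore-time for that deployment.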
To be honest, I’d prefer to use rsync because it can compare checksums of the files (via its --checksum option) to detect file changes. Then I could be more certain that my scripts upload only changed files and don’t transfer unchanged ones unnecessarily. However, until I have that option, this will do the job nicely.
[1] Unfortunately my hosting service doesn’t allow rsync (at least not at my service level) and hence I can’t use a more sophisticated synchronisation mechanism.
[2] Thank you, Captain Obvious!