Unnecessary upload evasion with lftp mirrors


I’ve been using lftp’s reverse mirror feature for years to upload files to this blog. I’d never worked out how to avoid repeated file uploads. Until now.

Avoid uploading old files.
Image credits: Dave Gandy - Flaticon, Freepik - Flaticon, Wikimedia Commons.

A bit of background

Jekyll creates and populates a directory called _site when building the production version of a static site. To upload the files in the _site directory to my hosting provider, I’ve been using an lftp script like this:[1]

open sftp://<username>:not-a-password@ftp.<domain-name>
mirror -v --delete --reverse _site/ /public_html/

The open command opens a connection[2] to the SFTP server that my hosting service provides for my domain and logs me in. Since the protocol is SFTP, the connection runs over SSH, so I can authenticate with an SSH key instead of a password.

Note that the not-a-password component is a deliberate placeholder: lftp still expects a password before the @ symbol even though authentication uses an SSH key, so the placeholder value is ignored. I chose this value to remind me that it isn’t a password and that I shouldn’t ever put one here.

The mirror command normally downloads (pulls) files from an upstream, usually remote, source to the local system. The --reverse option swaps the direction, so files are instead uploaded (pushed) to the upstream system.

In the case I describe here, I push all files from within the _site/ directory to the /public_html/ directory on my hosting service.

When reading the documentation you will see the terms “source” and “target”. In a normal mirror, the “source” is the remote system and the “target” is the local system. With the --reverse option these are swapped: the local system is the “source” and the upstream system is the “target”.

The --delete option removes any files from the target that are not present in the source. In our case, that means anything within the /public_html/ directory tree that no longer exists in _site/. This ensures that if I delete or rename a file, it isn’t still floating around on the production system, which might confuse someone in the future.

The -v option turns on the first level of verbosity so that I can get feedback about what’s happening when mirroring the files to the upstream system.

The problem

So what’s the issue? Well, each time I mirror the site to production, the lftp script re-transfers all my files. In particular, lftp removes each file from upstream before uploading it again. This happens even if the files haven’t changed. Here’s what I mean:

$ lftp -f deploy_site.lftp
Removing old file `feed.xml'
Transferring file `feed.xml'
Removing old file `index.html'
Transferring file `index.html'
Removing old file `sitemap.xml'
Transferring file `sitemap.xml'
Removing old file `about/index.html'
Transferring file `about/index.html'
Removing old file `add-favicon-to-mm-jekyll-site/index.html'
Transferring file `add-favicon-to-mm-jekyll-site/index.html'

<snip>

Not only is this annoying, but it’s a waste of network resources and time. I’d tried in the past to get lftp to upload only changed files, but never found the right incantation. Until today.

The solution

If you read the lftp man page, you’ll find the --only-newer option in the mirror section. Adding this option to the mirror command mentioned earlier, we get:

mirror -v --only-newer --delete --reverse _site/ /public_html/

Using this command you’ll find that it still transfers all files upstream. Gah! Why doesn’t this work?

Today I managed to stumble upon why this is so. An answer to the StackOverflow question Why lftp mirror --only-newer does not transfer “only newer” file? mentions a subtlety noted on Matthieu Bouthours’ blog and seemingly not mentioned anywhere else:

When uploading, it is not possible to set the date/time on the files uploaded, that’s why --ignore-time is needed.

Therefore, as mentioned in the StackOverflow answer:

[I]f you use the flag combination --only-newer and --ignore-time you can achieve decent backup properties, in such a way that all files that differ in size are replaced. Of course it doesn’t help if you really need to rely on time-synchronization but if it is just to perform a regular backup of data, it’ll do the job.

Updating the mirror command like so:

mirror -v --only-newer --ignore-time --delete --reverse _site/ /public_html/

fixes the issue and only uploads new or newly changed files, which is the desired behaviour. Yay! 🥳
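Before trusting a new flag combination on the live site, it can help to preview what mirror would do. The sketch below wraps the same commands in a shell function and adds mirror's --dry-run option, which prints the planned actions instead of executing them. The function name preview_deploy is my own invention for illustration, and <username> and <domain-name> are the same placeholders as in the script above; this is one possible way to check the behaviour, not a required step.

```shell
#!/bin/sh
# Hypothetical helper: show what the deployment would transfer or delete,
# without touching the remote site.
preview_deploy() {
    # -e runs the given commands after lftp starts; --dry-run makes mirror
    # print the actions it would take rather than performing them.
    lftp -e "
open sftp://<username>:not-a-password@ftp.<domain-name>
mirror --dry-run -v --only-newer --ignore-time --delete --reverse _site/ /public_html/
bye
"
}

# Once the placeholders are filled in, run this from the site's root directory:
#   preview_deploy
```

If the preview lists every file in the site, the flags aren’t having the intended effect; if it lists only the pages touched since the last deployment, dropping --dry-run should upload just those.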

Implementing this change in my build scripts reduced build and deployment times from 5.5 minutes to 2 minutes. That’s more than halved the time! Brilliant!

A word of caution

There’s a caveat here, though. With --ignore-time, size is the only thing compared, so if a file changes but happens to be exactly the same size as its counterpart upstream, it won’t be transferred. One needs to bear this in mind.
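To make the caveat concrete, here is a small local demonstration that doesn’t involve lftp at all. It creates two versions of a made-up page whose contents differ but whose byte counts are identical, then compares them by size and by content. The file names and contents are invented for the example.

```shell
#!/bin/sh
# Two hypothetical versions of the same page: one character differs,
# but the byte count is identical.
workdir=$(mktemp -d)
printf 'Published: 2021\n' > "$workdir/old.html"
printf 'Published: 2022\n' > "$workdir/new.html"

# Print a file's size in bytes (tr strips padding some wc versions add).
size_of() { wc -c < "$1" | tr -d ' '; }

old_size=$(size_of "$workdir/old.html")
new_size=$(size_of "$workdir/new.html")

# A size-only comparison considers the file unchanged.
if [ "$old_size" -eq "$new_size" ]; then
    size_verdict="unchanged"
else
    size_verdict="changed"
fi

# A byte-by-byte comparison notices the difference.
if cmp -s "$workdir/old.html" "$workdir/new.html"; then
    content_verdict="unchanged"
else
    content_verdict="changed"
fi

echo "size check:    $size_verdict"
echo "content check: $content_verdict"
rm -rf "$workdir"
```

A one-character edit such as fixing a typo or bumping a year is exactly the kind of change that a size-only comparison can miss.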

To be honest, I’d prefer to use rsync because it can compare file checksums (via its --checksum option) rather than relying on size alone. Then I could be more certain that my scripts upload every changed file and don’t transfer unchanged ones unnecessarily. However, until I have that option, this will do the job nicely.

  1. Unfortunately my hosting service doesn’t allow rsync (at least not at my service level) and hence I can’t use a more sophisticated synchronisation mechanism. 

  2. Thank you, Captain Obvious! 
