Processing files with spaces in filenames

Sometimes I get files from friends who use certain graphical operating systems, where it’s ok to use spaces in filenames. Processing these files on Unix isn’t that much fun since spaces separate commands and options on the command line. Thus it sucks to try and process a list of files and get the output:

My: No such file or directory
Cool: No such file or directory
File.txt: No such file or directory

when really the file My Cool File.txt should have been read and processed. How to get around this? My usual procedure is simply to convert the spaces to underscores by hand:

$ mv My\ Cool\ File.txt My_Cool_File.txt

This is slow and error-prone, since one needs to remember to escape every space, but for one-off instances it is a pragmatic option. With a long list of such files, however, it becomes amazingly tedious. So, for once, I decided to sit down and work out how to convert the spaces to underscores with a for loop in bash.

Internal Field Separator to the rescue

After spending ages trying to work out how to e.g. pipe the output of ls into a bash array, or use some of the various quoting options to ls [1], I eventually stumbled across the IFS environment variable.

IFS stands for “Internal Field Separator” [2]. It is the shell variable that determines where the shell splits the result of an unquoted expansion into words, for example the list of items that a for loop iterates over. By default IFS contains space, tab and newline, thus filenames with spaces in them get split into multiple parts and hence commands can’t find a file called My, Cool or File.txt.
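
To see the splitting in action, here is a small sketch that counts how many words bash makes out of an unquoted filename, first with the default IFS and then with IFS set to a newline only. The helper count_fields and the variable names are my own, purely for illustration.

```shell
#!/usr/bin/env bash
# Illustrative helper (not a standard command): print how many
# arguments it received, i.e. how many words the shell produced.
count_fields() {
    echo "$#"
}

name="My Cool File.txt"

# Default IFS (space, tab, newline): the unquoted name is split at each space.
default_count=$(count_fields $name)

# IFS set to newline only (inside the subshell): spaces no longer split the value.
newline_count=$(IFS=$'\n'; count_fields $name)

echo "default IFS: $default_count fields"
echo "newline IFS: $newline_count field"
```

With bash this should report 3 fields for the default IFS and 1 field once only newlines act as separators.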

One solution is to set IFS to a new value, process the files and then set IFS back to its old value. I got this tip from a blog post about handling filenames with spaces in bash, and I thought I’d share my take on what is effectively an already solved problem. Why? Well, maybe next time I’ll be able to find the solution quicker, and by writing about it I might remember it instead of having to rely on StackOverflow and Google.

A fictitious but realistic example

Let’s imagine that you’ve been given the following list of files:

A Tale of Two Cities.txt
Beowulf.txt
Pride and Prejudice.txt
The Adventures of Tom Sawyer.txt
The Count of Monte Cristo.txt
The Importance of Being Earnest - A Trivial Comedy for Serious People.txt

Let’s also imagine that before doing simple things like mv, cp or even for name in $(ls *.txt); do aspell -c $name; done, you’d first like to convert all spaces in the filenames to underscores. Also, you’d like the sequence “space-hyphen-space” to be converted to a single underscore. To achieve this, you’d use a loop something like the following:

SAVEIFS=$IFS
IFS=$(echo -en "\n\b")  # or $'\n\b'
for file in $(ls)
do
    new_file=$(echo $file | sed 's/ - /_/g' | sed 's/ /_/g')
    mv $file $new_file
done
IFS=$SAVEIFS

This code needs a bit of explaining:

  • line 1: the current value of IFS is saved in SAVEIFS so that we can set it back later.
  • line 2: IFS is set to the output of echo for the sequence newline followed by backspace ("\n\b"). The -e switch makes echo interpret the backslash escapes; the -n switch suppresses the trailing newline which echo would normally append to its output. The original blog post doesn’t say why this sequence is used, but the backspace is most likely there because command substitution ($()) strips trailing newlines: $(echo -en "\n") on its own would leave IFS empty. The backspace protects the newline and is rare enough in filenames not to cause trouble as a separator. The main point here is that it works. Note also that $'\n\b' would also work and doesn’t require echo to run; in that form a plain $'\n' is sufficient too.
  • line 3: loop over files in the current directory, selected by a simple ls. The $() around ls executes the command and is considered better practice than using backticks (which one would have used in yesteryear).
  • line 5: the new filename is determined by substituting “space-hyphen-space” and “space” in the original filename by underscores (the g means globally, so the substitution happens for all occurrences of the pattern within the filename).
  • line 6: the old filename is replaced with the new filename.
  • line 8: IFS is reset to its original value (now weird things won’t happen in later scripts or processing within the current shell session).
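
The newline-plus-backspace sequence from line 2 makes more sense once you see what command substitution does to a trailing newline. The following is a minimal sketch with made-up variable names, assuming bash and its builtin echo; it compares the length of each candidate value.

```shell
#!/usr/bin/env bash
# Command substitution strips trailing newlines, so this ends up empty.
only_newline=$(echo -en "\n")

# The trailing backspace shields the newline from being stripped.
newline_backspace=$(echo -en "\n\b")

# ANSI-C quoting involves no command substitution, so a bare newline survives.
ansi_newline=$'\n'

echo "only_newline length:      ${#only_newline}"
echo "newline_backspace length: ${#newline_backspace}"
echo "ansi_newline length:      ${#ansi_newline}"
```

The lengths should come out as 0, 2 and 1, which suggests the backspace is only there to keep the newline alive inside $().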

Running the code gives us the output we want:

$ ls
A_Tale_of_Two_Cities.txt
Beowulf.txt
Pride_and_Prejudice.txt
The_Adventures_of_Tom_Sawyer.txt
The_Count_of_Monte_Cristo.txt
The_Importance_of_Being_Earnest_A_Trivial_Comedy_for_Serious_People.txt

Note that mv will complain that Beowulf.txt and Beowulf.txt are the same file, because the loop tries to rename that file to the name it already has. This isn’t a problem, but you will see the message, just so you know.

Voila!

And that’s it! Now it’s possible to process the files more easily since they don’t contain spaces. If one wished, though, it would also be possible to process them with the spaces intact, using the command line and knowledge of the IFS environment variable.
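
As a rough sketch of processing files with the spaces left in place, suppose the filenames are stored one per line in a text file. The function name count_lines_from_list and the list file are hypothetical, and filenames containing newlines are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<line count> <name>" for every file
# named in a list file (one filename per line, spaces and all).
count_lines_from_list() {
    local list=$1 file
    # IFS= (empty, for read only) keeps leading and trailing whitespace;
    # -r keeps backslashes literal.
    while IFS= read -r file
    do
        # Quoting "$file" keeps the name in one piece; tr removes the
        # padding some wc implementations add.
        printf '%s %s\n' "$(wc -l < "$file" | tr -d ' ')" "$file"
    done < "$list"
}
```

Called as count_lines_from_list books.list, it reads each name whole, because IFS is emptied for the read command only and the rest of the shell session is left untouched.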

[Update: use globbing instead of ls]

It turns out the solution is much easier than described above. The day after I wrote the above text, a link to the Bash Pitfalls page turned up in my Twitter feed. The first entry on that page explains that it’s not at all necessary to use the output of ls as input to a for loop (and in fact that it’s a really bad idea, since the filenames could contain spaces!). The solution is to match the filename pattern we’re interested in with a glob (such as *.txt) and then to quote the expansion when using it within a command. The glob hands each filename to the loop as a single word, so we don’t need to muck around with internal field separators at all, which can only be a good thing. Thus, the shell code we need to write looks like this:

for file in *.txt
do
    new_file=$(echo "$file" | sed 's/ - /_/g' | sed 's/ /_/g')
    mv "$file" "$new_file"
done

Notice the improved for statement on line 1 (and the lack of environment variable setup) and the quotes around the shell variable expansion on lines 3 and 4. This is shorter, simpler and easier to read and understand, which is a definite improvement.
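
Going one step further, here is a sketch of the same loop using bash parameter expansion instead of sed, and skipping files whose names don’t change (so mv no longer complains about Beowulf.txt). The function names underscore_name and rename_txt_files are my own invention, and the ${var//pattern/replacement} form assumes bash rather than plain sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace " - " and then any remaining spaces
# with underscores, without calling sed.
underscore_name() {
    local name=$1
    name=${name// - /_}   # "space-hyphen-space" becomes one underscore
    name=${name// /_}     # every remaining space becomes an underscore
    printf '%s\n' "$name"
}

# Hypothetical helper: rename all *.txt files in the current directory.
rename_txt_files() {
    local file new_file
    for file in *.txt
    do
        # If the glob matched nothing, "$file" is the literal pattern: skip it.
        [ -e "$file" ] || continue
        new_file=$(underscore_name "$file")
        # Nothing to do when the name is unchanged, e.g. Beowulf.txt.
        if [ "$file" = "$new_file" ]
        then
            continue
        fi
        # "--" stops mv from treating a name starting with "-" as an option.
        mv -- "$file" "$new_file"
    done
}
```

To try it out, change into the directory containing the files and run rename_txt_files. Note that mv will overwrite an existing file of the same name, so it is worth testing on a copy of the files first.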

  1. I learned that ls has the -Q and --quoting-style options, which could one day come in handy. For instance, ls --quoting-style=escape escapes special characters in filenames and ls -Q puts double quotes around the filenames, although neither helps us here, unfortunately. It’s amazing what reading man pages can bring sometimes.

  2. You can look it up in the bash man page, although it wasn’t obvious how to solve the current problem from the manual text. 
