PDF comparator: First take
==========================
Published: 2024-12-31
Last update: 2025-01-06
I have completed the initial implementation to replace my old shell script.
This is part of a series of posts; the project intro article can be found here.
TL;DR: Relevant GitHub repository branch can be found here.
Motivation
I have written earlier about the basic reasons why a PDF comparison tool would be great, but today I would like to share some specifics as well. Let's start by looking at my pathetic little how-to text notes that used to help me remember what to do when I needed to compare PDF files.
First there are a couple of instructional comments:
> How to compare two PDF files page-by-page for differences
>
> 1) Install imagemagick
>
> 2) Convert PDF to multiple PNG:
>    convert -density 150 input.pdf "%03d.png"
>
> Notes:
> * density unit is DPI (150 in the example)
> * specify input page range as e.g. "input.pdf[0-4]" for the first five pages
> * if there are more than 999 pages in the PDF, change the '3' in name
>   template accordingly
> * the "-alpha remove -alpha off" options can be used to tackle transparent
>   background
>
> 3) Iterate over the folders running 'compare' on the two versions of page
>    images.
Then there is a Bash script implementing point 3):
> #!/bin/bash
> # Compare every image in first dir to the one having the same name in second
> # Output to current dir (warning: overwrites existing files)
> dir1="$1"
> dir2="$2"
> for file in "$dir1"/*; do
>     name=$(basename "$file")
>     compare -fuzz 1000 "$file" "$dir2/$name" "./$name"
> done
These notes are not that old to be honest (a year or two max), but let's analyze them as if I had found them on a 10-year-old hard drive and was now trying to make heads or tails of them. Maybe we can gather some useful clues for our next implementation.
The how-to started out as a plain-text file probably because the first command, `convert`, is a one-liner, executed twice with identical options (once for each of the two input files). Including it in a script would not have helped me much (I must have thought), as three out of the five tokens are parameters which potentially need adjustment. In reality, 150 DPI is fine most of the time, we need the entire page range, and three digits are plenty. For the `compare`, we do need the for loop, as the command has to be executed for each page, hence the script. I have found that the fuzz parameter did not need much adjustment once I settled on a value that worked well.
It is obvious that this is a temporary solution requiring heavy user engagement: the first commands have to be executed manually. The user has to read all the notes and make a conscious decision about which parameters to use. Even with the script, the fuzz might require adjustment by changing the code itself. On the other hand, this is a fairly quick solution which could be created in 10-20 minutes. It is simple and easy to comprehend in a moment. There is also the flexibility: for example, when I found a PDF specimen whose background had turned transparent for some reason after conversion, I could just look up in the ImageMagick manual how to deal with that, and add it as a note.
I know this is ugly and I was hesitant to show it here, but this is reality: there are simply times when something has to be done quick and dirty. When I want to do something like this for the first time, I search for the commands and execute them on the fly. The second time, I look up the same commands, feel the déjà vu, and decide to save them as a text note, so that the third time it will not take the same 20 minutes as before, just two maybe. And yes, many times it gets stuck in that temporary state, never receiving the proper treatment to become at least a functional Bash script.
My relationship with Bash
I have mixed feelings towards the Bash shell: love and hate at the same time. I have found that the learning curve has a strange bump in it. If I have to do something *very* simple like the script above, I can figure it out in no time. Then there are times when I need something a bit more complex, for example if I wanted to turn the first text note above into a fully functional script. I would have to query the number of pages, handle if something goes wrong, store the intermediate values in variables, do some operations on them, and finally create the pattern that would be used with the final command. That does not sound very complicated, right? In reality, if I consider what it would take to make it work properly, I can foresee a couple of hours of struggle.
It may be just me, but I have some kind of mental block when it comes to Bash syntax; I constantly have to look up the simplest of things because they are not intuitive. Or worse, often there are multiple solutions, some of them simple but failing horribly in unexpected situations, like when fed paths with space characters or something similar. It feels like operating an aircraft: the controls are familiar and you may have the general experience, but there are still the checklists which you must consult every time you do anything, because forgetting one bullet point might very well cost lives. Don't get me wrong, after one or two hours, when I have warmed up, it is great and I feel I can do anything in Bash. I just need some serious motivating force to push me over the threshold.
There is also the impostor syndrome variant I have: the fear of losing the internet connection just at the moment when I have to look up something critical in the documentation. This is not about looking stupid or anything silly like that; generally I do not worry about such things. It is about compromises: I know very well what I would have to do to be prepared for these situations, it is simply not worth the effort. Also, in the past I was always able to work something out when out in the field, I am just really not looking forward to challenges like that. We should face it: we have been spoiled and must make a conscious effort to retain the skills that enable us to do our job without a connection, search engines, and especially AI.
Possible solutions
After that bit of a psychological detour, I would like to address the question: okay, but what solutions might exist to the problem? First and foremost, I could just sit down and try to find a good book on Bash which contains every little detail I could need. In fact this would probably be the best solution, but at the same time the most boring one. I might still do it some (other) time, and I will make sure to link it here when I arrive at a final pick.
What other technological possibilities exist? I could use some other shell; there are many, and I am sure most of them are pretty easy to install on any Linux system I might come across. It is hard to fight the ubiquity of Bash though: I could only use other shells in projects that do not need to be highly reliable, in the sense that anyone should be able to fix it if something goes wrong and I am not available. For example, no one would care whether my PDF tool is defective or not, but someone would care if one of the building monitoring systems I am responsible for ceased to work. As the saying goes, everyone is their own point of reference: I would probably roll my eyes if I suddenly had to fiddle with someone else's script written for a shell unknown to me. I should not continue on this train of thought, because I think it leads to a pitfall, especially for those of us who do freelance work: does this mean we can only choose a technology that is widely accepted as the default choice in a particular field? I leave the answer for another time, because in fact we are still talking about the PDF tool no one cares about, so I can use whatever I want :)
If not a shell script, then maybe a small program written in some interpreted language? This is a field where I admit I have a large gap in my knowledge. I never really felt the urgent need to plug the hole; I always managed these small tasks with just Bash and simple compiled programs. I mentioned default choices in the industry, so I should bring up Python, but I really do not want to say anything because I know nothing about it and everything I have heard might just be hearsay. I kind of understand its appeal but never felt any attraction myself. There is also R, which I believe is similar in a sense; I would love to learn it one day, but I am not sure at this stage whether it would be useful for these kinds of jobs (it might be more about data manipulation). Best of all would be to reignite my old interest in Basic: I was a big fan of VB once, and this could be quite beneficial in my retrocomputing endeavors as well, I believe. Maybe I will look around for some modern environment for that some time.
At last we arrive at our destination ("about time", you must be thinking): a compiled program executing others as subprocesses. I will use Rust, and we will see whether this is a viable alternative or takes too much time.
Implementation
My half-cooked reference "script" took 26 lines of code/text. In comparison, version 0.1.0 of the portable-document-comparator program extends to roughly 280 lines. Without the license notes, that is still 10 times as much. About the same ratio can be observed in the time needed: it took me a whole 8-hour shift's worth of work to arrive at this minimum viable product, whereas I believe I could have done the script in 45-50 minutes once the commands themselves had been worked out. These numbers include some preliminary trial runs in both cases, but definitely do not include creating the test files themselves.
At the center of the implementation are the `std::process::Command` invocations, like this one (from `lib.rs` lines 25-36, link):
> let status = Command::new("convert")
>     .args([
>         "-density",
>         "150",
>         "-alpha",
>         "remove",
>         "-alpha",
>         "off",
>         &config.left_file.to_string_lossy(),
>         &config.left_dir.join("%03d.png").to_string_lossy(),
>     ])
>     .status()?;
This of course does not feel as natural as the shell version, but I can live with that. Here I have left the default settings in place, in which the subprocess inherits stdin, stdout and stderr. In a separate sandbox demo I experimented with pipes and catching the output, which worked well but was not needed for the `convert` and `compare` calls.
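That sandbox demo is not part of the repository, but a minimal sketch of the idea could look like the following (the `identify` invocation is only an illustrative example, not what the demo actually ran):

> use std::process::Command;
>
> fn main() -> std::io::Result<()> {
>     // Unlike `status()`, `output()` captures stdout and stderr through pipes
>     // instead of letting the child inherit the parent's streams.
>     let output = Command::new("identify")
>         .args(["-format", "%w x %h\n", "input.pdf"])
>         .output()?;
>
>     // The captured bytes can then be inspected programmatically,
>     // here simply by counting the per-page lines that identify printed.
>     let report = String::from_utf8_lossy(&output.stdout);
>     println!("identify reported {} page(s)", report.lines().count());
>     Ok(())
> }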
Does the new implementation have all the features of the old one? At this stage, yes, it does the same basic thing. I intentionally did not want to add any extra command line arguments, so all auxiliary parameters are fixed in code, just as before. I believe those could be changed from use to use without any ill consequences; I just need to recompile every time.
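For a sense of how the `compare` half lines up with the old Bash loop, here is a rough sketch; the directory parameters and error handling are my own assumptions, not code lifted from the repository:

> use std::fs;
> use std::path::Path;
> use std::process::Command;
>
> // Run ImageMagick `compare` for every page image in `left_dir` against the
> // file with the same name in `right_dir`, writing the diff into `diff_dir`.
> fn compare_pages(left_dir: &Path, right_dir: &Path, diff_dir: &Path) -> std::io::Result<()> {
>     for entry in fs::read_dir(left_dir)? {
>         let left_path = entry?.path();
>         let name = match left_path.file_name() {
>             Some(name) => name.to_owned(),
>             None => continue,
>         };
>
>         // `compare` exits with a non-zero status when the images differ,
>         // so the status is ignored here and only the diff image is kept.
>         let _ = Command::new("compare")
>             .arg("-fuzz")
>             .arg("1000")
>             .arg(&left_path)
>             .arg(right_dir.join(&name))
>             .arg(diff_dir.join(&name))
>             .status()?;
>     }
>     Ok(())
> }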
Does the new implementation have any extra features? Yes, roughly half of the code is about user convenience, which we did not have before. Of course `clap` is the go-to crate for CLI argument parsing; then I made sure the input files and output directory are valid, and created the left/right/diff directories programmatically (which I did manually before, it was not even mentioned in the text notes). I also decided to clear these output directories automatically, because ImageMagick just overwrites everything, and that could lead to a mess when the file counts do not match between runs. I have included a little command line interaction to let the user decide whether to continue when some files are already present there.
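To give an idea of what that convenience half looks like, here is a minimal sketch of the argument parsing (using `clap`'s derive API) and the output-directory preparation; the struct, field and subdirectory names are my illustration, not the actual `lib.rs` contents:

> use std::fs;
> use std::io::{self, Write};
> use std::path::PathBuf;
>
> use clap::Parser;
>
> /// Compare two PDF files page by page.
> #[derive(Parser)]
> struct Args {
>     /// First ("left") PDF file
>     left_file: PathBuf,
>     /// Second ("right") PDF file
>     right_file: PathBuf,
>     /// Directory where the page images and diffs will be written
>     out_dir: PathBuf,
> }
>
> fn main() -> io::Result<()> {
>     let args = Args::parse();
>
>     // Basic validation of the input files before doing anything else.
>     for file in [&args.left_file, &args.right_file] {
>         if !file.is_file() {
>             eprintln!("input file not found: {}", file.display());
>             std::process::exit(1);
>         }
>     }
>
>     // Create the left/right/diff subdirectories, asking before clearing
>     // anything already there (ImageMagick would silently overwrite files).
>     for sub in ["left", "right", "diff"] {
>         let dir = args.out_dir.join(sub);
>         if dir.exists() && dir.read_dir()?.next().is_some() {
>             print!("{} is not empty, clear it and continue? [y/N] ", dir.display());
>             io::stdout().flush()?;
>             let mut answer = String::new();
>             io::stdin().read_line(&mut answer)?;
>             if !answer.trim().eq_ignore_ascii_case("y") {
>                 return Ok(());
>             }
>             fs::remove_dir_all(&dir)?;
>         }
>         fs::create_dir_all(&dir)?;
>     }
>     Ok(())
> }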
At this scale, it could also be considered a major new feature to have a pause when the left and right page counts are not equal. It is a well-known weakness that in these situations the comparison is defective most of the time, because pages do not pair up properly; coming up with something to mitigate this was a major driving force behind starting this development project. Of course, simply asking for confirmation does not solve the problem itself, but it at least lets the user do some manual correction before continuing.
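The check itself can be as simple as counting the generated page images on each side and asking before proceeding; again, this is only a sketch of the idea, with function names of my own choosing:

> use std::fs;
> use std::io::{self, Write};
> use std::path::Path;
>
> // Count the generated page images in a directory.
> fn page_count(dir: &Path) -> io::Result<usize> {
>     Ok(fs::read_dir(dir)?.filter_map(Result::ok).count())
> }
>
> // Pause for confirmation when the two sides have different page counts,
> // so the user can fix the pairing manually before the comparison runs.
> fn confirm_if_mismatched(left_dir: &Path, right_dir: &Path) -> io::Result<bool> {
>     let (left, right) = (page_count(left_dir)?, page_count(right_dir)?);
>     if left == right {
>         return Ok(true);
>     }
>     print!("page counts differ ({} vs {}), continue anyway? [y/N] ", left, right);
>     io::stdout().flush()?;
>     let mut answer = String::new();
>     io::stdin().read_line(&mut answer)?;
>     Ok(answer.trim().eq_ignore_ascii_case("y"))
> }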
Conclusion
I have to be honest, this has been a pretty simple project so far, yet it took me a day. Well, okay, I had never used `std::process` and some of the `std::fs` file manipulation functions before, so I had to read up on those at length. How quick would my second attempt be: six hours, maybe five? (I should try.) Would it make any major difference in my final judgment?
There is no doubt that this is a better product. It may not be *that* much better right now, but it is something that can be used as a foundation. The script, in comparison, is more like a sandcastle: it can serve as a model, but you can only build other sand-based architecture on top of it. Then again, that is not necessarily the fault of the sand, which is often used in concrete as well...
For those quick and dirty jobs I encounter sometimes, I will still use Bash as before, or maybe try something new in search of a better tool. The most important questions I should ask when starting a project could be the following:
- How much time do I have right now for this: 3 minutes, 30, or 300?
- How many times will I use this: 3, 30, 300 or more?
The saying goes: "right tool for the right job". Rust may be the right tool for many jobs, but not all of them.