____________________________________________________________________________ | | | ____ _____ | | | _ \ ___ ___ _ _| __/___ ___ _____ ___ ___ | | | |_| / _ \/ __| |/ | |_ / _ \| _ |_ _| |_ _/ _ \ | | | _ | |_|| |__| ' <| _| |_| | / | | _ | | |_| | | | |_| \_\___/\___|_|\_|_| \___/|_|_\ |_| |_| |___\___/ | | | |__________________________________________________________________________|

PDF comparator project intro
============================

First draft: 2024-12-15
Published:   2024-12-29
Last update: 2024-12-31


I embark on a Rust development journey to create a visual PDF comparison tool.

Motivation

I am sure many of us who have to deal with Portable Document Format (PDF) files have been in a situation where it would have been useful to compare two such files. I am sure this is even more true for folks who are used to the ease of the `diff` tool for plain text and source code files.

At first glance, every file has three main properties: name, size, and modification date. With a naive approach, it may seem that I can quickly compare those properties to check whether two files could be identical. Not much IT knowledge is required, though, to understand that file names and modification dates cannot be relied upon, and that two files might have the same size and still differ in their content. It is much less obvious that two files might be semantically the same even when they differ in name, date *and* size. For text files, extra white-space characters could have been added or deleted, for example. For visual data like PDF, there are also many ways to encode the same information in different data.

It would be even more useful to be able to answer the question: the files are obviously not the same, but *how* do they differ? We do this many times a day when comparing source code files with `diff`. It would be nice to have a tool that lets me inspect PDF files with similar ease. For example, I could be looking at two subsequent exports of my CV: oh shoot, what did I change last time?! I am too lazy to fire up the document editor to check. In another case, integrated circuit manufacturers can have very similar products on the market, differing only in one letter of their markings. They provide separate datasheets for the two, and although at first look these seem the same, they might genuinely differ only in a couple of words regarding control inputs, signal outputs, or values in timings and other electrical characteristics. Finally, it might also happen that one of my favorite board game, role-playing game, or tabletop wargame publishers releases a new version of some online document. Maybe they updated minute details in the associated rules or changed a point cost value here and there. How do I check what has been changed?

They say programmers are lazy in a good sense: they hate repetitive tasks and jump at the chance when something can be automated. For the odd occasion when two images or pages need a quick comparison, Alt+Tab or a similar hotkey in the window manager can get anyone out of trouble. I have also used GIMP in the past to compare single pages in difference mode. Lately I have experimented with ImageMagick to try and arrive at a usable script. I have found that it works great, but it is a bit clunky and requires juggling in the shell, for example when pages do not line up.

Roadmap

I have decided to quit fooling around in the shell and start a Rust project to address this problem. I am sure there are existing solutions, but one of my weaknesses is that I get bored very quickly while looking for things like new software if there seems to be no obvious choice. This project also lends itself to trying a couple of things in Rust that I have never attempted before.

My roadmap is the following:

  1. Simple straightforward solution to directly replace my current script, invoking the ImageMagick executable.
  2. It would be better if we used ImageMagick as a C library, through some kind of foreign function interface (FFI).
  3. As a much more elegant approach, we should use existing Rust crates to process the PDF files directly.

At this point, we are still talking about a CLI application that produces individual image files as output, one per page.
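As a rough illustration of the first roadmap step, here is a minimal sketch of how a Rust program could shell out to ImageMagick for a single page. It is not the project's actual code. It assumes the ImageMagick `compare` executable is on the `PATH`, that pages are selected with the `file.pdf[N]` suffix, and that `compare` exits with 0 for similar images, 1 for dissimilar ones, and anything else on error. The function names, the `PageResult` type, the 150 DPI density, and the `diff-NNN.png` output naming are all made up for this example.

```rust
use std::io;
use std::process::Command;

/// Outcome of comparing one page, derived from the exit status of `compare`.
#[derive(Debug, PartialEq)]
enum PageResult {
    Similar,
    Different,
    Failed,
}

/// Builds the argument list for comparing one zero-based page of two PDFs.
/// The density, metric, and output file naming are illustrative choices.
fn compare_args(
    left: &str,
    right: &str,
    page: usize,
    fuzz_percent: u8,
    out_dir: &str,
) -> Vec<String> {
    vec![
        // Rasterization resolution, set before the inputs are read.
        "-density".to_string(),
        "150".to_string(),
        // Report the number of differing pixels.
        "-metric".to_string(),
        "AE".to_string(),
        // Colors within this distance are treated as equal.
        "-fuzz".to_string(),
        format!("{fuzz_percent}%"),
        format!("{left}[{page}]"),
        format!("{right}[{page}]"),
        format!("{out_dir}/diff-{page:03}.png"),
    ]
}

/// Maps the exit code of `compare` to a result (assumed convention, see above).
fn classify(exit_code: Option<i32>) -> PageResult {
    match exit_code {
        Some(0) => PageResult::Similar,
        Some(1) => PageResult::Different,
        _ => PageResult::Failed,
    }
}

/// Runs `compare` for one page and writes one diff image into `out_dir`.
fn compare_page(
    left: &str,
    right: &str,
    page: usize,
    fuzz_percent: u8,
    out_dir: &str,
) -> io::Result<PageResult> {
    let status = Command::new("compare")
        .args(compare_args(left, right, page, fuzz_percent, out_dir))
        .status()?;
    Ok(classify(status.code()))
}
```

Calling `compare_page("old.pdf", "new.pdf", 0, 5, "out")` in a loop over the page indices would produce one image file per page. Keeping the argument construction in its own function makes it possible to check the command line without having ImageMagick installed.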

  1. The final CLI version should produce a single PDF file as output.

The real deal would be to have a GUI application at the end. I am envisioning something like a cross between a PDF viewer and a graphical diff tool (such as Meld, which I like to use).

  1. Basic viewer functionality to display the diff, with keyboard commands to navigate, but all options still provided through command-line arguments.
  2. Standalone GUI frontend with all options accessible through graphical menus. Proposed functionality would include:
    • Ability to switch viewing mode between left file, right file, and diff.
    • Three-pane display: viewing all three modes at the same time side-by-side.
    • Two-pane display: this would be more natural, but it needs some more thought and experiments to show clearly what changed without hurting readability.
    • Control of the similarity threshold.
    • Page offset: comparison goes awry when whole pages are inserted or deleted (so the page counts no longer match), so it is a must-have function to be able to insert blank pages on either the left or the right side.
    • Just an idea: offset could be relevant spatially within a single page as well (think about a whole paragraph moved), so it would be interesting to be able to move (and maybe scale?) the page, similarly to how it would be done in GIMP difference mode.
  3. I have left one of the most important features to the end, because this has to be considered in its own right: textual diff within PDF data.

    ImageMagick converts the PDF page to a raster image and compares pixel by pixel. This is a good approach if we are only interested in the general area where things have changed, but it can be annoying upon closer inspection, when it turns out that a whole paragraph was marked because an extra word was inserted near the beginning, which caused the line wrapping to change. In cases where textual data is available, because the PDF was created with a text editor or it contains optical character recognition (OCR) data, the text should be compared using principles similar to those of a regular textual diff.

    My first approach would be to let the user mark a frame left and right to compare textually. Later this could be improved to use the raster differences as heuristics to trigger the text comparison automatically.
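To make the difference between raster and textual comparison more tangible, here is a minimal sketch of a word-level diff based on the longest common subsequence (LCS). It is only an illustration of the general principle, not the algorithm the project will necessarily use. It assumes the text of the two marked frames has already been extracted as plain strings, and the `Op` type and the `word_diff` function are hypothetical names for this example.

```rust
/// One step of a word-level edit script.
#[derive(Debug, PartialEq)]
enum Op<'a> {
    Keep(&'a str),
    Delete(&'a str),
    Insert(&'a str),
}

/// Compares two texts word by word, ignoring how the words are wrapped
/// into lines, and returns an edit script from `left` to `right`.
fn word_diff<'a>(left: &'a str, right: &'a str) -> Vec<Op<'a>> {
    // Splitting on any white-space makes line breaks irrelevant.
    let a: Vec<&str> = left.split_whitespace().collect();
    let b: Vec<&str> = right.split_whitespace().collect();

    // lcs[i][j] holds the LCS length of a[i..] and b[j..].
    let mut lcs = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table from the start and emit the edit operations.
    let (mut i, mut j) = (0, 0);
    let mut ops = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            ops.push(Op::Keep(a[i]));
            i += 1;
            j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            ops.push(Op::Delete(a[i]));
            i += 1;
        } else {
            ops.push(Op::Insert(b[j]));
            j += 1;
        }
    }
    // Whatever is left on either side is a pure deletion or insertion.
    ops.extend(a[i..].iter().map(|&w| Op::Delete(w)));
    ops.extend(b[j..].iter().map(|&w| Op::Insert(w)));
    ops
}
```

Given "The quick fox jumps\nover the dog" on the left and "The quick brown fox\njumps over the dog" on the right, the only non-`Keep` operation is a single `Insert("brown")`, even though the line wrapping changed. A pixel-by-pixel comparison of the same two paragraphs would mark both lines. The quadratic table is fine for a marked frame of text, but whole documents would likely call for a more economical diff algorithm.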

Current status

The project is currently in its first burst of active development.

The project's GitHub repository can be found here. Look for the dedicated version branches.

Other blog posts published in this series: