____________________________________________________________________________ | | | ____ _____ | | | _ \ ___ ___ _ _| __/___ ___ _____ ___ ___ | | | |_| / _ \/ __| |/ | |_ / _ \| _ |_ _| |_ _/ _ \ | | | _ | |_|| |__| ' <| _| |_| | / | | _ | | |_| | | | |_| \_\___/\___|_|\_|_| \___/|_|_\ |_| |_| |___\___/ | | | |__________________________________________________________________________|

Detecting time gaps in CSV data
===============================

First draft: 2024-11-18
Published:   2024-12-27

The csv-detect-missing program is a CLI application I wrote in Rust. Its primary purpose is to inspect CSV data that contains a timestamp field, calculate the time difference between subsequent lines, and report when that difference is greater than a set gap. It started from the simple need to examine sensor time series data for faulty periods, but this version became much more versatile than I originally imagined.

TL;DR: Project GitHub repository can be found here.

Motivation

The first incarnation of the concept came when I still worked at Woodspring full time. Around 2020, when we joined the MiniStor EU project consortium, I was given the responsibility of representing the company on the IT front. One of my tasks was to upload sensor data from our demonstration building to a central server through a REST API. I automated the pipeline with Bash scripts, but I needed a way to confirm that the data had arrived intact. The API supported queries as well, so I had the idea to download one record per hour (the original dataset had much higher temporal resolution), put the JSON results in a CSV, and calculate the difference between the individual lines.

This is the help text from the original implementation, which was done in Rust as well:

> Description:
> Tool to inspect csv data, looking for time gaps larger than 1 hour between
> subsequent lines.
>
> Usage:
> csv_detect_missing [-u] 
> - Input "csv_file" must be a text file starting with a valid RFC3339
> timestamp, eg. "yyyy-mm-ddTHH:MM:SSZ".
>   Separator character can either be ',' or ';'.
> - With option "-u", the expected format is set to unix timestamp with
> milliseconds.
>
> Created at Woodspring, 2021.

The program was created for that exact purpose, so it had limited features, but it served me well in the project.

Now it is 2024 and I find myself in need of a similar capability again. In fact, I could have used the exact same program with some input manipulation, but I no longer have the source files, and I could only find binaries compiled for ARM (having deployed the original on Raspberry Pi devices). It is time to adopt the open-source philosophy, I thought, and made a new implementation of the program.

Specifications

Before I continue, let me post the shorter (-h) help text from the new program (version 1.0.0), so that it hopefully fits on the same screen as the previous one for comparison.

> Tool to inspect CSV data, looking for (time) gaps between subsequent lines.
>
> Usage: csv-detect-missing [OPTIONS] 
>
> Arguments:
>     Input file, or '-' to read from STDIN
>
> Options:
>   -d             Input delimiter [default: ,]
>   -i             Field index [default: 1]
>   -f            Format [default: uint]
>       --gt         'Greater-than' comparison behavior (default)
>       --ge         'Greater-or-equal' comparison behavior
>       --lt         'Less-than' comparison behavior
>       --le         'Less-or-equal' comparison behavior
>   -c           Comment marker [default: #]
>   -a                    Allow empty or invalid lines
>   -D, --diff []  Diff mode (default): one delimiter-separated line 
>                         per gap [default: ,]
>   -F, --filter          Filter mode: keep only offending lines
>   -v                    Verbose mode: print debug header
>   -h, --help            Print help (see more with '--help')
>   -V, --version         Print version
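
The '-' convention from the help text above can be handled with a small helper. What follows is only a minimal sketch using the standard library, not the code from the repository; the names `open_input` and `read_lines` are hypothetical.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Opens `path` for buffered reading, treating "-" as standard input.
/// (Hypothetical helper, not taken from the actual project.)
pub fn open_input(path: &str) -> io::Result<Box<dyn BufRead>> {
    if path == "-" {
        Ok(Box::new(BufReader::new(io::stdin())))
    } else {
        Ok(Box::new(BufReader::new(File::open(path)?)))
    }
}

/// Reads every line from `path` (or from STDIN for "-") into memory.
/// The first I/O error aborts the read and is returned to the caller.
pub fn read_lines(path: &str) -> io::Result<Vec<String>> {
    open_input(path)?.lines().collect()
}
```

Returning a boxed `BufRead` keeps the rest of the processing code unaware of where the data comes from, which is what makes a pipeline such as `grep ... | csv-detect-missing -` work with the same code path as a regular file.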

The general idea was that if I wanted to release this as open source this time, I should treat it as an exercise and do it properly, with all the bells and whistles, so that I could even show it as a kind of reference when needed.

Following that mentality, at first I concentrated only on coming up with all the features I wanted to have. Some of my ideas, in order of priority:

  1. The field index should be selectable.
  2. I wanted support for all four of the most used delimiter characters.
  3. I wanted the comment character to be configurable as well.
  4. The previous two options may accept a string rather than a single character.
  5. Format support should be extensible. Also, as the unix timestamp is just an integer number anyway, why not allow the program to be used on any other integer field?
  6. It is always a struggle to decide whether a comparison should work like greater-than or greater-or-equal. Let the user decide, and why not include less-than and less-or-equal as well? Those could be useful in some situations.
  7. I often use grep to filter the input data, so the program should be able to read from the standard input, to work on data fed through a shell pipe.
  8. I was never sure if I liked the original output where I showed one line per gap with the two timestamps. Now I wanted to try something new with the filter mode.
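
Items 1 and 3 to 6 above can be illustrated with a short sketch for the integer (uint) format. This is not the code from the repository: the names `Comparison`, `Gap` and `find_gaps`, the `limit` parameter, and the 1-based field index are all assumptions made for the example. The sketch also silently skips empty, commented and unparsable lines, which may differ from how the real program treats them without -a.

```rust
/// Comparison behavior, mirroring the --gt / --ge / --lt / --le flags.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Comparison {
    Gt,
    Ge,
    Lt,
    Le,
}

impl Comparison {
    /// Returns true when `diff` counts as an offending gap relative to `limit`.
    pub fn matches(self, diff: u64, limit: u64) -> bool {
        match self {
            Comparison::Gt => diff > limit,
            Comparison::Ge => diff >= limit,
            Comparison::Lt => diff < limit,
            Comparison::Le => diff <= limit,
        }
    }
}

/// One reported gap: the two neighboring values and their difference.
#[derive(Debug, PartialEq)]
pub struct Gap {
    pub previous: u64,
    pub current: u64,
    pub diff: u64,
}

/// Compares the selected integer field of each line with that of the
/// previous data line. `index` is assumed to be 1-based. Empty lines,
/// lines starting with `comment`, and lines whose field does not parse
/// as an unsigned integer are skipped (an assumption of this sketch).
pub fn find_gaps(
    input: &str,
    delimiter: &str,
    index: usize,
    comment: &str,
    limit: u64,
    cmp: Comparison,
) -> Vec<Gap> {
    let mut gaps = Vec::new();
    let mut previous: Option<u64> = None;
    for line in input.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with(comment) {
            continue;
        }
        let field = line.split(delimiter).nth(index.saturating_sub(1));
        let value = match field.and_then(|f| f.trim().parse::<u64>().ok()) {
            Some(v) => v,
            None => continue,
        };
        if let Some(prev) = previous {
            // Absolute difference, so out-of-order values do not underflow.
            let diff = value.abs_diff(prev);
            if cmp.matches(diff, limit) {
                gaps.push(Gap { previous: prev, current: value, diff });
            }
        }
        previous = Some(value);
    }
    gaps
}
```

With the input lines 100, 160, 400 and 460 and a limit of 60, `Comparison::Gt` reports only the step from 160 to 400, while `Comparison::Ge` reports all three steps. That difference is exactly why the choice is left to the user.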

Implementation

For a command-line interface (CLI) application, a good way to start is with the usage and help text. In Rust, when the required command arguments start to feel complicated enough, one usually reaches for a helper crate like `clap`, which does the parsing automatically. It abstracts away the menial work but highlights the importance of really thinking through all the possible options. Once I had polished the clap `Command` builder so that it produced the help above and worked the way I wanted, I basically had a well-defined specification for the black box.

Another important concept that Rust introduced me to, and that I am starting to appreciate more and more, is test-driven development (TDD). When I had the feature set settled, I thought: "Well, a couple of quite obscure parameter combinations are possible now, how will I even test some of those?" So the second step was to produce a few test CSV files. This quickly grew into a whole suite, because each file can only test one possible delimiter, for example. I also wanted one or two dummy files for demo purposes, where it is easier to see what is happening than in a wall of dense sensor data.
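
The "one file per delimiter" problem can also be approached with a table-driven test that generates a small dummy data set for each delimiter. The sketch below is only an illustration of that idea, not the project's actual test suite: the `DELIMITERS` list is a guess (the post does not say which four characters are supported), and `dummy_csv` and `field` are hypothetical helpers.

```rust
/// Delimiters covered by this sketch (an assumed set, not from the project).
pub const DELIMITERS: [&str; 4] = [",", ";", "\t", "|"];

/// Builds a tiny three-line dummy data set using the given delimiter.
pub fn dummy_csv(delimiter: &str) -> String {
    [(100, "a"), (160, "b"), (400, "c")]
        .iter()
        .map(|(t, label)| format!("{t}{delimiter}{label}\n"))
        .collect()
}

/// Extracts the 1-based field `index` from `line`, if present.
pub fn field<'a>(line: &'a str, delimiter: &str, index: usize) -> Option<&'a str> {
    line.split(delimiter).nth(index.checked_sub(1)?)
}

#[cfg(test)]
mod tests {
    use super::*;

    /// The same expectation is checked once per delimiter.
    #[test]
    fn first_field_parses_for_every_delimiter() {
        for delimiter in DELIMITERS {
            let values: Vec<u64> = dummy_csv(delimiter)
                .lines()
                .filter_map(|l| field(l, delimiter, 1)?.parse::<u64>().ok())
                .collect();
            assert_eq!(values, [100, 160, 400], "delimiter {delimiter:?}");
        }
    }
}
```

Generated inputs like these complement real sample files rather than replace them: the dummy data is easy to read, while files with actual sensor data still catch the surprises nobody thought to generate.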

The preparations, including setting up the help, CLI parsing and test files, took me two full days' worth of work, all before writing a single line of processing code. Considering that v1.0.0 represents around 40 hours of effort in total (essentially 5 workdays), that means 40% contemplation and 60% action. This sounds like a healthy balance to me.

Future plans

I can put v1.0.0 to good use as it is, but I have gathered a couple of ideas for further expansion in the project TODO file. Some of those seem logical and might be worth pursuing, others not so much: I will have to consider one by one whether they could actually be used for something, or would only be added for completeness and for the sake of the exercise.

The topic from that list I am most interested in right now is ISO 8601 support. I recently stumbled upon its time period syntax in the YouTube API project and accidentally made a partial parser for it, so it would be quite worthwhile to investigate the topic further.
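
To give an idea of what such a partial parser might look like, here is a minimal sketch that converts a small subset of the ISO 8601 period syntax into seconds. It is not the parser mentioned above; the function name `parse_duration_secs` is hypothetical. It only accepts whole-number days, hours, minutes and seconds, rejects years, months, weeks and fractions, and does not enforce the order of the components.

```rust
/// Parses a subset of ISO 8601 durations, "P[nD][T[nH][nM][nS]]", into
/// seconds. Returns None for anything outside that subset.
/// (Hypothetical sketch; component order is not enforced.)
pub fn parse_duration_secs(text: &str) -> Option<u64> {
    let rest = text.strip_prefix('P')?;
    if rest.is_empty() || rest.ends_with('T') {
        return None;
    }
    let mut total: u64 = 0;
    let mut in_time = false; // set once the 'T' separator has been seen
    let mut seen_component = false;
    let mut digits = String::new();
    for ch in rest.chars() {
        match ch {
            'T' if !in_time && digits.is_empty() => in_time = true,
            '0'..='9' => digits.push(ch),
            _ => {
                // Seconds per unit; designators are only valid on the
                // correct side of the 'T' separator.
                let factor: u64 = match (ch, in_time) {
                    ('D', false) => 86_400,
                    ('H', true) => 3_600,
                    ('M', true) => 60,
                    ('S', true) => 1,
                    _ => return None,
                };
                // A designator without a preceding number fails here.
                let n: u64 = digits.parse().ok()?;
                total = total.checked_add(n.checked_mul(factor)?)?;
                digits.clear();
                seen_component = true;
            }
        }
    }
    // Trailing digits without a designator are an error.
    if digits.is_empty() && seen_component {
        Some(total)
    } else {
        None
    }
}
```

For example, `parse_duration_secs("PT1H30M")` yields `Some(5400)`, while `parse_duration_secs("P1M")` yields `None`, because a month has no fixed length in seconds and is deliberately left out of the subset.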