Met ODP client specs
====================
First draft: 2025-01-02
Published: 2025-01-08
I am looking at the requirements for a CLI program to download the latest weather station dataset from the Hungarian national meteorological platform, with various output modes and filtering options.
This is part of a series of posts; the project intro article can be found here.
Preface
I have written about my motivations in the intro article, so I am skipping my usual opening section this time. I will also try something new with this project: at the time of writing I only have a rough idea of what I want to make, and I would like to cast it in concrete right here, before your eyes, so you can follow my thought process all the way. I hope it will prove to be an interesting read.
The general outline is the following. We have access to a file address at the HungaroMet ODP that points to the latest measurement table at all times. Our target is a command line application that has to download the CSV and then load the records into a database. It sounds pretty straightforward, but the bar for a minimum viable product will be set a little higher, so that the program ends up being useful for a wider audience. To achieve this, extra features will be included, like filtering and options for various output formats.
Concept scripts
Before diving into the details, let me share two Bash scripts that I set up a couple of months ago, when I first had the idea for this project. I wanted to see how this would work, and at the same time start the data acquisition right away, even if only in CSV form. I plan to import the records I have been saving with these scripts into the database once it is ready.
`met_odp_download_latest.sh`:
> #!/bin/bash
> set -e
>
> # Script to download latest 10-minute synoptic data file from odp.met.hu
> # Big thanks to the source: HungaroMet Nonprofit Zrt.
>
> dir=$(dirname "$0")
> tmpfile="$dir/latest.csv.zip"
>
> wget -nv "https://odp.met.hu/weather/weather_reports/synoptic/hungary/10_minutes/csv/HABP_10M_SYNOP_LATEST.csv.zip" -O "$tmpfile"
> unzip -q "$tmpfile" -d "$dir/data/"
`met_odp_process.sh`:
> #!/bin/bash
> set -e
>
> date -Isec
> timestamp=$(date "+%Y%m%d")
>
> dir=$(dirname "$0")
> data="$dir/data/*.csv"
>
> for file in $data; do
>     grep "13704;Sopron" "$file" | cut -d ';' -f 1,7,9,11,13,15,21,25,29,31,33,35,37,39,51 -s >> "$dir/data_sopron.csv"
> done
>
> tar -czf "$dir/backup/met_odp_10m_synop_$timestamp.tar.gz" --remove-files -C "$dir/data" .
Input
Let's start with the first script, which is basically just a `wget` invocation to download the latest file, followed by unzipping it. The result is a CSV which generally contains one record for each of the weather stations that belong to the official network. I am using the dataset with the highest temporal resolution out of those available at the ODP service, so the download script is scheduled to run every 10 minutes to match.
The following is an excerpt from the `HABP_10M_SYNOP_20250102184805.csv` file, downloaded just moments ago. I start with the station nearest to me, located on the Kuruc-hill in Sopron, because that is the one I am most interested in, and it will also be used in further examples.
> Time;StationNumber;StationName ;Latitude;Longitude;Elevation; r; Q_r; t; Q_t; ta;Q_ta; tn;Q_tn; tx;Q_tx; v; Q_v; p; Q_p; u; Q_u; sg;Q_sg; sr;Q_sr; suv;Q_suv; fs;Q_fs; fsd;Q_fsd; fx;Q_fx; fxd;Q_fxd; fxm;Q_fxm; fxs;Q_fxs; et5;Q_et5; et10;Q_et10; et20;Q_et20; et50;Q_et50;et100;Q_et100; tsn;Q_tsn; tviz;Q_tviz;EOR
> ...
> 202501021840; 13704;Sopron Kuruc-domb ; 47.6783; 16.6022; 232.8; 0.0; ; -1.8; ; -1.8; ; -1.8; ; -1.8; ; -999; ; -999; ; 100; ; -999; ; 0.0; ; -999; ; 0.3; ; 72; ; 0.5; ; 79; ; 37; ; 23; ; -999; ; -999; ; -999; ; -999; ; -999; ; -1.8; ; -999; ;EOR
> 202501021840; 13711;Fertőrákos ; 47.7147; 16.6658; 116.8; 0.0; ; -1.0; ; -1.0; ; -1.0; ; -1.0; ; -999; ; 997.0; ; 98; ; 77.54; ; -999; ; -999; ; 0.0; ; 329; ; 0.4; ; 309; ; 32; ; 48; ; 1.5; ; 1.5; ; 2.2; ; 4.2; ; 7.4; ; -1.0; ; -999; ;EOR
> 202501021840; 14707;Sopronhorpács ; 47.4806; 16.7292; 198.7; 0.0; ; -1.0; ; -1.0; ; -1.0; ; -1.0; ; -999; ; -999; ; 97; ; -999; ; -999; ; -999; ; 1.3; ; 16; ; 2.7; ; 28; ; 30; ; 21; ; -999; ; -999; ; -999; ; -999; ; -999; ; -999; ; -999; ;EOR
> 202501021840; 14805;Gór ; 47.3575; 16.7964; 166.0; 0.0; ; -0.5; ; -0.5; ; -0.5; ; -0.5; ; -999; ; -999; ; 91; ; -999; ; -999; ; -999; ; -999; ;-999; ; -999; ;-999; ;-999; ;-999; ; -999; ; -999; ; -999; ; -999; ; -999; ; -999; ; -999; ;EOR
> 202501021840; 15310;Szombathely ; 47.1983; 16.6478; 200.1; 0.0; ; 2.7; ; 2.8; ; 2.7; ; 2.9; ; -999; ; 986.4; ; 81; ; -999; ; 1.1; ; -999; ; 1.3; ; 97; ; 2.0; ; 96; ; 33; ; 55; ; 1.1; ; 1.3; ; 2.0; ; 3.5; ; 5.7; ; 1.1; ; -999; ;EOR
> 202501021840; 15405;Sárvár ; 47.2428; 16.9083; 153.0; 0.0; ; 0.3; ; 0.2; ; 0.2; ; 0.3; ; -999; ; -999; ; 94; ; -999; ; -999; ; -999; ; -999; ;-999; ; -999; ;-999; ;-999; ;-999; ; -999; ; -999; ; -999; ; -999; ; -999; ; -999; ; -999; ;EOR
> ...
Records in a file seem to be sorted by Time and StationNumber, and there are cases where the same station is listed multiple times with different timestamps (this might be a way to catch up after some kind of delay or outage). Notice that every record ends with an "EOR" field. Every variable comes with a corresponding auxiliary field starting with "Q_"; these are reserved for development purposes and are generally empty. Missing values are represented as "-999".
The second script, marked "process", is scheduled to run once per day, filtering to the Sopron station and appending those records to the `data_sopron.csv` file. At the end, the original CSV files are compressed into an archive in a backup directory and removed from the data directory.
With the `cut` command, the script also filters to certain columns. The selected indexes correspond to the following fields:
> 1; 7; 9; 11; 13; 15; 21; 25; 29; 31; 33; 35; 37; 39; 51
> Time; r; t; ta; tn; tx; u; sr; fs; fsd; fx; fxd; fxm; fxs; tsn
I am sorry I had to use the horizontal scrollbox again; the next one will be a vertical table, I promise. Column filtering was done here so that only those variables remain that are not missing for the Sopron station.
Let's see the description and measurement unit for each field, according to an official companion document that is also accessible through the ODP. Here I only omitted the EOR and reserved Q fields, and included sample values from Sopron and the nearby Fertőrákos (which has a different set of sensors).
+----+---------------+--------------------+-------+-------------------+--------------+
| ID | Name          | Description        | Unit  | Sopron            | Fertőrákos   |
+----+---------------+--------------------+-------+-------------------+--------------+
|  1 | Time          |                    |       | 202501021840      | 202501021840 |
|  2 | StationNumber |                    |       | 13704             | 13711        |
|  3 | StationName   |                    |       | Sopron Kuruc-domb | Fertőrákos   |
|  4 | Latitude      |                    |       | 47.6783           | 47.7147      |
|  5 | Longitude     |                    |       | 16.6022           | 16.6658      |
|  6 | Elevation     |                    | m     | 232.8             | 116.8        |
|  7 | r             | Rain               | mm    | 0.0               | 0.0          |
|  9 | t             | Temperature        | °C    | -1.8              | -1.0         |
| 11 | ta            | Temperature (avg)  | °C    | -1.8              | -1.0         |
| 13 | tn            | Temperature (min)  | °C    | -1.8              | -1.0         |
| 15 | tx            | Temperature (max)  | °C    | -1.8              | -1.0         |
| 17 | v             | Visibility         | m     | -999              | -999         |
| 19 | p             | Pressure           | hPa   | -999              | 997.0        |
| 21 | u             | Relative humidity  | %     | 100               | 98           |
| 23 | sg            | Gamma radiation    | nSv/h | -999              | 77.54        |
| 25 | sr            | Solar radiation    | W/m^2 | 0.0               | -999         |
| 27 | suv           | UV radiation       | MED/h | -999              | -999         |
| 29 | fs            | Wind speed (avg)   | m/s   | 0.3               | 0.0          |
| 31 | fsd           | Wind direction     | °     | 72                | 329          |
| 33 | fx            | Max gust speed     | m/s   | 0.5               | 0.4          |
| 35 | fxd           | Max gust direction | °     | 79                | 309          |
| 37 | fxm           | Max gust minute    | '     | 37                | 32           |
| 39 | fxs           | Max gust second    | "     | 23                | 48           |
| 41 | et5           | Ground temp (5cm)  | °C    | -999              | 1.5          |
| 43 | et10          | Ground temp (10cm) | °C    | -999              | 1.5          |
| 45 | et20          | Ground temp (20cm) | °C    | -999              | 2.2          |
| 47 | et50          | Ground temp (50cm) | °C    | -999              | 4.2          |
| 49 | et100         | Ground temp (1m)   | °C    | -999              | 7.4          |
| 51 | tsn           | Surface temp (min) | °C    | -1.8              | -1.0         |
| 53 | tviz          | Water temperature  | °C    | -999              | -999         |
+----+---------------+--------------------+-------+-------------------+--------------+
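As a quick look ahead to the implementation, here is a rough sketch of how such a record could be modelled in the planned Rust crate. The type and field names are my own placeholders, not a final design; the point is that the "-999" sentinel becomes a proper missing value while parsing.

> // Placeholder record type; only a few fields are spelled out here.
> struct StationRecord {
>     time: String,          // 1: Time, e.g. "202501021840"
>     station_number: u32,   // 2: StationNumber, e.g. 13704
>     station_name: String,  // 3: StationName
>     latitude: f64,         // 4
>     longitude: f64,        // 5
>     elevation: f64,        // 6, m
>     r: Option<f64>,        // 7: rain, mm
>     t: Option<f64>,        // 9: temperature, °C
>     // ... the remaining sensor fields follow the same Option pattern ...
>     tviz: Option<f64>,     // 53: water temperature, °C
> }
>
> // Map the ODP missing-value marker to None, everything else to a number.
> fn parse_value(raw: &str) -> Option<f64> {
>     match raw.trim() {
>         "-999" => None,
>         other => other.parse().ok(),
>     }
> }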
Output
Now what do we want to do with all this data? Some possibilities are the following (a hypothetical command line sketch follows the list):
A. Do nothing yet; we just want to download the zipped archives and keep them for later. The program has to do some manipulation on the file name though, as the default "HABP_10M_SYNOP_LATEST" is not suitable here.
B. We just want the program to unzip the archive and save the payload somewhere. The name of the CSV file inside is already marked with a timestamp.
C. We want to create a continuous CSV by appending to a file. The program has to handle the field headers accordingly, printing them only when the output file does not exist yet (or is empty).
D. We do not want to write to a file, but have the program print the data to STDOUT. The output can then be piped into other commands for further processing.
E. We want the program to alter the default CSV format. I am thinking about options like changing the delimiter, replacing "-999" with "NULL" or an empty string, and removing the extra spaces used for formatting.
With the previous case we entered *processing* territory, and there may be situations where these kinds of operations fail due to some defect present in the input file. The program should have routines to inspect the CSV content and detect if something does not seem right, like a missing EOR, an incorrect number of fields, or unexpected characters. There are a couple of possible scenarios regarding what to do when a failure occurs. Instead of simply aborting, we could log a warning but try to recover by repeating the download, skipping the offending lines, or setting some fields to NULL.
Validation should be turned on by default in every situation where the program tries to interpret the data. It could also be requested optionally even in simpler cases like A and B above. Erroneous input files should be kept for further inspection.
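To make this a bit more concrete, here is a minimal per-line check sketch. The expected field count of 55 (columns 1-54 plus the closing EOR marker) is my own reading of the excerpt above and should be confirmed against the official documentation.

> // A minimal validation sketch; the rules here are assumptions derived from
> // the sample excerpt, not from the official format specification.
> fn check_line(line: &str) -> Result<(), String> {
>     let fields: Vec<&str> = line.split(';').collect();
>     if fields.last().map(|f| f.trim()) != Some("EOR") {
>         return Err("record does not end with EOR".to_string());
>     }
>     if fields.len() != 55 {
>         return Err(format!("unexpected field count: {}", fields.len()));
>     }
>     Ok(())
> }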
F. We need the data converted to some alternative file format, like JSON, YAML, XML, serialized binary, etc. This is only a theoretical possibility, and I wanted to include it here in its own place in the thought sequence of this list. In reality, I do not plan to include any of these in the implementation, at least not in the short term, unless an actual need emerges. With STDOUT support, the application should be able to feed external tools to achieve these kinds of conversions.
G. We need to store the data in a database. At this time I want PostgreSQL support, and it has to be investigated what is needed to extend compatibility to other systems. Most likely this will be similar to the previous point: we should be somewhat prepared at the library level, but no alternatives are planned short-term.
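Here is the command line sketch promised above: a hypothetical set of flags using the `clap` derive API. None of the names are final, and it only covers cases A through E plus G; it is merely meant to show that the modes could live side by side as independent options.

> // Hypothetical flag layout for the output modes (A-G above), sketched with
> // the `clap` derive API. Flag names and types are placeholders, not final.
> use clap::Parser;
>
> #[derive(Parser, Debug)]
> #[command(name = "met-odp-client")]
> struct Cli {
>     /// Keep the downloaded zip, renamed with a timestamp, in this directory (case A).
>     #[arg(long)]
>     save_zip: Option<std::path::PathBuf>,
>
>     /// Save the extracted CSV payload into this directory (case B).
>     #[arg(long)]
>     save_csv: Option<std::path::PathBuf>,
>
>     /// Append the records to a continuous CSV file, writing the header only once (case C).
>     #[arg(long)]
>     append: Option<std::path::PathBuf>,
>
>     /// Print the records to STDOUT instead of writing a file (case D).
>     #[arg(long)]
>     stdout: bool,
>
>     /// Use this delimiter on textual output instead of ';' (case E).
>     #[arg(long)]
>     delimiter: Option<char>,
>
>     /// Store the records in the configured PostgreSQL database (case G).
>     #[arg(long)]
>     database: bool,
> }
>
> fn main() {
>     let cli = Cli::parse();
>     println!("{cli:?}");
> }

Whether such flags can be combined freely is exactly the question raised below.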
Lastly, I would like to bring up an additional aspect we should consider. Are the points above mutually exclusive? Is it a valid use case to have the program write to the database AND output a subset as a file AND save the original input zip as a backup? It might sound like a bit of a stretch, but in the next section I will propose how this could be done without overcomplicating our interface.
Input revisited
There was a key term mentioned in the previous section: keeping some files for later. How will these files be processed at that time? Granted, if they were set aside as defective (failed validation), one would inspect them manually, apply corrections or discard parts. Then what? My point is that more than likely we will want to process them the same way as usual, as would have happened without the failure.
By recognizing this, we can come to the conclusion that our program needs to accept local files for input too. Going one step further, we could make good use of STDIN as well to allow maximal flexibility. This way, even if we did not support multiple output paths internally, the tool could be used with pipes and external programs to create the complex setups mentioned before.
On the input side we should accept zip archives just like the original, raw CSV in its unaltered form, and possibly almost any other format that the program is capable of producing on its standard output. There are some edge cases that could be problematic, like filtering out some columns while also omitting the header row; we obviously cannot make heads or tails of that.
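As a small illustration under my own assumptions, the input side could be abstracted into a single reader, with "-" (or no path at all) meaning STDIN. Zip input would first be extracted to a temporary file, since reading an archive needs a seekable source, so it is left out of this sketch.

> use std::fs::File;
> use std::io::{self, BufRead, BufReader};
>
> // Open the requested input source as one unified buffered reader.
> fn open_input(path: Option<&str>) -> io::Result<Box<dyn BufRead>> {
>     match path {
>         None | Some("-") => Ok(Box::new(BufReader::new(io::stdin()))),
>         Some(p) => Ok(Box::new(BufReader::new(File::open(p)?))),
>     }
> }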
Filters
Let's talk about filtering rows first. I can think of two reasons why someone would want that. The first and most evident situation is when we want only one station, or a certain subset, to appear on the output. I did that with my original script, remember? It turns out there is also a need to be able to specify more than one station to filter to.
Today, just as I prepared the input samples for this article, I realized upon closer inspection that some of the sensors that show -999 for Sopron are in fact available at the Fertőrákos station (just a couple of kilometers north of here). In addition to being able to monitor a larger set of variables, I can also improve my data quality by storing records from multiple stations in the vicinity. The application should support these use cases.
What if I do not want to store the data, just display the latest record, but the input file contains multiple rows from the same station? We should be able to activate something similar to the `distinct` keyword in SQL, which would keep only the latest record for every station.
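A minimal sketch of that DISTINCT-like pass, assuming the records have already been parsed into (station number, timestamp, raw line) tuples:

> use std::collections::HashMap;
>
> // Keep only the newest record per station.
> fn latest_per_station(records: Vec<(u32, u64, String)>) -> Vec<(u32, u64, String)> {
>     let mut latest: HashMap<u32, (u64, String)> = HashMap::new();
>     for (station, time, line) in records {
>         // Replace the stored record only if this one is newer (or the station is new).
>         if latest.get(&station).map_or(true, |(t, _)| time > *t) {
>             latest.insert(station, (time, line));
>         }
>     }
>     latest.into_iter().map(|(s, (t, l))| (s, t, l)).collect()
> }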
The program should also support filtering columns. The raw format includes everything, and a cleaned format should get rid of EOR and the Q fields. When filtering to only one station, it might not be that useful to keep the station info in columns 2-6. Apart from these, we still have more than 20 fields with sensor data that we might not want, probably including some static "-999"s that we most definitely do not want to keep. For example, I have only found two stations with water temperature sensors installed; for every other station it is just a waste of storage. We will convert those to NULL in the database, but for textual output the program should provide the option to cherry-pick which columns to keep (or, in reverse, which to discard).
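For the column side, a name-based cherry-pick could be as simple as the following sketch (header names are the ones in the table above; discarding instead of keeping would just invert the filter):

> // Find the positions of the requested columns in the header row.
> fn column_indexes(header: &str, keep: &[&str]) -> Vec<usize> {
>     header
>         .split(';')
>         .enumerate()
>         .filter(|(_, name)| keep.contains(&name.trim()))
>         .map(|(i, _)| i)
>         .collect()
> }
>
> // Project a data line onto the selected columns.
> fn project(line: &str, indexes: &[usize]) -> String {
>     let fields: Vec<&str> = line.split(';').collect();
>     indexes
>         .iter()
>         .filter_map(|&i| fields.get(i).copied())
>         .collect::<Vec<_>>()
>         .join(";")
> }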
Database
I would like to share my idea of how the DB interaction should work. The easiest would be to have only one table with the exact same fields as the raw input, maybe without the EOR. I do not want to save the Q fields either; sorry if someone would like to look for undocumented gems there, that has to be done some other way. Also, the static station info fields do not have to be stored every time, so I will put those in a separate table. Let's call the two tables `data` and `station`.
I am usually very hesitant to use externally generated identifiers, so a new auto-incremented ID will be used internally instead of the "StationNumber". We will treat any change in the station info as a potential break in continuity (i.e. the same number with a different name and/or coordinates will be treated as a different station). It means situations like this will need to be addressed manually by the user, for example by writing queries accordingly (relying on coordinates rather than identifiers).
Here is the sequence I plan to execute for every CSV record (a rough code sketch follows the list):
- Separate the record into info (2-6) and data fields (1, 7-53).
- Check if the `station` table has an entry identical to the info we have now, getting the internal station_id on success.
- If the station is missing or has differing attributes for the same station_number, store a new record in `station` with the fresh timestamp and info, and use its new station_id.
- Store the measurement values in `data`, uniquely identified by station_id and timestamp.
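And here is the sketch mentioned above, using the synchronous `postgres` crate as an assumption; the table layout, column names and the reduced field set are placeholders only.

> use postgres::{Client, Error};
>
> // Placeholder types standing in for the parsed CSV record parts.
> struct StationInfo { number: i32, name: String, lat: f64, lon: f64, elev: f64 }
> struct Measurement { time: String, t: Option<f64>, r: Option<f64> } // reduced for brevity
>
> fn store_record(db: &mut Client, info: &StationInfo, m: &Measurement) -> Result<(), Error> {
>     // 1. Look for a station row identical to the info in this record.
>     let found = db.query_opt(
>         "SELECT station_id FROM station
>          WHERE station_number = $1 AND station_name = $2
>            AND latitude = $3 AND longitude = $4 AND elevation = $5",
>         &[&info.number, &info.name, &info.lat, &info.lon, &info.elev],
>     )?;
>
>     // 2. Missing or changed station info: insert a fresh row and use its new id.
>     let station_id: i32 = match found {
>         Some(row) => row.get(0),
>         None => db
>             .query_one(
>                 "INSERT INTO station (station_number, station_name, latitude, longitude, elevation)
>                  VALUES ($1, $2, $3, $4, $5) RETURNING station_id",
>                 &[&info.number, &info.name, &info.lat, &info.lon, &info.elev],
>             )?
>             .get(0),
>     };
>
>     // 3. Store the measurement, identified by station_id and timestamp.
>     db.execute(
>         "INSERT INTO data (station_id, time, t, r) VALUES ($1, $2, $3, $4)",
>         &[&station_id, &m.time, &m.t, &m.r],
>     )?;
>     Ok(())
> }

In practice the station lookup would probably be cached in memory instead of being queried for every single record.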
As of this writing, I have not yet decided how column filtering will work with DB output. When pointed to an empty database, the program should be able to create its own schema, but with an existing `data` table, care should be taken if there is any mismatch regarding the fields to be included.
Last details
With a stable idea about the core functionality, there are only a couple of pieces missing from our specification. All of these concern the possible forms of execution.
So far I have been referring to the software to be made here as a simple command line application. If we look at my prototype download script, it has to be scheduled to run once every 10 minutes, and my first goal is to have a permanent replacement for that. How about having an option to invoke it as a daemon, running continuously in the background?
The main advantage of a daemon would be better timing. Our target file is changed periodically at the ODP, but we can only guess (or manually check) when this happens within the 10-minute window. I may be overly paranoid, but I like to avoid race conditions where I can. With a daemon, the dates could be checked and the download scheduled accordingly.
It has to be noted that we are talking about multiple dates that possibly do not match:
- Time instant when LATEST.zip was replaced with a new file.
- Modification date of LATEST zip file.
- Timestamp in the filename of the CSV inside.
- Modification date of the CSV inside.
- Actual data record timestamp(s) in the CSV.
As you can see, it is non-trivial which of these is the best option for processing newly available data as soon as possible while minimizing the risk of failure. We should also be prepared for failures, and a daemon could be better suited to deal with retries.
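As a rough illustration of the daemon idea, here is a polling sketch built on an assumption I still need to verify: that the ODP web server reports a usable Last-Modified header for the LATEST zip. Cheap HEAD requests would then reveal when the file was actually replaced (the first date in the list above).

> use std::{thread, time::Duration};
>
> const URL: &str = "https://odp.met.hu/weather/weather_reports/synoptic/hungary/10_minutes/csv/HABP_10M_SYNOP_LATEST.csv.zip";
>
> // Poll the Last-Modified header with HEAD requests (blocking `reqwest`),
> // and trigger a download whenever it changes. Error handling is minimal.
> fn poll_loop() -> Result<(), reqwest::Error> {
>     let client = reqwest::blocking::Client::new();
>     let mut last_seen: Option<String> = None;
>     loop {
>         let resp = client.head(URL).send()?;
>         let modified = resp
>             .headers()
>             .get(reqwest::header::LAST_MODIFIED)
>             .and_then(|v| v.to_str().ok())
>             .map(String::from);
>         if modified.is_some() && modified != last_seen {
>             last_seen = modified;
>             // download_and_process() would be called here
>         }
>         thread::sleep(Duration::from_secs(60));
>     }
> }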
Another aspect to mention regarding the application is configuration parameters. Most of the options will be accessible through command line arguments, but there are some that need to be read from a config file: I am thinking about database connection parameters in particular. Naturally we should try to avoid storing any passwords in plain text.
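For the config file, I am leaning towards something like TOML read through `serde`. The structure below is only a sketch with placeholder field names, and the password stays optional so it can come from the environment or a .pgpass file rather than sitting in the config as plain text.

> use serde::Deserialize;
>
> // Placeholder config layout; only the database section is sketched here.
> #[derive(Deserialize, Debug)]
> struct DbConfig {
>     host: String,
>     port: u16,
>     dbname: String,
>     user: String,
>     password: Option<String>, // better left unset and supplied another way
> }
>
> #[derive(Deserialize, Debug)]
> struct Config {
>     database: DbConfig,
> }
>
> fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
>     let text = std::fs::read_to_string(path)?;
>     Ok(toml::from_str(&text)?)
> }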
I will conclude by pointing out that although the main goal of the project is to create the CLI application described above, I would like to shape the Rust crate so that as many of its components as possible are exposed as library functions. This way, some of the capabilities could later be incorporated into other standalone solutions, which might also like to use the HungaroMet ODP as their data source.