I recently wanted to understand the Parquet binary format, so I attempted to write a specification for it in Kaitai Struct. Kaitai Struct is a declarative language for specifying binary formats, embedded in YAML. Once such a specification is written, one can generate parsers from it in several supported languages. The specification can also be used to visualize the elements of the format in the Kaitai Web IDE, which shows output somewhat similar to what Wireshark shows for network protocols. That’s what I want to use it for (if I ever finish the spec).
One stumbling block I encountered was the way a SchemaElement is encoded in Parquet. A SchemaElement is a struct with a couple of fields selected from a fixed collection. The fields have numerical identifiers, which are not stored directly but as deltas between their values. To obtain the actual field id, one has to compute the running sum of the deltas stored (as 4-bit integers) in the file. For a while it was unclear whether there was a way to express such a sum in Kaitai Struct, and an issue about it was opened on the Kaitai GitHub repository more than four years ago. However, some three weeks ago cher-nov came up with a solution to this problem and posted it to the issue linked above. In this post I want to present an application of this solution to a toy format, as another example that one can look at.
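To make the delta idea concrete, here is a minimal Python sketch of the running sum (in Python rather than Kaitai Struct, purely as an illustration). The function names are hypothetical, and the byte layout, with the delta in the high 4 bits of each header byte, is an assumption made for this toy example rather than a faithful description of the Parquet encoding.

```python
from itertools import accumulate


def deltas_from_header_bytes(data):
    """Extract one 4-bit delta per byte.

    Assumption for this toy layout: the delta lives in the high nibble
    of each byte and the low nibble is ignored.
    """
    return [b >> 4 for b in data]


def field_ids_from_deltas(deltas):
    """Turn a sequence of 4-bit deltas into absolute field ids.

    Each id is the sum of all deltas seen so far (a running sum).
    """
    for d in deltas:
        if not 0 <= d <= 0xF:
            raise ValueError(f"delta {d} does not fit in 4 bits")
    return list(accumulate(deltas))


if __name__ == "__main__":
    header_bytes = bytes([0x15, 0x18, 0x25])
    deltas = deltas_from_header_bytes(header_bytes)
    print(deltas)
    print(field_ids_from_deltas(deltas))
```

The difficulty in a declarative specification is exactly the state that `accumulate` carries here: each field id depends on all the deltas that came before it.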