A super hacky tool to decode unknown binary formats
Table of Contents
- The Issue
- The Solution
- The Future
I often work with new or unknown binary formats, and decoding them with a hex editor is hard work. When trying to understand and reverse engineer such a format, a hex editor alone is often not enough.
I have converged on the same solution several times over the last few years, and the last time I thought:
Let’s make it an actual tool, not just another hacky script for the current purpose.
The tool I created is called livedecode. It consumes two files, and dumps its output on stdout:
[user@host project]$ livedecode example/png.spec example/example.png
This will read a specification file (png.spec) and apply the decoding instructions in it to the binary file (example.png).
png.spec looks something like this:
    endian be
    print "File Header:"
    u8 magic_8bit # should be 0x89
    str 5 magic # PNG\r\n
    u8 magic_1a # 0x1A
    u8 magic_0a # 0x0a
    def tEXt 1950701684
    def IHDR 1229472850
    def tIME 1950960965
    call chunk "IHDR"
    call chunk "gAMA"
    call chunk "cHRM"
    ...
As you can see, the format is line-based and has an assembler-like syntax. Each thing you decode (e.g. u8 magic_1a) will print its result to stdout.
You can also invoke subprograms with call <pgm> <args..>, as you can see in call chunk "IHDR".
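The numbers in the def lines match the four ASCII characters of each PNG chunk type read as a big-endian 32-bit integer. The following Python check is only an illustration and not part of livedecode; the helper name chunk_tag is made up for this sketch.

```python
def chunk_tag(name: str) -> int:
    """Interpret a four-character chunk name as a big-endian u32."""
    raw = name.encode("ascii")
    if len(raw) != 4:
        raise ValueError("PNG chunk types are exactly four bytes long")
    return int.from_bytes(raw, "big")


# Print the constants for the chunk types defined in png.spec.
for name in ("tEXt", "IHDR", "tIME"):
    print(name, chunk_tag(name))
```

This makes it easy to compute the value for any further chunk type you want to define in the spec.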
A smaller specification might look like this:
    endian le
    u32 magic
    u32 type
    u32 offset
    u32 length
    print section 1
    seek *offset
    dump *length
    .if *type 10
    seek 0x200
    str 10 description
    .endif
This format has a header consisting of a magic number, a file type, an offset and a length. The program then seeks to the specified offset and prints a hex dump of the length specified in the header.
Also, if the type is 10, it will seek to offset 512 (0x200) and print a 10-character string labelled description.
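For comparison, here is roughly what the same decoding steps look like when written by hand in Python with the struct module. This is a sketch of the logic described above, not how livedecode is implemented; the function name decode, the returned dictionary, and the ASCII decoding of the string are assumptions made for the example.

```python
import struct


def decode(data: bytes) -> dict:
    # endian le, followed by four u32 fields: magic, type, offset, length
    magic, ftype, offset, length = struct.unpack_from("<IIII", data, 0)
    result = {"magic": magic, "type": ftype, "offset": offset, "length": length}
    # seek *offset, then dump *length
    result["section"] = data[offset:offset + length]
    # .if *type 10 ... .endif
    if ftype == 10:
        # seek 0x200, then str 10 description (ASCII is assumed here)
        raw = data[0x200:0x200 + 10]
        result["description"] = raw.decode("ascii", errors="replace")
    return result


# Build a small synthetic file that matches the layout and decode it.
payload = b"\xde\xad\xbe\xef"
blob = bytearray(0x200 + 10)
struct.pack_into("<IIII", blob, 0, 0x1234, 10, 16, len(payload))
blob[16:16 + len(payload)] = payload
blob[0x200:0x200 + 10] = b"hello spec"
fields = decode(bytes(blob))
print(fields)
```

The spec file expresses the same steps in twelve short lines and needs no test harness, which is what makes it convenient for quick experiments.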
I use livedecode in VSCode by running it periodically in a terminal:
[user@host project]$ while true; do
    clear
    date
    livedecode docs/wmb6.spec data/wmb/wmb6/block.wmb > /tmp/dump.txt
    sleep 1
done
Then I view dump.txt side by side with my spec file.
This way, I can type, save and immediately see the new decoding result. Working this way is very efficient and really supports an explorative workflow.
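A variation on the polling loop is to re-run the decoder only when one of the input files has actually changed. The Python sketch below is one possible way to do that; the function names snapshot and watch are made up for this example, and the only thing it assumes about livedecode is the two-argument invocation shown earlier.

```python
import os
import subprocess
import time


def snapshot(paths):
    """Return the modification times of the given files as a tuple."""
    return tuple(os.stat(p).st_mtime_ns for p in paths)


def watch(spec, binary, out_path, interval=0.5):
    """Re-run livedecode whenever the spec or the binary file changes."""
    last = None
    while True:
        try:
            current = snapshot([spec, binary])
        except FileNotFoundError:
            # Some editors briefly remove the file while saving; try again.
            time.sleep(interval)
            continue
        if current != last:
            last = current
            with open(out_path, "w") as out:
                # Same invocation as the shell loop: spec first, then data.
                subprocess.run(["livedecode", spec, binary], stdout=out)
        time.sleep(interval)
```

Calling watch("docs/wmb6.spec", "data/wmb/wmb6/block.wmb", "/tmp/dump.txt") would then behave like the shell loop, but without rewriting dump.txt every second.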
At roughly the same time that livedecode was finished, I started working on a new project that has a similar goal but a different approach: BFDL.
BFDL uses a formal syntax to describe file formats instead of just executing a loose set of instructions. The benefit of that approach is that once you are done discovering or designing your file format, you can simply generate a serializer/deserializer for it straight from your specification.
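To illustrate why a declarative description helps, the Python sketch below derives both a reader and a writer from a single field list. It does not use BFDL syntax and is not how BFDL works internally; the HEADER table and the function names are hypothetical and only demonstrate the general idea.

```python
import struct

# Hypothetical declarative description: (field name, struct format code).
HEADER = [("magic", "I"), ("type", "I"), ("offset", "I"), ("length", "I")]


def layout(fields, endian="<"):
    """Build a struct.Struct from the field list (little-endian by default)."""
    return struct.Struct(endian + "".join(code for _, code in fields))


def deserialize(fields, data):
    """Read the described fields from the start of data into a dict."""
    values = layout(fields).unpack_from(data)
    return dict(zip((name for name, _ in fields), values))


def serialize(fields, record):
    """Write the described fields from a dict back into bytes."""
    return layout(fields).pack(*(record[name] for name, _ in fields))


record = {"magic": 0x1234, "type": 10, "offset": 16, "length": 4}
encoded = serialize(HEADER, record)
decoded = deserialize(HEADER, encoded)
print(decoded == record)
```

Because both directions are derived from the same description, they cannot drift apart, which is much harder to guarantee with a list of imperative decoding instructions.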