Reverse-Engineering File Formats for Fun

Jun 11, 2012, 9:00 AM

Tags: java

Well, okay, it wasn't for fun; it was for work. About a year and a half ago, I had occasion to parse the contents of shared object files from a Flash Media Server ("*.fso" files), and I figured it might be interesting to go back over the kind of things I had to do to accomplish that.

The first thing I did was search around to see if anyone else had solved the same problem. However, while there are plenty of parsers for "Flash shared objects", not least the one in Flash itself, that format appeared to be different from the one used by servers (unless I was doing it wrong), so I couldn't use any of the ones I found. That left me with a folder full of binary files and a task to accomplish. Sounds like a job for programming!

Fortunately, I knew the shape of the data inside the files: they were essentially arrays of maps. I could see the keys and some of the values in there, so the files were binary but not compressed, which gave me hope. However, since the data was variable-length, I couldn't just find the record delimiter and use offsets - I had to go through and find the various byte-value delimiters for each aspect and write a proper parser.

I opened up a couple of the files in vi sessions to compare them and, using the handy ga command, checked the byte values of the non-ASCII characters. I found that the file always starts with 0 then 3, and there appeared to be a "general" delimiter of 0, 0, 0 to break up the header sections of the file and each record. Then I saw a number that differed from file to file - ah-hah, the record count!
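For illustration, here is a minimal sketch of those first two observations in Java. It is not the original parser: the class and method names are made up for this example, and it assumes the file has already been read into an array of unsigned int values.

```java
// Hypothetical helper, not the original code: checks the two facts
// observed in the hex dump (leading 0, 3 and the 0, 0, 0 delimiter).
class FsoHeaderScan {

    // True if the data begins with the byte values 0 then 3.
    static boolean startsWithExpectedBytes(int[] data) {
        return data.length >= 2 && data[0] == 0 && data[1] == 3;
    }

    // Index of the first 0, 0, 0 run at or after 'from', or -1 if none.
    static int indexOfDelimiter(int[] data, int from) {
        for (int i = Math.max(from, 0); i + 2 < data.length; i++) {
            if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

A scan like this only finds candidate delimiters. Since a run of three zero bytes could in principle also occur inside a value, a real parser still has to track where it is in the structure rather than splitting blindly.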

After another delimiter, I found a number followed by the name of the file (for some reason). It turned out that the number corresponded to the length of the file name, indicating that the format uses Pascal-style length-prefixed Strings. Good to know! After the name came a second instance of the record count for some reason, leaving the rest of the file to the actual records.
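Reading a length-prefixed string might look roughly like the sketch below. The names are hypothetical, and two details are assumptions rather than facts about the format: that the length fits in a single byte, and that the characters can be decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the original code.
class LengthPrefixedString {

    // Reads a string whose length is stored at data[offset] and whose
    // characters follow immediately. Assumes a one-byte length prefix.
    static String read(int[] data, int offset) {
        int length = data[offset];
        if (offset + 1 + length > data.length) {
            throw new IllegalArgumentException("String runs past end of data");
        }
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = (byte) data[offset + 1 + i];
        }
        // Assumed encoding: one byte per character.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Offset of the first value after the string that starts at 'offset'.
    static int nextOffset(int[] data, int offset) {
        return offset + 1 + data[offset];
    }
}
```

Returning the next offset separately keeps the caller in charge of the read position, which matters when every field is variable-length.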

Conceptually, the records are easy: there should be about $record_count instances of some sort of record delimiter with the data packed in there in, essentially, key-value pairs. That's ALMOST how it is, but it got a bit weird: all records ended with the same "general" delimiter used in the header, but they ALSO had their own two-byte record delimiter. That is, except when they used a three-byte variant for some reason. And there are a couple of unknown values in there - they were different in each record and I'm sure they serve some purpose, but I couldn't figure out what it was. Once I figured out the hairy bits, though, the data was pretty much as I expected: there was a record index (for some reason), then the expected key-value pairs. Most of the values were strings, but not all - there was an "urgent" key in some entries that had no corresponding value (though now that I look back at the code, I suspect that it was a boolean value of \1 to indicate "true").

The end result of all this was by far the most C-like Java I've ever written:

I felt pretty good after finishing that. High-level object trees are fun and all, but it's good to get your hands dirty from time to time with lower-level stuff like reading files as integer arrays.

PS: I don't know why my FSORecord class didn't just extend HashMap. I blame the brain haze from staring at binary files for too long. Conversely, I think I did have a reason to use arrays of int instead of byte - I think it had to do with Java byte being always signed but the data in the file being unsigned.
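To show what the signedness problem looks like, here is a small sketch (the class name is made up for this example). A Java byte ranges from -128 to 127, so any file value above 127 comes out negative unless it is masked into an int.

```java
// Hypothetical demo class, not the original code.
class UnsignedBytes {

    // Widens a signed Java byte to its unsigned value (0 to 255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Converts a whole buffer, e.g. one filled by InputStream.read(byte[]).
    static int[] toUnsignedArray(byte[] raw) {
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xC8;              // the file byte 200
        System.out.println(b);             // prints -56
        System.out.println(toUnsigned(b)); // prints 200
    }
}
```

With the values widened up front, comparisons against delimiters and length prefixes can use plain int literals without a cast or mask at every step.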
