How do I find out what archive format is being used?

Post by Joshua5 » Mon 22 Apr 2019 08:55

I'm doing a hobby project on some game data files. I would like to edit some things in them and repackage them so the game accepts the modifications.

The directories themselves were archived in a proprietary format which was easy enough to open up. The files were compressed with zlib. Now I'm stumped, because it seems there is still (at least) one more layer of archiving. The files seem to be serialized, but looking up the most common obvious answers didn't pan out. Google wasn't helpful. I didn't find any magic bytes (doesn't mean there aren't any, I just didn't find any). How do I find out what the serialization format is, if it is commercial? If it is not, how should I approach the problem?

A little background:

- The file is read by a Visual C++ application on Windows.
- I believe the file pre-serialization was XML-like.
- I've decompiled the .exe. Trying to step through the process while the data files were being read didn't work out (it reads in 7 GB of data, and I couldn't locate the start of the file type I wanted to work with). Fishing for helpful strings didn't work out either.
- I've tried comparing to Python pickle, marshal, VC++ MFC marshal and various archiving program formats. No luck.

Distinctive features of the serialized files:

The file end has a Table of Contents of some sort. Looks like this:

TOC0  4 bytes of offset  4 bytes of length
OBJE  8 bytes of offset  8 bytes of length

and so on. The other headings in the TOC are TOPO, CHNK, CLAS, PROP, STRG, TRAN, IMPR and EXPR all followed by offset and length. Offset and length values are big-endian.
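A minimal sketch of how entries like these could be read, assuming each one is a 4-byte ASCII tag followed by a big-endian offset and a big-endian length of the same width. The name `read_toc_entry` and the `width` parameter are hypothetical, the sample bytes are synthetic, and the real layout may well differ.

```python
def read_toc_entry(buf: bytes, pos: int, width: int = 4):
    """Read one assumed TOC entry: 4-byte tag, then offset and length.

    width is the assumed size in bytes of the offset and length fields.
    Returns (tag, offset, length, next_pos).
    """
    end = pos + 4 + 2 * width
    if end > len(buf):
        raise ValueError("buffer too short for a TOC entry")
    tag = buf[pos:pos + 4].decode("ascii", errors="replace")
    # Big-endian, as stated for the offset and length values.
    offset = int.from_bytes(buf[pos + 4:pos + 4 + width], "big")
    length = int.from_bytes(buf[pos + 4 + width:end], "big")
    return tag, offset, length, end


# Synthetic example data, not taken from a real game file.
sample = (
    b"TOC0" + (0x10).to_bytes(4, "big") + (0x20).to_bytes(4, "big")
    + b"OBJE" + (0x30).to_bytes(8, "big") + (0x400).to_bytes(8, "big")
)
tag1, off1, len1, nxt = read_toc_entry(sample, 0, width=4)
tag2, off2, len2, _ = read_toc_entry(sample, nxt, width=8)
print(tag1, off1, len1)
print(tag2, off2, len2)
```

If the parsed offsets and lengths point at plausible regions of the file (inside its bounds, not overlapping), that is some evidence the guessed field width is right.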

The file itself seems to be either type-length-value encoded (human-readable strings falling under the CLAS heading) or type-different type-value in 4 byte chunks. There are 4 byte blocks like AA AA AA AA, AB AB AB AB or BB BB BB BB which probably work as delimiters.
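One way to test the delimiter idea is to list where those words occur. This is only a sketch: it assumes the words sit on 4-byte boundaries, and `find_delimiters` is a hypothetical helper run here on synthetic bytes.

```python
from collections import Counter

# Candidate delimiter words taken from the description above.
CANDIDATES = (b"\xAA" * 4, b"\xAB" * 4, b"\xBB" * 4)


def find_delimiters(data: bytes, candidates=CANDIDATES):
    """Return (offset, word) for every 4-byte-aligned candidate word."""
    hits = []
    for pos in range(0, len(data) - 3, 4):
        word = data[pos:pos + 4]
        if word in candidates:
            hits.append((pos, word))
    return hits


# Synthetic example data, not taken from a real game file.
sample = b"\x00\x00\x00\x01" + b"\xAA" * 4 + b"ABCD" + b"\xBB" * 4 + b"\xAA" * 4
hits = find_delimiters(sample)
counts = Counter(word for _, word in hits)
for pos, word in hits:
    print(hex(pos), word.hex())
```

Regular spacing between hits would support the delimiter reading; irregular spacing might instead mean the words are padding or uninitialised fill.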

There are long parts of data where nothing changes except one byte is increased by 1. Looks like an index of sorts.
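A rough sketch for locating such index-like stretches, assuming the data can be cut into fixed-size records. The record size is a guess that has to be tried by hand, and `find_counter_runs` is a hypothetical helper shown on synthetic bytes.

```python
def find_counter_runs(data: bytes, record_size: int, min_run: int = 3):
    """Find runs of records where exactly one byte goes up by 1 each time.

    Returns a list of (start_offset, record_count) tuples.
    """
    n = len(data) // record_size
    runs = []
    run_start = 0
    for i in range(1, n + 1):
        step_ok = False
        if i < n:
            prev = data[(i - 1) * record_size:i * record_size]
            cur = data[i * record_size:(i + 1) * record_size]
            diffs = [(a, b) for a, b in zip(prev, cur) if a != b]
            step_ok = len(diffs) == 1 and diffs[0][1] == diffs[0][0] + 1
        if not step_ok:
            # The run ends at record i - 1; keep it if it is long enough.
            if i - run_start >= min_run:
                runs.append((run_start * record_size, i - run_start))
            run_start = i
    return runs


# Synthetic example: five 8-byte records with a counter in the last byte,
# followed by one unrelated record.
sample = b"".join(b"\xAA\xAA\xAA\xAA\x00\x00\x00" + bytes([k]) for k in range(5))
sample += b"\xFF" * 8
runs = find_counter_runs(sample, record_size=8)
print(runs)
```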

The file data may contain various data types.

I had the chance to compare two different versions of the data files. Changing int values in the unserialized file led to very small changes in the serialized file (typically, one number changed in the original led to one hex value being changed in the resulting file).
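That kind of comparison can be sketched as a plain byte-by-byte diff. This assumes both versions have the same length, which would not hold if a change shifts later data; `diff_offsets` is a hypothetical helper and the two blobs are synthetic.

```python
def diff_offsets(old: bytes, new: bytes):
    """List (offset, old_byte, new_byte) where two equal-length blobs differ."""
    if len(old) != len(new):
        raise ValueError("this simple comparison assumes equal lengths")
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Synthetic example: one value changed from 100 (0x64) to 101 (0x65).
old = bytes.fromhex("0000000aaaaaaaaa00000064")
new = bytes.fromhex("0000000aaaaaaaaa00000065")
changes = diff_offsets(old, new)
for offset, a, b in changes:
    print(hex(offset), hex(a), "->", hex(b))
```

Mapping each changed offset back to the value that was edited is a cheap way to learn which fields hold which numbers.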

The format is extremely space inefficient. Most everything is in 4-byte chunks and the file is compressible by a factor of 10. This and human readability of strings have led me to believe the file is not compressed or encrypted in any way. It's just serialized somehow.
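The "not compressed or encrypted" guess can be checked a little more formally with a compression ratio and a byte-entropy estimate. This is a sketch under the usual rule of thumb that compressed or encrypted data has entropy close to 8 bits per byte and barely shrinks further; the helper names are hypothetical and the sample data is synthetic.

```python
import math
import os
import zlib
from collections import Counter


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


def shannon_entropy(data: bytes) -> float:
    """Entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Synthetic stand-ins: padded 4-byte words versus random bytes.
padded = b"\x00\x00\x00\x2A" * 4096
random_like = os.urandom(16384)
for name, blob in (("padded", padded), ("random", random_like)):
    print(name, round(compression_ratio(blob), 2), round(shannon_entropy(blob), 2))
```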

Any help is greatly appreciated.

Post by Mike » Sat 1 Jun 2019 14:33

So you're trying to mod a game?

Post by chykka » Thu 4 Jun 2020 22:03

Sounds more like reverse engineering. The modding thread probably won't help much (but I bet the modders can). The tables can be read; the modding community figured that out. For anything else you will probably have to look elsewhere.

The file is read by a Visual C++ application on Windows
I believe the file pre-serialization was XML-like

Sounds like a tough archive to open; I couldn't tell you how. But maybe Microsoft can, as something has to be able to understand it.

It also sounds like you have already decompressed it, but the archives within the archive might be there for a reason, even if they are not always the most interesting part for a hobbyist. Since you decompiled the exe, you may find the Ghidra tool of some use. Anyway, I doubt they would use LZMA or Huffman coding; you can usually tell what it is. It would maybe require a lot of guess and check. But if it is a standard compression/archive format, there is likely obfuscation going on, which is why you are stuck.
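The guess-and-check part can be partly automated by throwing a blob at the decompressors in Python's standard library. This is only a sketch: `try_decompressors` is a hypothetical helper, it covers just zlib, raw deflate, gzip, bz2 and LZMA/XZ, and a blob that decodes without error is a hint rather than proof.

```python
import bz2
import lzma
import zlib


def try_decompressors(blob: bytes):
    """Return the names of standard codecs that decode blob without error."""
    attempts = {
        "zlib": lambda b: zlib.decompress(b),
        "raw deflate": lambda b: zlib.decompress(b, -zlib.MAX_WBITS),
        "gzip": lambda b: zlib.decompress(b, 16 + zlib.MAX_WBITS),
        "bz2": bz2.decompress,
        "lzma/xz": lzma.decompress,
    }
    worked = []
    for name, decode in attempts.items():
        try:
            decode(blob)
        except Exception:
            # Any failure just means "not this format"; move on.
            continue
        worked.append(name)
    return worked


# Synthetic example data, not taken from a real game file.
print(try_decompressors(zlib.compress(b"hello" * 100)))
print(try_decompressors(b"\xff" * 32))
```

In practice the unknown layer may start at some offset inside the file, so the same check would have to be repeated on slices starting at candidate offsets.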

The format is extremely space inefficient. Most everything is in 4-byte chunks and the file is compressible by a factor of 10. This and human readability of strings have led me to believe the file is not compressed or encrypted in any way. It's just serialized somehow.

I'm speculating, but compressing files decreases read time, so there might be frequent calls made to this file where speed is important. Is it signed in any way? Modifying the file might change the signature and make it fail a signature check.

But in no way are you going to get any help pulling apart proprietary software on the developer's forum, lol. That is cheeky, bro. Almost sounds like how China ended up with a few F-35 documents.