Parsing SC4 DAT and Lot files

joshua43214 · September 27, 2014, 12:31:58 PM

I am working on a Python script that can parse and edit DAT files in bulk. At first I want the script to edit DAT and Lot files for CAM to make them compliant with IRM, and maybe eventually do other operations as well.

This should be a simple scripting job, but I am running into trouble with the format of the DAT files. I might simply be looking at the problem wrong. I tend to think of data files as arrays or tables, so I might need some help getting my head around how data is organized in the DAT file itself.

I randomly selected the Sperny productions
http://community.simtropolis.com/files/file/17279-blam-really-dirty-industries/
to play with (no pun intended lol)

Opening the file, I can see some possible options for splitting it, most notably the "\n" newline character.
Splitting the file on newlines gives me this as the first line:

'DBPF\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x87]WE\x8b]WE\x07\x00\x00\x00A\x00\x00\x001\x81\x01\x00\x14\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x03\x00\x00\x04\x00\x00\x00@\x00\x00\x00\x02\x10\x00\x00\x02\x00\x00\x00\x01\x00\x01\x00\x01\x00\x04\x00\x02\x00\x00\x00(\x00\x00\x00BlaM MX I-d1 1x1 Sperny Production Plant"\x00\x00\x00Plugin_RCI_Buildings_Prop_Modd.dat@\x00\x00\x00Sperny Production Plant-0x6534284a-0x106aa2a2-0x9261ef57.SC4Desc\x99\x01\x00\x00\x10\xfb\x00\x02\x9b\xe1EQZB1###\x87@\x00\x00\x01\x03\x16\x05\x08\x10\x11\x08\x03\x05\x0c \xe0\x0c\x80\x00\x00\x01\x07(\xeaBlaM MX I-d1 1x1 Sperny Production Plant7(\x81\'\x020\x00\x01\xe0\x01\xf0\x88J\t\tJ\xe0\x08\xa4\x08\x9b\x13fi\x00\t\xe0\x89\xc7\xed\x88\x08\x16'

which looks a bit like a header with an awful lot of tab spaces.
So I need some general help with interpretation.
Some parts don't seem to make much sense such as this mix of hex and plain text
\x87]WE\x8b]WE\x07
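
A note on that snippet: the mix is just how Python prints raw bytes. Any byte that happens to be a printable ASCII character is shown as that character, and everything else as \xNN, so "]WE" is three ordinary data bytes rather than text. The sketch below (Python 3) reads the start of the dump as little-endian 32-bit integers. The field layout and the Unix-timestamp reading of the dates are assumptions based on the DBPF wiki page, and parse_dbpf_header is a made-up helper name.

```python
import struct
from datetime import datetime, timezone

def parse_dbpf_header(data):
    """Decode the first 48 bytes of a DBPF file (assumed DBPF 1.0 layout)."""
    if data[:4] != b"DBPF":
        raise ValueError("not a DBPF file")
    # "<" = little-endian, "I" = unsigned 32-bit int, "12x" = skip 12 reserved bytes
    (major, minor, created, modified,
     index_major, index_count, index_offset, index_size) = struct.unpack_from(
        "<II12xIIIIII", data, 4)
    return {
        "version": (major, minor),
        "created": datetime.fromtimestamp(created, timezone.utc),
        "modified": datetime.fromtimestamp(modified, timezone.utc),
        "index_major": index_major,
        "index_count": index_count,    # number of entries in the index table
        "index_offset": index_offset,  # where the index table starts in the file
        "index_size": index_size,      # size of the index table in bytes
    }

# The first 48 bytes of the dump above, retyped as a bytes literal.
header_bytes = (b"DBPF\x01\x00\x00\x00\x00\x00\x00\x00" + b"\x00" * 12
                + b"\x87]WE\x8b]WE\x07\x00\x00\x00A\x00\x00\x00"
                + b"1\x81\x01\x00\x14\x05\x00\x00")
header = parse_dbpf_header(header_bytes)
for key, value in header.items():
    print(key, value)
```

Read this way, "A\x00\x00\x00" is the number 65 (0x41) and "\x14\x05\x00\x00" is 1300, which fits an index of 65 entries at 20 bytes each.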

I work with a lot of large data sets that are typically formatted so that \n separates lines or rows, \t separates entries in a row, and items inside an entry are separated by whitespace, the "|" symbol or a ";". My data sets typically also have a separate meta-data file for mapping items in the data file to their origin.
These files feel like they are formatted along some other line of reasoning. I know almost nothing about programming outside of Python (I know a tiny bit of Perl and HTML), and even less about database structures like SQL.

If someone could please take the time to give me a bit of an overview of how to interpret the formatting, I would greatly appreciate it. Most importantly, how is the file formatted so that the game can look for something like the lot zone configuration (I forget the exact name) and see that it is medium density industrial?

catty · September 27, 2014, 12:48:10 PM

Recommend reading this document in the wiki

http://www.wiki.sc4devotion.com/index.php?title=DBPF

JoeST · September 28, 2014, 01:13:44 AM

The DBPF format is basically a packaging format (like a zip, or more accurately a tar, file) for collecting a bunch of assets, each with a TGI identifier. Each of the files/records has its own format (some are JPEG etc., others are 3D models etc.) and some are even compressed to save space. They are most definitely not as easy to parse as the plain-text data sets you've come across before, since most of the data and files in DBPF files are encoded in a binary format. Quite a few people (myself included) have taken a stab at parsing and manipulating DAT files at the low level; wouanagaine (datpacker, etc.) and ilive (ilive's Reader) come to mind. Here are a few different options for DBPF stuff.

* wouanagaine's SC4Mapper-2013: his (map-file-specific) reader/writer class is written in Python
* my own take on parsing in a few different languages (see the branches list) and most definitely not properly usable but a good start at various ways to try

There's also a bunch of other libraries written in Java, C, C++, C#, etc. by other people, which you should be able to find around the forums.
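
To make the packaging idea concrete, here is a rough Python 3 sketch of walking the index table that lists the packed records. It assumes index version 7.0, where each entry is five little-endian uint32 values (type, group, instance, file offset, size). read_index and the tiny synthetic "file" are hypothetical; only the TGI is real, taken from the .SC4Desc name visible in the dump in the first post.

```python
import struct
from collections import namedtuple

IndexEntry = namedtuple("IndexEntry", "type group instance offset size")

def read_index(data, index_offset, index_count):
    """Return the index entries of a DBPF file held in memory as bytes."""
    entry = struct.Struct("<5I")  # 5 x uint32 = 20 bytes per entry (assumed layout)
    return [IndexEntry(*entry.unpack_from(data, index_offset + i * entry.size))
            for i in range(index_count)]

# Synthetic example: 8 bytes of record data followed by a one-entry index.
record = b"EQZB1###"
fake_file = record + struct.pack("<5I", 0x6534284A, 0x106AA2A2, 0x9261EF57,
                                 0, len(record))
entries = read_index(fake_file, index_offset=len(record), index_count=1)

for e in entries:
    print("T=0x%08x G=0x%08x I=0x%08x" % (e.type, e.group, e.instance))
    # The entry says where the record lives; slice it out of the file.
    print(fake_file[e.offset:e.offset + e.size])
```

In a real file, index_offset and index_count would come from the DBPF header rather than being known in advance.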

joshua43214 · September 28, 2014, 11:03:27 PM

Thank you very much catty and JoeST, these were exactly the type of resources I was looking for.
It made sense once I saw what was going on. So I wrote a simple script that could parse through and separate the data so I could experiment and play with it.

I ran right into compression...
The documentation is hard to follow, and my skills with C and PHP are too limited to make heads or tails out of the sample code.
I get the general gist of what is going on though. Basically look for a set of control characters, use those to advance through the string of bytes and copy stuff forward from what has been passed through.
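
That description maps onto code fairly directly. Below is a minimal Python 3 sketch of a QFS decompressor, with the bit masks taken from the DBPF Compression wiki page as I read it. It is not the poster's script or Wouanagaine's code, and qfs_decompress is an illustrative name. It assumes the 9-byte header (4-byte size, 0x10FB signature, 3-byte uncompressed size) is still attached. The sample is the start of the compressed exemplar quoted in this post, cut after "PropCount"; the header and the closing 0xFC stop code are made up so that the fragment forms a complete stream.

```python
def qfs_decompress(data):
    """Decompress one QFS-compressed DBPF entry given as bytes (Python 3)."""
    if data[4:6] != b"\x10\xfb":
        raise ValueError("missing 0x10FB signature")
    expected_size = int.from_bytes(data[6:9], "big")  # 3-byte big-endian size
    out = bytearray()
    pos = 9                                           # first control character
    while pos < len(data):
        cc0 = data[pos]
        if cc0 <= 0x7F:                               # 2-byte control code
            cc1 = data[pos + 1]
            pos += 2
            plain = cc0 & 0x03
            copy = ((cc0 & 0x1C) >> 2) + 3
            offset = ((cc0 & 0x60) << 3) + cc1 + 1
        elif cc0 <= 0xBF:                             # 3-byte control code
            cc1, cc2 = data[pos + 1], data[pos + 2]
            pos += 3
            plain = (cc1 >> 6) & 0x03
            copy = (cc0 & 0x3F) + 4
            offset = ((cc1 & 0x3F) << 8) + cc2 + 1
        elif cc0 <= 0xDF:                             # 4-byte control code
            cc1, cc2, cc3 = data[pos + 1], data[pos + 2], data[pos + 3]
            pos += 4
            plain = cc0 & 0x03
            copy = ((cc0 & 0x0C) << 6) + cc3 + 5
            offset = ((cc0 & 0x10) << 12) + (cc1 << 8) + cc2 + 1
        elif cc0 <= 0xFB:                             # literal run only
            pos += 1
            plain = ((cc0 & 0x1F) << 2) + 4
            copy = offset = 0
        else:                                         # 0xFC-0xFF: last literals, stop
            pos += 1
            plain = cc0 & 0x03
            copy = offset = 0
        out += data[pos:pos + plain]                  # plain text comes first
        pos += plain
        for _ in range(copy):                         # byte-wise so overlaps work
            out.append(out[-offset])
        if cc0 >= 0xFC:
            break
    if len(out) != expected_size:
        raise ValueError("size mismatch: got %d, header says %d"
                         % (len(out), expected_size))
    return bytes(out)

# Made-up header: size field left at 0, signature, uncompressed size 72 (0x48).
fragment = (b"\x00\x00\x00\x00\x10\xfb\x00\x00\x48"
            b"\xe6EQZT1###\r\nParentCohort=Key:{\x13\x000x0\x91@\n,\x014}"
            b"\xe1ropCount\xfc")
result = qfs_decompress(fragment)
print(result.decode("ascii"))
```

The copy is done one byte at a time on purpose: a copy offset of 1 with a length of 7 is how the stream turns "0x0" into "0x00000000", so the source and destination ranges are allowed to overlap.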

I am stuck on figuring out what is what in the beginning of the string.
Here is the last output from a script that picked out the compressed files:
'\x1f\x03\x00\x00\x10\xfb\x00\x05m\xe6EQZT1###\r\nParentCohort=Key:{\x13\x000x0\x91@\n,\x014}\xe1ropCount\x15\x16=\xe017\r\n\x18\x0b\xe50:{"Exemplar Type"}=Uint\x03\x1a32:\x18I\x01I2\x14\x0c\x89@32\x073Nam\xe0Stri\x063ng\xe1"Sperny \x00x\xe2duction Plan\x0eBt"\xe099af\x03\'acd\xe2Bulldoze Cos\x00\x1d\x02v=S\x89\x80v64\x10\x00\r~6\xe02781\n\xb228\xe0Occu\x01Tp\xe0 Siz\x04~\xe0Floa\x04\xb3\xe33:{7.032,10.09,1\x12\xba5.\x0c;\x02;21\xe1Resource!B \x08\xf2\x1e\xf4 1\x00@\xe10x5ad0e8"I17\xe1525f38a2,T1T3\x14O\x000\xe1{"Wealth\x10B\x11\xbf8\x88\x00u\x01T3\xe0"Pur7^pos\x14&\x88@&6\x01&4\xe3Capacity Satisfi\x1e\x9bed\x00l,\x19\x87\x81\xe542\x88@\x901\x05\xe05\xe0Poll)xu\xe0at c@22^er\x00D\x00d8\xc8\x1dD5\x1d\ne\x8aB@1\x18Z\x04\x9e\xe0Powe!\xb0r\xe0nsum\x88\x00\x9a\x8e\x02+\xe129244db5\x004\xe1Flammabi\x01\xd7l\x8a\x00\xff\x11`5\xe0a499\x06+f8\xe0QuerW]y e\xe0GUID\x8b\x00e\xe12a567dc1\x0c9\xe0c8f8\x88\xc2\x91746\xe0Cate\x17jgor\x187\x00*\xe0bc17\x087\x00o@\x86&q38\x00\xccR\xa9trJ\xcdTi\x88\x01\xa2\x1108\xe0beda%^3\xe1MaxFireS\x8c\xc0+tag\x0c\xfa\xe068eeB\xca97\x88\x01\x8a\xe0radi\x87B\xa9i\x00\x1d\xe016,2 \xc7-\x04,\xe08a1cKQ3e7`\xe0DP\x96\x03\x11\x0e38c\xe0cb35\x88\xc3\x1111fD\xd1\x1d\xfbs\x00\xb1\\4\x084\xe0aa1d\'3d39\x144\xe0Grou\x87@4ph\x05L\\Q\xfa0\x87Bt1d\x06\x88\x00Je\xf57/\xbcFX:\xe0Soun\x8c\x02\x1a\xe1ea55baab\x105\xe08355\x05\xb58\xe0Cran\x03)e H\x14}4J\x0c`\xe0c8ed\'J2d8!\xd6W\x9c\x02z,\xdc\xe0e91a\x07_0b5\xe2Building val%\xadu\x95\x01R\x03<374\xfc'

It looks a lot like a section we all like to edit

the first 10 bytes:
\x1f\x03\x00\x00\x10\xfb\x00\x05m\xe6
these have the numeric values:
(31, 3, 0, 0, 16, 251, 0, 5, 109, 230)
which translates into this when taken one byte at a time using struct.unpack("B", <byte>), where "B" is an unsigned char of size 1 per https://docs.python.org/2/library/struct.html
['0x1f', '0x3', '0x0', '0x0', '0x10', '0xfb', '0x0', '0x5', '0x6d', '0xe6']
Trying to take them pairwise seemed counterintuitive, since the beginning uses odd-numbered lengths for the header and the like.
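
One way to read those ten bytes, assuming the header layout described on the DBPF Compression wiki page: a 4-byte little-endian compressed size, the 2-byte signature 0x10FB, and a 3-byte big-endian uncompressed size. That accounts for the odd lengths and leaves the tenth byte as the first control character. A short Python 3 sketch (the variable names are just for illustration):

```python
import struct

first10 = b"\x1f\x03\x00\x00\x10\xfb\x00\x05m\xe6"

compressed_size, = struct.unpack("<I", first10[0:4])     # little-endian uint32
signature = first10[4:6]                                  # QFS marker bytes
uncompressed_size = int.from_bytes(first10[6:9], "big")   # 3 bytes, big-endian
first_control = first10[9]                                # start of the payload

print(compressed_size, signature.hex(), uncompressed_size, hex(first_control))
# prints: 799 10fb 1389 0xe6
```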

Is 0x1f the same as 0x1F, and 0xfb the same as 0xFB? I am having some trouble understanding this page
http://www.wiki.sc4devotion.com/index.php?title=DBPF_Compression
where it talks about things like

0xE0 - 0xFB

CC length:      1 byte 
Num plain text: ((byte0 & 0x1F) << 2) + 4
Num to copy:    0 
Copy offset:    - 

Bits: 111ppppp 
Num plain text limit: 4-128 
Num to copy limit:    0 
Maximum Offset:       -
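
Read as code, that table says: a control character in this range is a single byte, its low five bits (the "ppppp") give the number of plain-text bytes to copy straight to the output, and nothing is copied from earlier output. A small Python sketch, with plain_run_length as a made-up name:

```python
def plain_run_length(cc):
    """Number of literal bytes that follow a 0xE0-0xFB control character."""
    if not 0xE0 <= cc <= 0xFB:
        raise ValueError("not in the 0xE0-0xFB class")
    return ((cc & 0x1F) << 2) + 4   # low five bits times 4, plus 4

# 0xE6 is the tenth byte of the dump above.
n = plain_run_length(0xE6)
literal = b"EQZT1###\r\nParentCohort=Key:{"
print(n, len(literal))
```

For 0xE6 this gives 28, which is exactly the length of the readable text that follows it in the dump before the next control byte (\x13). Since the class stops at 0xFB, the largest run this formula actually produces is 112.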

I could use some help interpreting which parts are telling me to do what.
Thanks again for the help.
Josh

JoeST · September 28, 2014, 11:24:38 PM

I've never had much luck understanding the QFS compression stuff, but luckily Wouanagaine has done quite a lot, and another Python library has this example written in Python rather than C. According to the commit comment it doesn't quite do the job properly, but it might be worth looking at.

memo · September 29, 2014, 12:09:00 AM

Josh, you will have a much easier time figuring out how to call Wouanagaine's QFS code from Python than implementing the QFS decompression yourself. I am not a Python guy, so I can't tell you how to do it, but Wouanagaine has done it before (practically all his code is Python). What is more, if you want to edit DBPF files, you'll also want QFS compression eventually, which is significantly harder to implement than the decompression code.

That said, I'd recommend that you stop looking at the text representation of the data. It is a binary format, so you should view it in a hex editor to display the raw bytes. 0x1f and 0x1F are indeed the same, but again, if you don't know what it means, you should probably choose a different approach.

By the way, my own DBPF library is implemented in Scala and already supports reading and writing of Exemplar files, ... but it is not Python, obviously. (Compression/Decompression)

joshua43214 · September 30, 2014, 09:17:29 AM

Thank you all again.
After messing around with it for a while, I finally got decompressing to work. Turned out to be simpler than I expected.

The above junk turned into this:
EQZT1###
ParentCohort=Key:{0x00000000,0x00000000,0x00000000}
PropCount=0x00000017
0x00000010:{"Exemplar Type"}=Uint32:0:{0x00000002}
0x00000020:{"Exemplar Name"}=String:0:{"Sperny Production Plant"}
0x099afacd:{"Bulldoze Cost"}=Sint64:0:{0x0000000000000062}
0x27812810:{"Occupant Size"}=Float32:3:{7.032,10.09,15.02}
0x27812821:{"Resource Key Type 1"}=Uint32:3:{0x5ad0e817,0x525f38a2,0x00030000}
0x27812832:{"Wealth"}=Uint8:0:{0x02}
0x27812833:{"Purpose"}=Uint8:0:{0x06}
0x27812834:{"Capacity Satisfied"}=Uint32:2:{0x00004200,0x00000010}
0x27812851:{"Pollution at center"}=Sint32:4:{0x00000025,0x0000001e,0x00000011,0x00000000}
0x27812854:{"Power Consumed"}=Uint32:0:{0x00000002}
0x29244db5:{"Flammability"}=Uint8:0:{0x50}
0x2a499f85:{"Query exemplar GUID"}=Uint32:0:{0x2a567dc1}
0x2c8f8746:{"Exemplar Category"}=Uint32:0:{0x2c8fbc17}
0x499afa38:{"Construction Time"}=Uint8:0:{0x08}
0x49beda31:{"MaxFireStage"}=Uint8:0:{0x02}
0x68ee9764:{"Pollution radii"}=Float32:4:{16,20,0,0}
0x8a1c3e72:{"Worth"}=Sint64:0:{0x000000000000008c}
0x8cb3511f:{"Occupant Types"}=Uint32:1:{0x00004200}
0xaa1dd396:{"OccupantGroups"}=Uint32:3:{0x00001002,0x00014200,0x00003000}
0xaa1dd397:{"SFX:Query Sound"}=Uint32:0:{0xea55baab}
0xaa83558f:{"Crane Hints"}=Uint8:0:{0x00}
0xc8ed2d84:{"Water Consumed"}=Uint32:0:{0x00000007}
0xe91a0b5f:{"Building value"}=Sint64:0:{0x0000000000000374}

which appears to be free of defects.
If anyone sees an issue please let me know.
I almost danced when I finally got an output that was not garbage.
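
The text form above is regular enough to pick apart with a regular expression, which helps when the goal is bulk editing. The following Python sketch is only a starting point under a few assumptions: every property line looks like the ones shown, string values contain no embedded quotes, and signed types are not sign-corrected. parse_property is a hypothetical helper.

```python
import re

# id : {"name"} = type : repetitions : {values}
PROP_RE = re.compile(r'^(0x[0-9a-fA-F]{8}):\{"([^"]*)"\}=(\w+):(\d+):\{(.*)\}$')

def parse_property(line):
    """Split one EQZT property line into its parts."""
    m = PROP_RE.match(line.strip())
    if m is None:
        raise ValueError("not a property line: %r" % line)
    prop_id, name, dtype, reps, raw = m.groups()
    if dtype == "String":
        values = [raw.strip('"')]
    elif dtype.startswith("Float"):
        values = [float(v) for v in raw.split(",")]
    else:
        # Hex literals; int() accepts the 0x prefix when base 16 is given.
        values = [int(v, 16) for v in raw.split(",")]
    return {"id": int(prop_id, 16), "name": name, "type": dtype,
            "reps": int(reps), "values": values}

lines = [
    '0x00000020:{"Exemplar Name"}=String:0:{"Sperny Production Plant"}',
    '0x27812810:{"Occupant Size"}=Float32:3:{7.032,10.09,15.02}',
    '0x27812834:{"Capacity Satisfied"}=Uint32:2:{0x00004200,0x00000010}',
]
parsed = [parse_property(line) for line in lines]
for prop in parsed:
    print("0x%08x" % prop["id"], prop["name"], prop["values"])
```
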

A decompressed FSH converted to PNG and blown up 350% also seems to be free of artifacts.

@memo I do have a hex editor, I just do not use it much. The Canopy IDE lets me view the hex raw format, or convert it as needed, and allows me to inspect just the first handful of elements more easily.
I hope to wrap this in a GUI and publish it eventually, so using Wouanagaine's code is problematic. Though I bet that his C code is many times faster for this than my Python code. The alternative would be to use the Mapper as a dependency. I think it will be easier to write my own script than to figure out how to call SC4Mapper, feed in the compressed files, get the output, tweak the output, then feed it back into Mapper for recompression, if this is even possible (which I doubt).

Now to work on compression...
This looks like it might be tough.

Once again, thank you all for directing me to good stuff. This would have been very hard to do without some example code.
Will keep you posted on progress.
Any tips on compression would be very welcome.

catty · September 30, 2014, 09:55:31 AM

I'm not an expert, not even close, but it certainly looks OK to me.

JoeST · September 30, 2014, 10:56:46 AM

Is Dav1dde's Python QFS implementation just totally not helpful? Is it possible to patch it with ideas from wou's code? I remember at some point making wou's thing into a standalone module so Mapper itself wouldn't need to be a dependency, just the QFS implementation.

memo · September 30, 2014, 11:03:45 AM

Great to see you making progress!

Quote from: joshua43214 on September 30, 2014, 09:17:29 AM
I hope to wrap this in a GUI and publish it eventually, so using Wouanagaine's code is problematic.

I don't see a problem with that. His code is on GitHub, so it is open source. Besides, the comments in the file suggest that the code wasn't even written by him.

Quote from: joshua43214 on September 30, 2014, 09:17:29 AM
Any tips on compression would be very welcome

I found this C++ implementation very helpful in understanding the compression algorithm, including a not-so-stupid implementation (see bottom of dbpf.cpp file).

joshua43214 · October 06, 2014, 06:21:22 PM

Well, I've made some more progress with the code.
The pointers you folks have given me have been immensely helpful figuring this out.

@JoeST: I did finally grab the .py file you linked for compression. I gave up trying to figure out how to create the control characters. I have not tried his decompression yet, but I will get around to comparing speeds. I was avoiding his code because at the beginning of the compression section it has a comment that says "#TODO: fix me." I did some minor debugging and it runs so far.

Figuring out the EQZB files has been the hardest part so far.
Some samples:

EQZB1###
0 0 0
0x10 11
0x20 FARHW-4 Straight
0x27812820 (1523640343, 3134937073L, 1578306576)

EQZB1###
0 0 0
0x10 16
0x20 BlaM MX I-d1 1x1 Sperny Production Plant
0x27812837 1
0x4a4a88f0 8
0x699b08a4 0
0x88edc789L 2
0x88edc790L (1, 1)
0x88edc791L (1879048203,)
0x88edc792L 1073741824
0x88edc793L (8, 9)
0x88edc795L (2,)
0x88edc796L (6,)
0x88edc798L (3379372343L,)
0x88edc900L (2, 0, 0, 524288, 0, 524288, 0, 0, 1048576, 1048576, 0, 3392270064L, 625082368)
0x88edc901L (0, 0, 2, 250305, 0, 495452, 19880, 3276, 480729, 987627, 0, 3392270065L, 2455891799L)
0x88edc902L (1, 0, 0, 801653, 0, 290895, 539509, 28751, 1063797, 553039, 0, 3392270066L, 133300224)
0x88edc903L (1, 0, 3, 960630, 0, 896363, 927862, 765291, 993398, 1027435, 0, 3392270071L, 687734784)
0x88edc904L (1, 0, 3, 703560, 0, 686648, 572488, 555576, 834632, 817720, 0, 3392270074L, 513409024)
0x88edc905L (1, 0, 0, 209715, 0, 1018133, 176947, 985365, 242483, 1050901, 0, 3392270076L, 693633024)
0x88fcd877L 2299228947
0xcbe243f7L 1
0xe99b068cL 1106247680

The hex stuff needs to be padded, and the various entries need to be cleaned up and shown in hex, Uint, etc still.
Let me know if they look wrong.

I have some more questions.
Does the game care about the trailing "L" in the hex entries on the left? I trim them off in the dictionaries used in the script for indexing and the like, but the above is an example of output that could get edited and put back into the DAT. I would prefer to get rid of them since it makes matching easier when parsing.

This feels like a really silly question, but...
Where do I find the equivalent to the lot file in the DAT file? It looks like all I am finding is building stuff. The lot should contain the information about the lot itself, zone, growth stage, what props are on the lot, etc.

The companion to the above question is: where are the dependencies listed? Does the game just start drilling down the DATs to find them, or are they listed explicitly somewhere so the game can just grab one file from the DAT?
I know the left column in the Exemplar file listed above corresponds to the exemplar property.
For example:

0x88edc793L (8, 9)

defines the LotConfigPropertyZoneTypes; having two entries does not make much sense in this case either. Also, the LotConfigPropertyZoneDensityTypes is not listed above.

However

0x88edc790L (1, 1)

is the LotConfigPropertySize, and 2 entries do make sense, and they seem right since this is a 1x1 lot.

Lastly,
Does anyone have the Exemplar property list in any kind of delimited text format? I can copy the list here
http://www.wiki.sc4devotion.com/index.php?title=Exemplar_properties
and paste it into Excel and clean out the headers and junk, but the list is also missing the possible values that can be taken for a given property. It would be really nice to have those extra values.

Once again, thank you all for all the hand-holding and helping me through this.
Josh

memo · October 07, 2014, 12:46:58 AM

Quote from: joshua43214 on October 06, 2014, 06:21:22 PM
Does the game care about the trailing "L" in the hex entries on the left?

I assume it would care. In EQZT you mean? The property IDs are strictly (unsigned) 32-bit integers, so the 'L' representing (signed) longs is of no use.
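
For context, the trailing L most likely comes from Python 2 rather than from the file: there, hex() of a value stored as a long returns a string ending in "L", and the ids in the sample that carry it are the ones too large for a signed 32-bit int. Formatting the number explicitly avoids the suffix and pads to eight digits at the same time. A small sketch that behaves the same on Python 2 and 3, with format_id as a made-up name:

```python
def format_id(value):
    """Render a property or TGI id as a zero-padded 32-bit hex string."""
    return "0x%08x" % (value & 0xFFFFFFFF)

ids = [0x10, 0x88edc789, 0xe99b068c]
formatted = [format_id(v) for v in ids]
for text in formatted:
    print(text)
# prints:
# 0x00000010
# 0x88edc789
# 0xe99b068c
```

Keeping the ids as plain integers inside the script and only formatting them on output also makes matching easier than comparing strings.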

Quote from: joshua43214 on October 06, 2014, 06:21:22 PM
Where do I find the equivalent to the lot file in the DAT file? It looks like all I am finding is building stuff. The lot should contain the information about the lot itself, zone, growth stage, what props are on the lot, etc.

Check the exemplar type: 0x02 is buildings, 0x10 is lot configurations.

Quote from: joshua43214 on October 06, 2014, 06:21:22 PM
The companion to the above question is: where are the dependencies listed? Does the game just start drilling down the DATs to find them, or are they listed explicitly somewhere so the game can just grab one file from the DAT?

As far as I know, the game loads all the TGIs at start up, so that it knows which file contains which TGIs. It also loads exemplar files (i.e. parses them), possibly filtered by exemplar type. Think of it like all TGIs are organized in a hierarchical tree. For example, LotConfig exemplars have a reference to prop exemplars which refer to S3D models which refer to FSH textures. They are loaded when needed. The game does not know about one dat file depending on another dat file. It just knows whether the dependent TGI is in your plugins or not.
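
For a script, the same idea can be pictured as one big dictionary keyed by TGI. The sketch below only illustrates that lookup and is not a claim about how the game works internally. The file names and most TGI values are invented, and build_tgi_index and missing_dependencies are hypothetical helpers.

```python
def build_tgi_index(files):
    """Map each (type, group, instance) tuple to the files that contain it."""
    index = {}
    for filename, tgis in files.items():
        for tgi in tgis:
            index.setdefault(tgi, []).append(filename)
    return index

def missing_dependencies(index, wanted):
    """Return the wanted TGIs that no scanned file provides."""
    return [tgi for tgi in wanted if tgi not in index]

# Invented plugin contents; in practice these lists would come from
# reading the index table of every file in the plugins folder.
plugins = {
    "lot.SC4Lot": [(0x6534284A, 0x00000001, 0x00000010)],
    "props.dat": [(0x6534284A, 0x00000002, 0x00000020)],
}
index = build_tgi_index(plugins)

wanted = [(0x6534284A, 0x00000002, 0x00000020),
          (0x6534284A, 0x00000003, 0x00000030)]
missing = missing_dependencies(index, wanted)
print(["0x%08x-0x%08x-0x%08x" % tgi for tgi in missing])
```

The point of the design is that the question "which DAT does this depend on" never comes up: a dependency is satisfied if its TGI is a key in the index, regardless of which file supplied it.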

Quote from: joshua43214 on October 06, 2014, 06:21:22 PM
I know the left column in the Exemplar file listed above corresponds to the exemplar property.
For example:
0x88edc793L (8, 9)
defines the LotConfigPropertyZoneTypes; having two entries does not make much sense in this case either. Also, the LotConfigPropertyZoneDensityTypes is not listed above.

Check the explanation of that property in the Reader. It says 8 is medium density industrial zone and 9 is high density, which does make sense.
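
Applied to Josh's second EQZB sample, the exemplar type and zone answers fit together as in the Python sketch below. Only the codes actually named in this thread are filled in, so anything else is reported as unknown; describe_lot and the two lookup tables are illustrative names, not part of any existing tool.

```python
# Codes mentioned in this thread only; the full lists are longer.
EXEMPLAR_TYPES = {0x02: "building", 0x10: "lot configuration"}
ZONE_TYPES = {8: "medium density industrial", 9: "high density industrial"}

# Property ids and values as printed in the EQZB sample (16 == 0x10).
lot = {
    0x00000010: 16,        # Exemplar Type
    0x88edc790: (1, 1),    # LotConfigPropertySize
    0x88edc793: (8, 9),    # LotConfigPropertyZoneTypes
}

def describe_lot(props):
    """Return (exemplar kind, lot size, zone names) for a parsed exemplar."""
    kind = EXEMPLAR_TYPES.get(props[0x10], "unknown")
    zones = [ZONE_TYPES.get(z, "unknown (%d)" % z)
             for z in props.get(0x88edc793, ())]
    return kind, props.get(0x88edc790), zones

description = describe_lot(lot)
print(description)
```

So the two entries are simply a list of zones the lot may grow on, which would also explain why no separate density property is needed.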

Quote from: joshua43214 on October 06, 2014, 06:21:22 PM
Does anyone have the Exemplar property list in any kind of delimited text format? I can copy the list here
http://www.wiki.sc4devotion.com/index.php?title=Exemplar_properties
and paste it into Excel and clean out the headers and junk, but the list is also missing the possible values that can be taken for a given property. It would be really nice to have those extra values.

See here. If I recall correctly, the list is taken from PIMX. Also check out the files of your Reader installation, as it contains a similar XML file.
