new DBPF library for Scala: scdbpf

memo · April 23, 2014, 03:17:23 PM

Hi all,

I'd like to draw your attention to the new DBPF library that I have just published at GitHub: [scdbpf]. It is written in Scala, a programming language that runs on the JVM, and therefore has very close Java interoperability.

The core features of this library are

• portability, owing to the JVM; – I don't usually work on Windows, so I put great value on this.
• immutability (idiomatic in Scala), which makes it inherently thread-safe; – This addresses some of the problems I had while working with the [jDBPFX] library.
• a simple API; – This is preferable for any software. I am not sure if I succeeded, but compared to the Java library, at least when it comes to extending the library, this should be much simpler.

It is almost impossible to get into an invalid state with this library, which also makes it particularly suitable for simple scripts.
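The immutability point can be illustrated without the library itself. The following is only a minimal sketch using hypothetical stand-in types (`SimpleTgi` and `SimpleEntry` are made up for illustration and are not scdbpf classes): `copy` returns a new value and leaves the original untouched, which is why such values can be shared between threads without locking.

```scala
object ImmutabilitySketch {
  // Hypothetical stand-ins for illustration, not the actual scdbpf classes.
  final case class SimpleTgi(tid: Int, gid: Int, iid: Int)
  final case class SimpleEntry(tgi: SimpleTgi, data: Vector[Byte])

  // Returns a new entry with the group ID shifted; the argument is not modified.
  def shiftGid(e: SimpleEntry, delta: Int): SimpleEntry =
    e.copy(tgi = e.tgi.copy(gid = e.tgi.gid + delta))
}
```

Calling `ImmutabilitySketch.shiftGid(entry, 3)` yields a second entry while `entry` keeps its old GID, so no caller can observe a half-updated value.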

The major file types are already supported: Exemplar, Cohort, FSH, S3D, SC4Paths and LTexts. Moreover, I am particularly proud of the QFS compression code, which performs significantly better than the Java code (and took me a while to wrap my head around).

Documentation:

The ScalaDocs can be found [here].

Examples:

To get started, it is easiest to start the REPL via `sbt console`, which loads all the dependencies and useful initial import statements. Reading a DBPF file, sorting its entries by TGI and writing back to the same file could be achieved like this:

val dbpf = DbpfFile.read(new File("foobar.dat"))
dbpf.write(dbpf.entries.sortBy(_.tgi))

This example shifts the GIDs of all LTexts by +3:

dbpf.write(dbpf.entries.map { e =>
  if (e.tgi matches Tgi.LText)
    e.copy(e.tgi.copy(gid = e.tgi.gid + 3))
  else
    e
})

Another example: This decodes all the `Sc4Path` entries and rotates them by 90 degrees.

val writeList = for (e <- dbpf.entries) yield {
  if (e.tgi matches Tgi.Sc4Path) {
    val be = e.toBufferedEntry.convert[Sc4Path]
    be.copy(content = be.content * RotFlip.R1F0)
  } else {
    e
  }
}
dbpf.write(writeList)

The following example finds the first entry that is an exemplar and contains a property with a specific ID. (Note the `view` to avoid unnecessary decoding if the exemplar is already among the first entries.)

val id = UInt(0x12345678)
dbpf.entries.view.
  filter(_.tgi matches Tgi.Exemplar).
  map(_.toBufferedEntry.convert[Exemplar]).
  find(_.content.properties.contains(id))
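The effect of `view` can be seen with plain Scala collections, independent of the library. This is a minimal sketch in which `decode` is a hypothetical stand-in for an expensive conversion; the counter is mutable only so that the number of calls can be observed.

```scala
object ViewSketch {
  // Counts how often the stand-in decoding step runs (mutable for demonstration only).
  var decodeCalls = 0

  // Hypothetical stand-in for an expensive conversion such as decoding an entry.
  def decode(n: Int): String = { decodeCalls += 1; s"decoded-$n" }

  // Strict: decodes every element before the search starts.
  def findStrict(xs: List[Int], wanted: String): Option[String] =
    xs.map(decode).find(_ == wanted)

  // Lazy: decodes elements one by one and stops at the first match.
  def findLazy(xs: List[Int], wanted: String): Option[String] =
    xs.view.map(decode).find(_ == wanted)
}
```

With five elements and a match at the second one, the strict variant decodes all five, whereas the variant using `view` decodes only two.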

Certainly, there are no limitations.

I hope this serves as a short introduction to the library. Please let me know your thoughts, point out essential missing features or contribute directly at GitHub.

Tropod · May 06, 2014, 01:17:56 AM

This looks pretty cool & like it has potential. If I had more spare time I'd take a closer look.
For the benefit of anyone looking at utilising this, or who isn't familiar with Scala, what/where would be the steps to get into it? I.e. I've had a quick read of the docs, but they don't seem to be detailed enough (step by step to get working etc.).

memo · May 06, 2014, 07:39:49 AM

Thank you for your interest. If you are new to Scala, "Programming in Scala, 2nd Ed." by M. Odersky, L. Spoon and B. Venners is an excellent introduction to the language. I immediately fell for the language while reading that book.

Getting started with this library specifically will be a bit difficult without basic knowledge of the Scala language, I admit, but if you know a bit of Scala, it should be more or less easy. I packed Version 0.1.1 of the library yesterday, but for the sake of resolving dependencies, the preferred way of obtaining it is still to build the project yourself. For example, to try the examples above, you would:

git clone https://github.com/memo33/scdbpf.git
cd scdbpf
sbt console

This will start the SBT console, which loads all the dependencies and lets you enter the examples. You'll need the Scala 2.11 compiler, SBT (simple build tool) and Git for this.

From there on, proceeding highly depends on what you are planning to do with this library. Essentially, the following class diagram sums up the basic architecture:

DbpfFile acts as a container for DbpfEntries that can get streamed from a file or buffered in memory. DbpfType is the super trait for the actual DBPF file types, which can be accessed only through buffered entries. This means that if you wanted to add another file type, say EffDir, all you'd need to do is extend the DbpfType trait, which, at a minimum, only requires a byte array. The concrete DBPF types still lack some documentation as they are still subject to change.
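To give a rough idea of what "extend a trait that only asks for a byte array" means in Scala, here is a self-contained sketch. Note that `FileType` and `EffDirSketch` are hypothetical names invented for this illustration; the actual `DbpfType` trait may have different members, so please check the ScalaDocs before relying on any of this.

```scala
object FileTypeSketch {
  // Hypothetical stand-in for a super trait that only asks for the raw bytes.
  trait FileType {
    def bytes: Array[Byte]
    def size: Int = bytes.length
  }

  // Hypothetical new file type: wraps the bytes without interpreting them.
  final class EffDirSketch(raw: Array[Byte]) extends FileType {
    // Defensive copy so that callers cannot mutate the content afterwards.
    private val content: Array[Byte] = raw.clone()
    def bytes: Array[Byte] = content.clone()
  }
}
```

The defensive copies are one possible way to keep such a type immutable even though arrays themselves are mutable on the JVM.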

memo · October 03, 2014, 08:10:44 AM

I am proud to announce the release of version 0.1.4 of this library.

Among small fixes and additions, I figured out a way to host binary files at Google Drive, which allows me to host a private Maven repository of all my released files (which start to become multitudinous), so that all the dependencies are automagically resolved*. Just add the following to your SBT build configuration file if your project depends on this library:

resolvers += "memo33-gdrive-repo" at "https://googledrive.com/host/0B9r6o2oTyY34ZVc4SFBWMV9yb0E/repo/releases/"

libraryDependencies += "com.github.memo33" %% "scdbpf" % "0.1.4"

______
*We need that for SC4 plugins, too.

joshua43214 · October 03, 2014, 11:39:38 AM

This is a very interesting project. It looks like I am mirroring some of your stuff in my own current project that you generously provided some advice to. In particular, I really liked your compression/decompression script. My own early attempts at compression were giving me about a 30% reduction in size, but Python is just too slow for this type of work if dealing with batches of files. That and I have not figured out the best way to generate control characters. I am running into issues compiling Wounagaine's code into my Python code and was about to turn to your code and try to figure out how to call it from Python. Alternately, Scala is so simple I might be able to simply rewrite it as Python.

How does Scala compare to C or python for speed in this type of thing? Compressing one file is no biggie, compressing thousands is another matter.

What is your goal with this project? I am looking at making a sort of general tool GUI for batch editing mods within certain boundaries, and I am also interested in creating a library of TGIs. How much of this parallels your goals?

memo · October 04, 2014, 02:13:13 AM

My goal with this project is to provide a general-purpose DBPF library that is platform-independent, simple and immutable/thread-safe. Most DBPF tools out there do their own DBPF processing which makes it hard to build upon and test so as to assert correctness. Originally, before I started to work on this library, I had been working on a Java GUI similar to yours, but had to put it on hold because I couldn't come up with a suitable threading model in conjunction with the Java library I was using (therefore, the thread-safety of this library). The GUI project is still on hold though, as I don't need it as much anymore.

This project will never be a GUI. Instead, GUI applications are meant to use it, such as JDatPacker, for instance, or a new FSH tool of mine that is published only internally.

Further development will mostly be driven by my needs for my work with the NAM, but I am open to suggestions, too. Most of the time, I need to batch process large amounts of S3D, FSH, Exemplar or SC4Path files, which I can already achieve very comfortably, and most of the work will probably focus on those file types. I understand that not everyone writes their entire program in Scala, but often you just need one-time scripts (which I write in Vim and run from the console), which works much better than the Lua scripting functionality in the Reader, I would say.

Regarding the QFS compression (link to previous discussion, for reference), I have tried to call the native C code via JNI to compare performance. After all the discussion of efficiency, I was curious, too, how my implementation compares to Wounagaine's C code (which is the same code the Reader uses, in fact). It turns out that my implementation is about 15 times faster while still achieving a minimally better compression ratio on average. Apparently, just because Wounagaine's code is written in C doesn't necessarily make it efficient and fast. Though, to be fair, I don't know how much of a performance impact the usage of JNI has, yet it can't be that much.

Anyway, thanks to this experiment, I won't worry about efficiency of writing in Scala instead of C, at all, anymore. My compression is certainly good enough: It compresses 6000 entries with a total size of 111 MiB in about 5 seconds at a compression ratio of about 20% (1:5). Probably, this is faster than the rest of the file IO. The ratio varies in correlation with the input, of course; for example, when applied to the entire NAM repository, it 'only' achieves about 33%.
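As a back-of-the-envelope check of these figures, taking the approximate numbers above at face value (they are rounded, so the results are only rough):

```scala
object CompressionFigures {
  val inputMiB = 111.0 // total uncompressed size
  val seconds  = 5.0   // approximate compression time
  val entries  = 6000  // number of entries
  val ratio    = 0.20  // compressed size divided by original size

  // Roughly 22 MiB of input processed per second.
  val throughputMiBPerSec: Double = inputMiB / seconds
  // Roughly 22 MiB of compressed output in total.
  val outputMiB: Double = inputMiB * ratio
  // Roughly 19 KiB of input per entry on average.
  val avgEntryKiB: Double = inputMiB * 1024 / entries
}
```

So the entries are small on average, and the throughput is in the range of tens of MiB per second.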

I don't know how you would go about calling Scala from Python. If you find a way, make sure you don't start a new JVM for each entry to compress (of the thousands). If, on the other hand, you translate the code to Python, depending on how literally you can replicate it, I'd suggest replacing the recursive functions with while-loops, as Python does not have tail-call optimization, in order to avoid huge performance impacts. I would be interested to see how that turns out. Don't worry too much about performance, though: correctness is a million times more important. And after all, you only need to re-compress entries that have been edited.
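To illustrate the kind of rewrite meant here, the sketch below shows a tail-recursive function next to an equivalent while-loop. Both helpers are hypothetical and written only for this illustration (they are not taken from the QFS code); they count how many bytes match when comparing the data at position `i` with the data at a later position `j`. The loop form is the one that carries over directly to Python.

```scala
import scala.annotation.tailrec

object LoopSketch {
  // Tail-recursive form: the Scala compiler compiles this into a loop.
  // Assumes 0 <= i < j; call with len = 0.
  @tailrec
  def matchLengthRec(a: Array[Byte], i: Int, j: Int, len: Int): Int =
    if (j + len < a.length && a(i + len) == a(j + len))
      matchLengthRec(a, i, j, len + 1)
    else
      len

  // Equivalent while-loop: the accumulator parameter becomes a local variable.
  def matchLengthLoop(a: Array[Byte], i: Int, j: Int): Int = {
    var len = 0
    while (j + len < a.length && a(i + len) == a(j + len)) len += 1
    len
  }
}
```

The recipe is mechanical: each parameter that changes in the recursive call becomes a `var`, and the condition guarding the recursive call becomes the loop condition.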

PS: If you find Scala simple, why not switch?

joshua43214 · October 04, 2014, 07:52:36 AM

Quote from: memo on October 04, 2014, 02:13:13 AM
...It turns out that my implementation is about 15 times faster...
...
...
PS: If you find Scala simple, why not switch?

That is very impressive indeed, I can see why you are proud of it.

As for switching: my degrees are in math and genetics, so I need toolboxes that can do things like genetic alignments and Matlab style numerical analysis, plus being able to easily call R. Python is rapidly becoming the go-to language in the computational biology fields because Perl is impossible to learn (try comparing a Google search for any programming problem in Perl vs. Python - Perl people can only talk to other Perl people). Given a solid background in Matlab, I went from Hello World to functional programming in less than two days. Python may be a toy language or just a current fad, but it has gained traction for professionals who need to be able to program but do not have a CSE background, and don't have time to waste deciphering cryptic responses to common programming problems.
I really like the look of Scala though; it looks a bit like C, but it is pretty easy to follow most of what is going on. I will play around with calling Scala from Python and let you know how it goes.

memo · October 05, 2014, 04:38:26 AM

I am majoring in maths (algebra, not numerical analysis), too, and I just love the strong static type system and functional paradigms. If used properly, it is extremely powerful. I have worked with Matlab and Python only rarely, but I agree that Python is much more of a standard than Scala. I see your reasons and I don't want to convince you, but comparing Scala with C (aside from curly braces) is on the verge of blasphemy.
