Tuesday, May 19, 2009

Serialization for D part 1 of N

I'm planning on starting work on a template-based serialization library for D. The objective is to be able to generate, from a line or two of code per type, the functions needed to stuff complex types into a file, network socket, byte buffer or whatever. At this point I'm still puzzling out what the interface will be.
  1. What will using this library with a type look like?
  2. What will invoking the code look like?
  3. What format will be generated?
  4. What will the source or destination for the data look like?
  5. How to work with polymorphic types?
  6. What types of limitations will it impose?
Taking each of these in the order I thought of them:

Usage

Ideally I want using the library to be as simple as possible; a single line of code would be best. For built-in types, this is easy as the system should "just work". User-defined types will be the more interesting case. I expect I will want to build several different solutions for different cases. For example, in the case where everything should be just slurped up or spit out, it should be as simple as a no-argument mixin:
    class C
    {
        int i;
        someClass c;
        mixin Serializable!();
    }
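
As a rough idea of what could sit behind such a mixin, here is a minimal sketch. It assumes D2 and Phobos (`std.array.appender` as a stand-in sink, `std.conv.to` for built-in types), uses structs to sidestep null class references, and emits an XML-like format. The `Point` and `Line` types and the body of `Serializable` are hypothetical, not a settled design.

```d
import std.array : appender;
import std.conv : to;

// Hypothetical mixin: writes every field as <name>value</name>.
mixin template Serializable()
{
    void Serialize(Sink)(ref Sink sink)
    {
        sink.put("<" ~ typeof(this).stringof ~ ">");
        // tupleof is unrolled at compile time, one iteration per field.
        foreach (i, ref field; this.tupleof)
        {
            enum name = __traits(identifier, typeof(this).tupleof[i]);
            sink.put("<" ~ name ~ ">");
            static if (is(typeof(field.Serialize(sink))))
                field.Serialize(sink);      // field brings its own Serialize
            else
                sink.put(to!string(field)); // assumed to be a built-in type
            sink.put("</" ~ name ~ ">");
        }
        sink.put("</" ~ typeof(this).stringof ~ ">");
    }
}

struct Point
{
    int x;
    int y;
    mixin Serializable!();
}

struct Line
{
    Point a;
    Point b;
    mixin Serializable!();
}

void main()
{
    auto sink = appender!string();
    auto line = Line(Point(1, 2), Point(3, 4));
    line.Serialize(sink);
    assert(sink.data == "<Line><a><Point><x>1</x><y>2</y></Point></a>"
        ~ "<b><Point><x>3</x><y>4</y></Point></b></Line>");
}
```

The one line of user code per type is the `mixin Serializable!();` itself; everything else is generated from the field list.
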
Other cases, for instance where some values need to be omitted or where proper use of constructors needs to be observed, could use other approaches.
  • Process only the members on a given list
  • Process only the members not on a given list.
  • Deserialize by calling a constructor on the given values
  • Tag nodes on the way out and resolve references so that multiply referenced objects come through correctly resolved.
  • Various combinations of the above
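
The "members not on a given list" case from the list above might look something like the sketch below. It assumes D2 and Phobos; `SerializableExcept`, `Session` and the field names are made up for illustration, and only built-in field types are handled.

```d
import std.algorithm.searching : canFind;
import std.array : appender;
import std.conv : to;

// Hypothetical mixin: serializes every field whose name is not listed.
mixin template SerializableExcept(skipped...)
{
    private enum string[] skipList = [skipped];

    void Serialize(Sink)(ref Sink sink)
    {
        foreach (i, ref field; this.tupleof)
        {
            enum name = __traits(identifier, typeof(this).tupleof[i]);
            // The check runs at compile time, so skipped fields cost nothing.
            static if (!skipList.canFind(name))
                sink.put("<" ~ name ~ ">" ~ to!string(field) ~ "</" ~ name ~ ">");
        }
    }
}

struct Session
{
    string user;
    int hits;
    string cachedHash; // derived data, not worth writing out
    mixin SerializableExcept!("cachedHash");
}

void main()
{
    auto sink = appender!string();
    auto s = Session("bob", 3, "deadbeef");
    s.Serialize(sink);
    assert(sink.data == "<user>bob</user><hits>3</hits>");
}
```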

One other bit that might work is a module-level "make everything serializable" template that does default serialization of every type in the module. The problem with this is that, to make it work, the system as a whole will need to deal with both external and internal definitions of serialization functions. On the other hand, this will be needed anyway for built-in types and maybe 3rd-party libraries.

One minor point is how to structure everything. Because there is a lot of redundancy in the different combinations, some way to reuse code is needed. Something like the recursive self-mixin trick might do well.

Invocation

The hard part here is the naming (one of the two most difficult tasks in programming). For now I'm just going to go with the simple solution:

    MyType myType;
    myType = MyType.Deserialize(source);
    myType.Serialize(sink);
If anyone has a better idea what to name them I'm open to suggestions.

The other side is how to structure adding serialization for external types. Built-in types are easy: a simple template function that gets called if T.Serialize doesn't exist (or the other way around). The harder case is 3rd-party types as (last I checked) getting templates to overload from several sources is hard.
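
The built-in fallback could be sketched roughly as follows, assuming D2 and Phobos. The free function `serialize` and the `Celsius` type are hypothetical names; the point is only the compile-time check for a member `Serialize`.

```d
import std.array : appender;
import std.conv : to;

struct Celsius
{
    int degrees;

    // User-defined types provide their own member.
    void Serialize(Sink)(ref Sink sink)
    {
        sink.put("<Celsius>" ~ to!string(degrees) ~ "</Celsius>");
    }
}

// Hypothetical entry point: prefer T.Serialize, else treat T as built-in.
void serialize(T, Sink)(T value, ref Sink sink)
{
    static if (is(typeof(value.Serialize(sink))))
        value.Serialize(sink);
    else
        sink.put(to!string(value));
}

void main()
{
    auto sink = appender!string();
    serialize(42, sink);          // no member Serialize: falls back
    serialize(Celsius(21), sink); // member Serialize is used
    assert(sink.data == "42<Celsius>21</Celsius>");
}
```

This covers the easy half only. A 3rd-party type that has no member `Serialize` and is not a built-in would still need some external hook, which is the overloading problem mentioned above.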

Format

For this there are a few interesting classes of options:

  1. Machine specific binary
  2. Machine neutral binary
  3. Text: XML/JSON/YAML/CSV

For starters, I'm going to do just XML as I can read it so debugging will be easier. After that, I might go for a machine specific binary format and try to abstract out the rendering parts to reduce redundancy.

Data Source/Sink

This is one point I'm not sure on. What should be the primitive source and sink? A Phobos or Tango stream would be one option, but it would be nice to be Phobos/Tango agnostic. Also, might other systems be desirable? I think I'll try to keep the implementation-dependent code for this loosely coupled and well contained so switching later is easier.
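
One way to stay library agnostic, offered only as a sketch, is to make the primitive sink a plain delegate that accepts chunks of text. The alias `Sink` and the helper `writeTag` below are hypothetical; any stream type could then be adapted with a one-line wrapper delegate.

```d
// Hypothetical primitive sink: anything that can accept a chunk of text.
alias Sink = void delegate(const(char)[] chunk);

// Serialization code only ever sees the delegate, never a stream type.
void writeTag(Sink sink, string name, string text)
{
    sink("<");
    sink(name);
    sink(">");
    sink(text);
    sink("</");
    sink(name);
    sink(">");
}

void main()
{
    // Adapter for an in-memory buffer; a file or socket adapter
    // would have the same shape.
    char[] buffer;
    Sink toBuffer = (const(char)[] chunk) { buffer ~= chunk; };

    writeTag(toBuffer, "i", "42");
    assert(buffer == "<i>42</i>");
}
```

A matching source would need a pull-style counterpart, which this sketch does not attempt.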

Derived Types

One major problem will be how to handle polymorphic instances. The best I can think of is to have them embed a type tag into the output stream and then use some sort of tag-to-delegate map to deserialize the stream. This has some interesting implications for the operation; for instance with XML, it indicates that the outermost tag of a type should, at least in part, be serialized by the child type but deserialized by the parent type. One option is, given these types

    class C { }
    class D : C { int i; }
    class E : C { float x; }

    struct S { C ofTypeC; }

an instance of type S is serialized as:

    <S><ofTypeC typeis="D"><i>42</i></ofTypeC></S>

This would have S.Serialize output up to the attribute and then hand off via a virtual Serialize call to the member. On the other hand, S.Deserialize would consume the full ofTypeC tag and then use the attribute to call the correct Deserialize function. This clearly ends up adding some annoying overhead, so some way to detect (or indicate) that no derived types will be used (and check to make sure that is so) would be nice.
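
A hand-written sketch of that hand-off, assuming D2 and Phobos, is shown below. The method names `typeTag` and `serializeBody`, the `Factory` alias, the registry, and the naive string slicing that stands in for a real XML parser are all hypothetical; a real library would generate most of this.

```d
import std.conv : to;
import std.string : indexOf;

class C
{
    string typeTag() { return "C"; }
    string serializeBody() { return ""; }
}

class D : C
{
    int i;
    override string typeTag() { return "D"; }
    override string serializeBody() { return "<i>" ~ to!string(i) ~ "</i>"; }
}

class E : C
{
    float x;
    override string typeTag() { return "E"; }
    override string serializeBody() { return "<x>" ~ to!string(x) ~ "</x>"; }
}

// Text between <tag> and </tag>; assumes the tag is present.
string innerText(string xml, string tag)
{
    auto open = "<" ~ tag ~ ">";
    auto start = xml.indexOf(open) + open.length;
    return xml[start .. xml.indexOf("</" ~ tag ~ ">")];
}

// Hypothetical tag-to-factory map used on the way in.
alias Factory = C function(string content);

Factory[string] makeRegistry()
{
    Factory[string] r;
    r["C"] = function C(string content) { return new C; };
    r["D"] = function C(string content) {
        auto d = new D; d.i = to!int(innerText(content, "i")); return d; };
    r["E"] = function C(string content) {
        auto e = new E; e.x = to!float(innerText(content, "x")); return e; };
    return r;
}

struct S
{
    C ofTypeC;

    // The parent writes the tag and attribute; the child writes the body.
    string Serialize()
    {
        return "<S><ofTypeC typeis=\"" ~ ofTypeC.typeTag() ~ "\">"
            ~ ofTypeC.serializeBody() ~ "</ofTypeC></S>";
    }

    // The parent reads the attribute and picks the child's factory.
    static S Deserialize(string xml, Factory[string] registry)
    {
        enum marker = "typeis=\"";
        auto tagStart = xml.indexOf(marker) + marker.length;
        auto tagEnd = tagStart + xml[tagStart .. $].indexOf('"');
        auto content = xml[tagEnd + 2 .. xml.indexOf("</ofTypeC>")];
        return S(registry[xml[tagStart .. tagEnd]](content));
    }
}

void main()
{
    auto d = new D;
    d.i = 42;
    auto xml = S(d).Serialize();
    assert(xml == `<S><ofTypeC typeis="D"><i>42</i></ofTypeC></S>`);

    auto back = S.Deserialize(xml, makeRegistry());
    auto asD = cast(D) back.ofTypeC;
    assert(asD !is null && asD.i == 42);
}
```

The virtual calls and the map lookup are the overhead in question; a type known to have no derived types could skip both.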

Limitations

One thing I'm not going to handle is matching up references for aliasing and slicing of arrays. Just because something is a slice of something else on one side doesn't make it so on the other.
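
To make the slicing limitation concrete, here is a small D2 sketch. The `.dup` calls are only a stand-in for writing each array out and reading it back independently; they are not part of any planned API.

```d
void main()
{
    int[] whole = [1, 2, 3, 4];
    int[] part = whole[1 .. 3]; // a slice shares memory with `whole`

    part[0] = 99;
    assert(whole[1] == 99);     // the write shows through the alias

    // Stand-in for a serialize/deserialize round trip of each array.
    int[] wholeCopy = whole.dup;
    int[] partCopy = part.dup;

    partCopy[0] = 7;
    assert(wholeCopy[1] == 99); // the copies no longer alias each other
}
```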

Aside from that, I don't know what limitations I'm going to impose, but I'm sure I'll find out as I go along.

10 comments:

  1. I like where this is going.

    At one point I was tempted to make such a thing, but this is much more well thought out already.

    Two feature requests then:
    1. bsd/png/zlib license.
    2. Some way to do delta encoding on repeated serializations, and ultimately buffer the stuff over UDP. At that point it seems like this could be an effective cheap & easy to use way to do net code for multiplayer games.

    Perhaps point 2 could be generalized and allow for some arbitrary callback to be executed every time something is serialized or deserialized.

  2. I plan on a very liberal license; a "don't claim you wrote it, don't sue me" kind of thing.

    I think deltas would be better served on top of this. The user creates a data structure that represents the change and ships that. I think some kind of general template solution could be built for that separate from this.

  3. If it helps, you can have a look at my own serialization code. Basically, reflection.d uses tupleof to get the names/offsets/types of all members of structs and classes, and builds custom RTTI out of it. serialize.d uses this RTTI to dump an object graph as text (XML-like; it's first written to what corresponds to a DOM tree in XML).

    Here's the code:

    http://svn.gna.org/viewcvs/lumbricus/trunk/src/utils/reflection.d?rev=757&view=auto
    http://svn.gna.org/viewcvs/lumbricus/trunk/src/utils/serialize.d?rev=754&view=auto

  4. Sounds good BCS. Adding another template layer for deltas is fine by me. Looking forward to your serialization lib!

  5. I must confess I didn't read the whole article, but I implemented a serialization thingy already and it works well for what I need.
    Feel free to use it in any respect you consider useful:
    http://leetless.de/gitweb?p=indiana-game-engine.git;a=tree;f=lib/indiana_lib/s11n;h=5924e151bb36283a240bd946e7e1d96df5b9ddd2;hb=HEAD

  6. Out of curiosity, why machine-specific endianness for the binary format?

    Since you mentioned Tango, are you planning this to be D1 compatible? Awesome.

    As for a sink, maybe define a particular SPI a sink must follow and provide wrapper templates for standard library streams that conform to that interface.

    If there's any way I can help, give me a holler; this looks like a great idea!

  7. I'm looking at machine-specific endianness because it would allow for things like binary copies of data. The other option is to convert everything to and from network byte order on the way through. If the code will only ever run on one type of computer this would be a waste (and one more thing to get right).

    I expect most of the effort to be of the puzzling-things-out type, and that doesn't fork off well. That said, I expect I'll be looking for beta testers and code reviewers at some point. (And it would be easier to give you a holler if I knew who "Anonymous" is ;D)

  8. I hope this is a first step to get a soap server or a soap client.

  9. SOAP wouldn't be too hard to add, but at this point I'm not specifically targeting it. Its naming would be a problem for one thing.

  10. Hi BCS, you might want to take a look at the serialization I did in blip ( http://dsource.org/projects/blip )

    It is quite flexible, and supports binary and json for the moment.

    I think that most of it is orthogonal to what you want to do (template magic to make it easy).

    Optionally now one can use Xpose with it to have a sort of simple serialization support like you sketched, but the use of Xpose1 is problematic with ldc

    Maybe we could collaborate...
