Serializing Java: inspecting data structure

May 12, 2017

Every deserializer needs information about types, that is, the data structure. In my case the deserializer may run on the other side of a network connection, so it can’t rely on the reflection mechanism. That’s why the serializer has to discover the data structure and serialize it in a form that can be sent over the network and understood by the deserializer.

This is Serializing Java - an emerging series about writing your own serializer in case you don’t like the existing ones.

Data vs data structure

It’s very important to differentiate between two things. The data structure is not the data per se. My definition of data structure here is: a hierarchical description of a type, which is a direct result of inspecting that type.

Example:

class GameObject {
  Vector3 position;
  Vector3 size;
}

class Vector3 {
  float x, y, z;
}

While inspecting the GameObject type, to get a full description we have to inspect all of its fields: position and size. The types of those fields have to be inspected too. Both are Vector3, which in turn has 3 fields: x, y, z.

So, the data structure in this case is a list of the fields in GameObject and Vector3, where each field is described by a name, a type and a parent type. The parent type is the type that declares the field; for instance, the field size has parent type = GameObject.
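A minimal sketch of such an inspection, assuming plain reflection on the serializer side. The names StructureInspector and FieldInfo are made up for this illustration and are not taken from the series; superclass fields and JDK types are deliberately left unhandled to keep it short.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: builds the "list of fields" description of a type.
class StructureInspector {

    /** One entry of the data structure: field name, field type, parent type. */
    static final class FieldInfo {
        final String name;
        final String type;
        final String parentType;

        FieldInfo(String name, String type, String parentType) {
            this.name = name;
            this.type = type;
            this.parentType = parentType;
        }

        @Override
        public String toString() {
            return parentType + "." + name + " : " + type;
        }
    }

    /** Walks the fields of root and of every field type reachable from it. */
    static List<FieldInfo> inspect(Class<?> root) {
        List<FieldInfo> result = new ArrayList<>();
        collect(root, new HashSet<>(), result);
        return result;
    }

    private static void collect(Class<?> type, Set<Class<?>> visited, List<FieldInfo> out) {
        // Primitives have no fields; a type seen before is already described.
        if (type.isPrimitive() || !visited.add(type)) {
            return;
        }
        for (Field field : type.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue; // only instance fields carry data
            }
            out.add(new FieldInfo(field.getName(),
                    field.getType().getSimpleName(),
                    type.getSimpleName()));
            collect(field.getType(), visited, out);
        }
    }

    public static void main(String[] args) {
        // Prints one line per field, e.g. "GameObject.position : Vector3".
        inspect(GameObject.class).forEach(System.out::println);
    }
}

class GameObject {
    Vector3 position;
    Vector3 size;
}

class Vector3 {
    float x, y, z;
}
```

Vector3 is described only once even though two fields use it, so the result has five entries: two for GameObject and three for Vector3. The order of the entries is not guaranteed, because getDeclaredFields makes no promise about ordering.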

Pass-by: value vs reference

Java is pass-by-value:

Java works exactly like C. You can assign a pointer, pass the pointer to a method, follow the pointer in the method and change the data that was pointed to. However, you cannot change where that pointer points.

In C++, Ada, Pascal and other languages that support pass-by-reference, you can actually change the variable that was passed.

A Java reference is not a pointer like in C++. We can’t manipulate a reference. We either have a reference or we don’t - in the latter case we have null.

Based on that, we could state that serializing data in Java is all about values:

primitive types
a reference, which could be written as some identifier in the form of an Integer
null - which tells us that this is neither primitive value nor a reference

In fact, internally a reference is something like an ID with a pointer (or actually, pointers). All of this seems flat - we have byte, short, int, long and a special case - null. Then there are Strings, which are a length + an array of bytes. It’s fairly easy to serialize.
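To make the "flat" view concrete, here is a rough sketch of writing the three kinds of values plus a String. The class FlatValueWriter, the tag bytes and the way ids are assigned are all assumptions made for this example, not the format used in the series.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical writer: primitives, references as integer ids, and null.
class FlatValueWriter {
    // Hypothetical one-byte markers, not part of the original post.
    static final byte TAG_NULL = 0;
    static final byte TAG_REFERENCE = 1;

    // Fixed capacity keeps the sketch short; a real writer would grow.
    private final ByteBuffer buffer = ByteBuffer.allocate(1024);
    // Identity-based map: equal but distinct objects get distinct ids.
    private final Map<Object, Integer> ids = new IdentityHashMap<>();

    void writeInt(int value) {
        buffer.putInt(value);
    }

    /** A String is written as its length followed by its bytes. */
    void writeString(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        buffer.putInt(utf8.length);
        buffer.put(utf8);
    }

    /** A reference is either the null marker or a marker plus an integer id. */
    void writeReference(Object target) {
        if (target == null) {
            buffer.put(TAG_NULL);
            return;
        }
        Integer id = ids.get(target);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(target, id);
        }
        buffer.put(TAG_REFERENCE);
        buffer.putInt(id);
    }

    byte[] toByteArray() {
        return Arrays.copyOf(buffer.array(), buffer.position());
    }

    public static void main(String[] args) {
        FlatValueWriter writer = new FlatValueWriter();
        Object shared = new Object();
        writer.writeInt(7);            // 4 bytes
        writer.writeString("abc");     // 4 + 3 bytes
        writer.writeReference(null);   // 1 byte
        writer.writeReference(shared); // 1 + 4 bytes
        writer.writeReference(shared); // same id again, 1 + 4 bytes
        System.out.println(writer.toByteArray().length); // 22
    }
}
```

Writing the same object twice produces the same id, which is what would let a reader on the other side rebuild shared references.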

How many types you’ll deal with

However, it’s not that easy. Frankly speaking, Java complicates things a lot.

For starters, we might think there are only these kinds of types to consider:

primitive type
boxed primitive type (a reference or null)
class

The above is basically what you see when you code your stuff as usual. You know, implementing your software, game or whatever. Here’s some Integer, here’s some Class (having multiple fields), here’s some String.

And here’s what you’ll see when implementing a custom serializer:

enum (!)
inner class or inner enum
null value
array of primitive type
array of boxed primitive type
array of (any) objects
array of enum
collections
cyclic references
references to static things
arrays of arrays…
arrays of collections

Now this is some list, isn’t it? Let’s dig into some of them.

Enum

Enum is a special case. It’s often treated like a String because the code reads nicer. However, in terms of its value it is closer to an Integer, since every constant has an ordinal. Couldn’t we just flatten enums into Integers? Well no, we want both pieces of information:

enumeration names
enumeration values

What’s worse, it’s a nullable thing, so it doesn’t behave like a primitive int. It’s not safe to treat it as a primitive. Why? Well, it hasn’t been there since the first version of Java. Here’s what an enum really is:

public abstract class Enum<E extends Enum<E>>
    implements Comparable<E>, Serializable {

    // [...]

    /**
     * Returns true if the specified object is equal to this
     * enum constant.
     *
     * @param other the object to be compared for equality with this object.
     * @return  true if the specified object is equal to this
     *          enum constant.
     */
    public final boolean equals(Object other) { 
        return this==other;
    }
}

Enum is a regular class, and it was introduced only in JDK 1.5 (Java 5).

What’s a little worse is an array of enum values.
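As a sketch of collecting both pieces of information, names and values, the standard Class.getEnumConstants, Enum.name and Enum.ordinal calls are enough. The Direction enum, the EnumInspector class and the -1 marker for null are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: describes an enum type and encodes a nullable enum value.
class EnumInspector {
    // Hypothetical enum used only for this illustration.
    enum Direction { NORTH, EAST, SOUTH, WEST }

    // Hypothetical marker for a null enum reference.
    static final int NULL_ORDINAL = -1;

    /** Maps each constant name to its ordinal, in declaration order. */
    static Map<String, Integer> describe(Class<? extends Enum<?>> enumType) {
        Map<String, Integer> constants = new LinkedHashMap<>();
        for (Enum<?> constant : enumType.getEnumConstants()) {
            constants.put(constant.name(), constant.ordinal());
        }
        return constants;
    }

    /** An enum field may hold null, so the encoding needs a value for that too. */
    static int encode(Enum<?> value) {
        return value == null ? NULL_ORDINAL : value.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(describe(Direction.class)); // {NORTH=0, EAST=1, SOUTH=2, WEST=3}
        System.out.println(encode(Direction.SOUTH));   // 2
        System.out.println(encode(null));              // -1
    }
}
```

The name-to-ordinal table belongs to the data structure and only needs to be sent once, while each enum value in the data can then travel as a single integer.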

Array of primitive and boxed types

int is a primitive, while Integer is a boxed primitive. A boxed primitive is a class instance, so it can be replaced with null.

As I would love things to be efficient in terms of network throughput, I would like to write an array of boxed Integers as a series of primitive integers. Whether I can do this or not - depends:

int[] arr = new int[] { 0, 4, 5 };
arr[1] = null; // this line is wrong! Won't compile!

Integer[] arr = new Integer[] { 0, 4, 5 };
arr[1] = null; // this is fine!

…depends on the array’s component type, which in this case is int or Integer.

During serialization of the data structure, this forces me to declare whether a given array is an array of a primitive type or not.
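One way to make that declaration is to ask the array class for its component type. This is only a sketch and ArrayKind is a made-up name, but Class.isArray, Class.getComponentType and Class.isPrimitive are standard reflection calls.

```java
// Hypothetical helper: tells primitive arrays apart from all other arrays.
class ArrayKind {

    /** True only for arrays whose elements are primitives, such as int[]. */
    static boolean isPrimitiveArray(Class<?> type) {
        return type.isArray() && type.getComponentType().isPrimitive();
    }

    public static void main(String[] args) {
        System.out.println(isPrimitiveArray(int[].class));     // true
        System.out.println(isPrimitiveArray(Integer[].class)); // false
        // The component type of int[][] is int[], which is an array, not a primitive.
        System.out.println(isPrimitiveArray(int[][].class));   // false
    }
}
```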

Array of everything

Let’s inspect this class:

class GameState {
  Entity[] entities;
}

A very simple and popular case where the component type of the array is not definite - the elements could be of a subtype. Arrays in Java are covariant, and an element can be an instance of any subclass of the component type. What it means, basically, is that I could instantiate an object of a GameEntity class which extends Entity and put that object into entities. That makes things harder. An array of Entity (Entity[]) can’t be inspected deeply during serialization of the data structure. It’s possible (or: it only makes sense) to inspect the type of each array element during serialization of the data.
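A small sketch of the problem: the declared component type says Entity, but the concrete type of each element is known only from the instances themselves. The empty Entity and GameEntity classes and the runtimeTypes helper are stand-ins written for this example.

```java
import java.util.Arrays;

// Hypothetical demo: declared component type versus runtime element types.
class CovariantArrayDemo {
    static class Entity { }
    static class GameEntity extends Entity { } // hypothetical subclass, as in the text
    static class GameState { Entity[] entities; }

    /** Reads the concrete class of every element, which is only possible with data in hand. */
    static String[] runtimeTypes(Object[] array) {
        String[] names = new String[array.length];
        for (int i = 0; i < array.length; i++) {
            names[i] = array[i] == null ? "null" : array[i].getClass().getSimpleName();
        }
        return names;
    }

    public static void main(String[] args) {
        GameState state = new GameState();
        state.entities = new Entity[] { new Entity(), new GameEntity(), null };

        // What structure inspection can see: the component type of the array.
        System.out.println(state.entities.getClass().getComponentType().getSimpleName()); // Entity
        // What only the data reveals: the concrete type of each element.
        System.out.println(Arrays.toString(runtimeTypes(state.entities))); // [Entity, GameEntity, null]
    }
}
```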

Summary: Discovering data structure

There’s the data and there’s the data structure.

I want to inspect only those types that are needed to transmit data over the network, not the whole world of classes in a JVM process.

So, serializing all of this is a process that can’t be cleanly split into serializing the structure and serializing the data. Well, it could be, but read the sentence above or go back to my previous articles to understand more about my needs here.
