Serializing Java: cyclic dependencies
Serializer can detect cyclic dependencies in two essential phases:
- inspection of data structure
- data serialization
Let’s dig into both of them.
This is Serializing Java - emerging series about writing your own serializer in case you didn’t like other serializers.
Cyclic?
Cyclic dependency is a situation when something is referred infinitely in a circle during recursion. It may be referred directly or indirectly but the (serialization) algorithm always stumble upon it on it’s path.
Going with circular data without taking special approach is futile because it won’t ever finish. To solve this, we need some algorithms to detect and break forever loops.
However, the serializer has two similar but distinct problems:
- circles in types
- circles in data
Circles in types
Those circles are to be found in data structure inspection.
The simpliest situation is a direct reference where a field of class is of the same type as the class itself:
public class CyclicClass {
public CyclicClass other;
public CyclicClass alwaysNull = null;
}
By the way, here’s a unit test for this case (written in Kotlin) for my serializer built into Artemis Entity Tracker:
@Test
fun inspect_cyclic_reference() {
val obj = CyclicClass()
obj.other = CyclicClass()
obj.other.other = obj
val model = inspector.inspect(obj.javaClass)
assertEquals(2, model.children!!.size.toLong())
assertEquals(model.id.toLong(), model.children!!.elementAt(0).id.toLong())
assert(model !== model.children!!.elementAt(0))
}
The other field could refer to the same object or an another instance (as above). It doesn’t really matter because type inspection is a separate thing to actual data. During type inspection we look for fields, then types of those fields, then types of fields of types for root fields and so on. Recursion plays a role but please note that exchanging a recursive function call to with a flat loop filling a stack makes no difference.
Detection is rather simple. As the inspection algorithm goes down through the structure of types it needs to remember everything that has been visited. So, if algorithm stumbles upon a type thas was already visited then it should stop. Instead of going deeper it should simply reference to an inspection result of the type that was done before.
The only realization we need to consider is how we reference other types in a structure. Guess what - IDs will help. Always create an ID for nodes in graphs, then you can refer to by IDs instead of references. In fact, language references are not really serializable so if you want to serialize your data structure then you have to mark nodes with IDs anyway.
To be sure that our algorithm works we would need to also test indirect cycles:
public class IndirectlyCyclicClass {
public ArrayClass arr;
}
public class ArrayClass {
public IndirectlyCyclicClass [] objs;
}
So, the IndirectlyCyclicClass
:s fields are not referring to the same type but some other type (ArrayClass
) refers back to it.
Circles in data
Those circles are to be found during the data serialization.
Reference circle tends to appear everytime when the type circle is detected. However, there are cases when type inspection algorithm won’t detect any type circles and there still will appear circles of references in the data.
public class Father {
public Object[] children;
}
Father father = new Father();
father.children = new Object[] { father };
Nasty, isn’t it? Even if this case may be useless you should imagine that undetectable-by-types situations do appear sometimes.
How to deal with that? During a call:
serializer.addObject(father);
the algorithm should work pretty similar in a concept as with type inspection. We have to remember all objects that were already serialized during this call. It’s a little sad because it can’t be super performant.
Objects don’t have IDs and there is no way to generate those since data serialization is about recreation while building type structure is about pure creation. So, checking whether we should either serialize an object or just refer to it in memory (because it was already serialized during the call) has to make a reference comparison to everything that was already serialized.
But that’s not too fast. How to optimize this?
- hash codes. Every Object in Java has built-in method getHashCode(). This method would produce a key to HashMaps which are quicker to look through when looking a certain object by it’s key (the hash code) since HashMap sorts those strings. It’s interesting but I can’t be sure about custom hashcode calculation algorithms - will all hashcodes be unique? In fact, that technique could be used but in situations of conflicts, we would have to look into a flat Object array. Another drawback is that the hashcode calculation may take too much time for smaller objects.
- omit cyclic check for fields which are impossible to contain cyclic references. This information could be cached during type inspection. I didn’t make a strong research through this idea but it feels like an option.
- JIT-like optimization. We could switch some optimization methods depending on serialization performance of certain types.
Summary
The cyclic nature of the problem is very simple in type inspection (first phase) but causes performance hits in data serialization (second phase). Both are simple to cover but the second one depends on a lot on data size (amount of references). The beneficial improvement would be to foresee situations when algorithm wouldn’t need the cycle detection in the second phase. However, if the inspection visualization is taken object by object - it shouldn’t be a problem.