A more memory-efficient way of using EqualsFilter on Coherence 3.6+

This may look like a small finger exercise in juggling bytes, but it can be useful to a great many people, so I am posting it here.

There are two main ways of filtering your data in Coherence: with indexes or without them. If you do not use indexes, then you have to examine your entries one by one to evaluate the criteria encapsulated within your Filter instance. During this you typically extract some data from your cached key or value and then examine only the extracted value. Then you do it for the next entry, then for the next entry, then for the next... Depending on how you extract this information and how you examine it, you may or may not have a large memory footprint.

There are two typical ways of extracting this information.
  • Deserialize your entire object and call some methods to access the part of the cached value you are interested in. In this case you have to read through the entire binary representation, instantiate the entire Java object graph, and then call some methods.
  • Use PofExtractors. In this case you need to read through the binary representation only up to the point where your extracted object ends, and instantiate only the extracted object. Obviously, this is much more efficient than deserializing the entire object hierarchy if your extracted part is close to the beginning of the entire binary representation.
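The difference between the two approaches can be sketched in plain Java, without Coherence or POF. This is only an illustration under assumptions: the Order class, its fields and the hand-rolled DataOutputStream format are hypothetical stand-ins, chosen so that the interesting field sits at the beginning of the binary form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class PartialReadSketch {
    /** Hypothetical cached value: a small field up front, a large one behind it. */
    static final class Order {
        final String customer;
        final long[] lineItems;

        Order(String customer, long[] lineItems) {
            this.customer = customer;
            this.lineItems = lineItems;
        }
    }

    /** Writes the customer first, then the (potentially large) line items. */
    static byte[] serialize(Order order) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(order.customer);
            out.writeInt(order.lineItems.length);
            for (long item : order.lineItems) {
                out.writeLong(item);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Approach 1: rebuild the whole object graph, then read the field. */
    static String customerViaFullDeserialization(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String customer = in.readUTF();
            long[] items = new long[in.readInt()];
            for (int i = 0; i < items.length; i++) {
                items[i] = in.readLong();
            }
            return new Order(customer, items).customer;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Approach 2: stop reading as soon as the wanted field has been decoded. */
    static String customerViaPartialRead(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both methods return the same customer, but the second one never touches the line items and never allocates the long[].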

You can ask, how can we make this more efficient? Do we want to make this more efficient? Possibly...

Let's look at the case when your extracted information is not trivial but a sizeable object hierarchy on its own. In this case iterating through a big candidate set would still lead to a large number of objects being instantiated. Is this really necessary? Can it be avoided?

Let's take a detour, and look at how String.substring() worked in Java up to Java 6 (as of Java 7 update 6, substring() copies the characters instead).

String objects internally had the following attributes relevant for storing the string itself:

  • a char[] for storing the content of the String. However, the String did not necessarily start at index 0 and did not necessarily end where the array ends.
  • an int value, an array index which stored where the String actually starts within the char[]
  • another int value which told the length of the String

Since a String is immutable, its substring can be safely represented within the same char[], and therefore String.substring() returned another String instance which internally pointed to the same underlying char[] as the String instance you called substring() on. This is why a String did not necessarily start at the beginning of its char[]. Because of this, calling String.substring() was quite cheap, as it did not involve copying the char[].

I know, this seems totally out-of-place. Stay with me, it will become clear in a moment. Let's take another detour and look at how Coherence stores binary data and how POF-extraction works.

A Coherence Binary object stores the following information (actually its superclass ByteArrayReadBuffer stores it) about the Binary:

  • a byte[] containing the serialized data which the Binary object represents; the data does not necessarily take up the entire array, only a single contiguous section of it
  • an int value telling the index within the array where the data actually starts
  • an int value telling how many bytes the Binary object represents

A Binary is for all intents and purposes immutable, therefore it can employ the same trick as the String objects described above, and lo and behold, Binary.getReadBuffer(int, int) does exactly that.
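To make the array-sharing trick concrete, here is a minimal sketch of an immutable byte-range view. It is not Coherence's Binary: the ByteSlice class and all of its method names are hypothetical, and only the idea (array, offset, length) is taken from the description above.

```java
import java.util.Arrays;

/** Hypothetical immutable view over a section of a byte[] (not Coherence's Binary). */
final class ByteSlice {
    private final byte[] bytes;   // possibly shared with other slices
    private final int offset;     // where this slice starts within bytes
    private final int length;     // how many bytes this slice represents

    ByteSlice(byte[] bytes, int offset, int length) {
        if (offset < 0 || length < 0 || offset > bytes.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    /** Returns a sub-view; no bytes are copied, only two ints change. */
    ByteSlice slice(int of, int cb) {
        if (of < 0 || cb < 0 || of > length - cb) {
            throw new IndexOutOfBoundsException("of=" + of + ", cb=" + cb);
        }
        return new ByteSlice(bytes, offset + of, cb);
    }

    int length() {
        return length;
    }

    byte byteAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return bytes[offset + index];
    }

    /** True if both slices are backed by the very same array instance. */
    boolean sharesArrayWith(ByteSlice other) {
        return bytes == other.bytes;
    }

    /** Copies the represented section into a fresh, independent array. */
    byte[] toByteArray() {
        return Arrays.copyOfRange(bytes, offset, offset + length);
    }
}
```

Taking a slice costs one small object regardless of how many bytes it covers, which is what makes the approach attractive for large extracted values.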

OK, now how is this useful to us? Let's look at how POF extraction works in Coherence. Coherence stores cached data as Binary objects in its backing map. When you implement your filter as an implementation of the EntryFilter interface, the entry you need to evaluate can be cast to BinaryEntry, which gives you access to the Binary representation of the key and value that Coherence stores (at least in distributed caches, which is the most frequently used Coherence cache topology, and the one for which POF extraction is applicable).

PofExtractors work by accessing these Binary representations and parsing them into a PofValue object hierarchy which can lazily navigate the POF data structure within the Binary. Once they get to the requested point, they deserialize only the subset of the POF binary which the navigated-to PofValue represents. This creates fewer objects than full deserialization. Something like the code below:

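    public Object extractFromEntry(Entry entry) {
        BinaryEntry be = (BinaryEntry) entry;
        // Pick the serialized key or the serialized value, depending on the extraction target.
        Binary binary = m_nTarget == KEY ? be.getBinaryKey() : be.getBinaryValue();
        // Wrap the binary in a lazily navigable PofValue hierarchy.
        PofValue pofValue = PofValueParser.parsePofValue(binary, (PofContext) be.getSerializer());
        // Navigate to the requested child within the POF structure.
        PofValue childValue = m_navigator.navigate(pofValue);
        // Deserialize only the section of the binary which the child represents.
        return childValue.getValue();
    }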

Now what are these PofValue objects? In short, they are the POF representations of the objects in your hierarchy. And how does that look in the serialized form? Each represents a contiguous subsection of bytes within the POF-format binary representation of your top-level object. Sounds familiar? Yep, they store the section they represent as a Binary instance, and that Binary references the byte[] of the Binary within the BinaryEntry. It is even possible to get hold of it by casting the PofValue to AbstractPofValue and calling getSerializedValue() on it.

And why is it useful to us? Well... most classes are implemented in such a way that calling equals() on two objects is equivalent to comparing their binary representations. This is not necessarily true for all classes. Typical counterexamples are classes which contain hash-based data structures, where the iteration order during serialization is not necessarily the same for two equal instances unless it is somehow enforced. But for large object hierarchies where the POF serialization traversal order is deterministic, it is true.
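The caveat about iteration order can be demonstrated with plain Java. The sketch below is an assumption-laden stand-in: it uses a hand-rolled DataOutputStream format rather than POF, and LinkedHashMap rather than HashMap so that the differing iteration order is reproducible.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

class EqualsVersusBytesSketch {
    /** Serializes the map in whatever order its entry set iterates. */
    static byte[] serialize(Map<String, Integer> map) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(map.size());
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two LinkedHashMap instances filled with the same entries in a different order are equal according to equals(), yet serialize() produces different bytes for them. Copying both into a TreeMap before serializing is one way of enforcing the order, after which the bytes match again.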

And still, why is this useful? Because if all you want is to run an equality check (EqualsFilter) on your data, then instead of deserializing a large extracted object hierarchy for each evaluated entry, you can serialize the comparison value once and do a binary comparison, which becomes much cheaper after only a few entries.
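The "serialize once, compare bytes many times" idea could look roughly like the following sketch. It deliberately avoids the Coherence API: SerializedEqualsCheck is a hypothetical name, and the serializer function stands in for whatever produces the same binary form as the one stored in the cache.

```java
import java.util.function.Function;

/** Hypothetical equality check working on serialized bytes (not a Coherence filter). */
final class SerializedEqualsCheck<T> {
    private final byte[] expected;

    /** Serializes the comparison value exactly once, up front. */
    SerializedEqualsCheck(T comparisonValue, Function<? super T, byte[]> serializer) {
        this.expected = serializer.apply(comparisonValue);
    }

    /**
     * Compares the expected bytes with a section of an entry's binary form.
     * No object is deserialized and nothing is allocated per entry.
     */
    boolean matches(byte[] entryBytes, int offset, int length) {
        if (length != expected.length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (entryBytes[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The length check alone rejects many candidates before a single byte is compared.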

You can even leverage this with unsorted indexes, as the footprint of the index would be lower than if it held a large object hierarchy. One thing you need to take care of is that the Binary you extract this way still shares the byte[] with the Binary in the original entry, so it must be cloned before storing it in an index. Otherwise it would in certain cases (same value extracted from multiple entries, then the first entry it was extracted from is removed or changed to another value) cause a memory leak or index corruption, because the index would hold on to that byte[] possibly longer than Coherence would otherwise expect. The cost of a Binary object is 2 objects, and 36 bytes plus a packed int for the type id (2-3 bytes) plus the serialized size rounded up to the next number divisible by 8. If the extracted value consists of more than 3 objects, the Binary representation is likely to consume less memory.
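Why the cloning matters can be shown with a toy unsorted index. This is only a sketch under assumptions: BinaryIndexSketch is a hypothetical class, it uses ByteBuffer as a map key with content-based equals() and hashCode(), and it has nothing to do with how Coherence implements its indexes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical unsorted index from extracted bytes to the keys of matching entries. */
final class BinaryIndexSketch {
    private final Map<ByteBuffer, Set<String>> index = new HashMap<>();

    /** Indexes the given section of an entry's binary form under entryKey. */
    void add(String entryKey, byte[] entryBytes, int offset, int length) {
        // Copy the section: the index then neither pins the entry's (possibly
        // much larger) array nor sees later changes made to it.
        byte[] copy = Arrays.copyOfRange(entryBytes, offset, offset + length);
        index.computeIfAbsent(ByteBuffer.wrap(copy), k -> new HashSet<>()).add(entryKey);
    }

    /** Returns the keys of all entries whose extracted bytes equal the given bytes. */
    Set<String> lookup(byte[] serializedComparisonValue) {
        Set<String> keys = index.get(ByteBuffer.wrap(serializedComparisonValue));
        return keys == null ? new HashSet<>() : new HashSet<>(keys);
    }
}
```

Without the copy, the index key for a value shared by several entries would keep referencing the array of the first entry it was extracted from, even after that entry is gone or its array holds something else.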

Unfortunately, if you want sorted indexes, this is not likely to help, as it is quite problematic to write a cheap comparator for the binary value. Interestingly enough, for serialized Strings this may be possible, but I will leave thinking that through to you.
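One possible starting point for the String case is an unsigned, byte-by-byte comparison. The sketch below assumes that the compared bytes are standard UTF-8 character data only, with any type id or length prefix already skipped. Whether that holds for the POF form of a String is something to verify, and the class and constant names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

final class Utf8ByteOrderSketch {
    /**
     * Compares byte arrays lexicographically, treating each byte as unsigned.
     * For standard UTF-8 this yields Unicode code point order.
     */
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // One array is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length, b.length);
    };

    /** Helper for experiments: encodes a String as standard UTF-8. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Note that code point order is not always the same as the order of String.compareTo(), which compares UTF-16 code units, so the two can disagree for characters outside the Basic Multilingual Plane.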

And how do you achieve this? Very simple, in fact. Just write an IndexAwareExtractor implementation which does the extraction and optionally the cloning, and write a subclass of EqualsFilter which serializes the comparison value before it is used for the equality check. A sample implementation of these is attached to this page, although with no warranty that they work in all cases (they should, I just did not test them in all cases).

You may shave off a bit more of the querying thread's footprint by replacing PofValueParser with your own POF traversal code which homes in on the relevant point in the Binary, written as a PofHandler and used with PofParser, but the idea is the same. Another advantage of a PofParser/PofHandler-based implementation is that you can implement navigation into a Map with it (you may have to subclass and hack PofParser a bit, though).


Robert Varga,
10 Nov 2012, 11:47