Posted to dev@uima.apache.org by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de> on 2012/08/15 10:09:29 UTC

CAS serialization performance: XMI vs. Java serialization

Hi,

I am looking for a way to improve loading times in an application, so I did a little experiment with binary CAS serialization to see if it was superior to XMI serialization. For serialization I used the CASCompleteSerializer to serialize the type-system and heaps into the same file using Java object serialization - at least that is what I understood it should do. To read in these files, I would deserialize the CASCompleteSerializer and initialize a CAS from it using CASImpl.reinit().
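As a rough illustration of the write/read path described above (Java object serialization wrapped in gzip), here is a minimal sketch. The `Snapshot` class and the `toBytes`/`fromBytes` helpers are hypothetical stand-ins, not UIMA API; in the real setup the serialized object would be the CASCompleteSerializer, and the restored object would be passed to CASImpl.reinit(). The sketch writes to a byte array rather than a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipObjectRoundTrip {

    /** Hypothetical stand-in for a serializable CAS snapshot (type system plus heap). */
    public static class Snapshot implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String[] typeNames;
        public final int[] heap;

        public Snapshot(String[] typeNames, int[] heap) {
            this.typeNames = typeNames;
            this.heap = heap;
        }
    }

    /** Serializes the object with Java object serialization and gzip-compresses the result. */
    public static byte[] toBytes(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The streams are closed here, so the gzip trailer has been written.
        return bytes.toByteArray();
    }

    /** Reverses toBytes: gunzips the data and deserializes the object. */
    public static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Snapshot original = new Snapshot(new String[] {"Token", "Sentence"}, new int[] {0, 1, 0});
        Snapshot restored = (Snapshot) fromBytes(toBytes(original));
        System.out.println(Arrays.equals(original.heap, restored.heap));
    }
}
```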

96.400 files

format                         : time          size
plain text (uncompressed)      :                 581.865.593 Byte
binary (serialized java, gzip) : 0:47:02.835   3.555.449.597 Byte
xmi (gzip)                     : 1:20:31.535   4.712.633.769 Byte

So binary serialization takes about 60% of the time that XMI serialization needs and uses about 75% of the space.
I haven't run a reading experiment yet, but I suppose the improvement should be on a similar level, if not better.
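For reference, the two percentages can be recomputed from the figures in the table. The sketch below is only arithmetic on the numbers reported above; the class and method names are made up for the example. With these inputs the time ratio comes out at roughly 58% and the size ratio at roughly 75%.

```java
public class SerializationRatios {

    /** Parses a duration of the form h:mm:ss.SSS into seconds. */
    public static double seconds(String hms) {
        String[] parts = hms.split(":");
        return Integer.parseInt(parts[0]) * 3600
                + Integer.parseInt(parts[1]) * 60
                + Double.parseDouble(parts[2]);
    }

    /** Returns part/whole as a percentage, rounded to the nearest integer. */
    public static long percent(double part, double whole) {
        return Math.round(100.0 * part / whole);
    }

    public static void main(String[] args) {
        // Figures copied from the table above (binary vs. XMI, both gzip-compressed).
        long timePercent = percent(seconds("0:47:02.835"), seconds("1:20:31.535"));
        long sizePercent = percent(3_555_449_597L, 4_712_633_769L);
        System.out.println("time: " + timePercent + "%, size: " + sizePercent + "%");
    }
}
```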

I am also not sure yet about the drawbacks of binary serialization and in which scenarios they apply. The drawbacks I have seen so far are:

- The type system is stored redundantly in every output file.
- The type system configured with CASImpl.reinit() may be different from the one which was used to initialize the pipeline, so CAS-based annotators relying on typeSystemInit() may not be configured with the correct types. This is a hypothesis I didn't test.
- Serialized Java objects may become unreadable due to refactoring within the UIMA framework. However, there is yet another binary CAS serialization in UIMA which uses a DataOutputStream and may be more stable.

Did anybody ever use any form of binary CAS serialization outside Vinci/UIMA-AS?

Cheers,

-- Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckart@ukp.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Marshall Schor <ms...@schor.com>.

On 8/17/2012 6:33 PM, Richard Eckart de Castilho wrote:
> Thanks for the pointer, Marshall. Given, though, that the whole process ran for about
> 30 minutes and the setup was comparatively simple, the JIT effect should hardly be
> noticeable. Would you agree?
yep.

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.

Thanks for the pointer, Marshall. Given, though, that the whole process ran for about
30 minutes and the setup was comparatively simple, the JIT effect should hardly be
noticeable. Would you agree?

In any case, the measurement is not meant to be exact, but rather to give a better idea of the
performance improvement of binary serialization over XMI. At least I am pretty
convinced now that I should switch from XMI to binary persistence in some scenarios.

-- Richard 

On 18.08.2012 at 00:02, Marshall Schor wrote:

> One other thing I've noticed is important - because of Java's JIT, you need to
> "warm up" things before doing measurements.  Most commonly, people run the
> thing-being-measured multiple times, in a loop, and see a speedup - until
> there's no more speedup.
> 
> -Marshall

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Marshall Schor <ms...@schor.com>.

One other thing I've noticed is important - because of Java's JIT, you need to
"warm up" things before doing measurements.  Most commonly, people run the
thing-being-measured multiple times, in a loop, and see a speedup - until
there's no more speedup.
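A minimal sketch of the warm-up idea, with the simplifying assumption that a fixed number of unmeasured runs is enough (rather than looping until the speedup levels off). The class and method names are hypothetical, and a dedicated benchmarking harness would handle many more pitfalls than this does.

```java
public class WarmupBenchmark {

    /**
     * Runs the task warmupRuns times without timing it, so the JIT has a chance
     * to compile the hot paths, then returns the wall-clock nanoseconds of each
     * of the following measuredRuns executions.
     */
    public static long[] timeAfterWarmup(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Placeholder workload; in the experiment this would be one serialization pass.
        Runnable workload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        for (long n : timeAfterWarmup(workload, 20, 5)) {
            System.out.println(n + " ns");
        }
    }
}
```

Printing the individual timings rather than only an average makes it easier to see whether the runs have actually stabilized.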

-Marshall

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.

Small update in case anybody is interested. I ran the experiment again, this time writing to a ByteArrayOutputStream (initialized with a 512 KB buffer). So it now measures encoding time only: no I/O, no gzip.

bin: 0:04:17.699   11.266.341.029 Byte
xmi: 0:24:40.485   23.961.447.013 Byte

That is closer to the difference I expected. Still no results for reading, though.
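The in-memory measurement described above could be set up roughly as follows. This is a sketch under assumptions: the payload is an arbitrary Serializable stand-in rather than a real CAS, and `encode` and `measure` are hypothetical helper names. Only the 512 KB initial buffer size is taken from the message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class EncodeOnlyTiming {

    /** Encodes the payload into memory only: no file I/O and no compression. */
    public static byte[] encode(Serializable payload) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(512 * 1024);
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns {elapsed nanoseconds, total encoded bytes} for the given number of repetitions. */
    public static long[] measure(Serializable payload, int repetitions) {
        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) {
            totalBytes += encode(payload).length;
        }
        long elapsed = System.nanoTime() - start;
        return new long[] {elapsed, totalBytes};
    }

    public static void main(String[] args) {
        int[] standInHeap = new int[100_000]; // stand-in data, not a real CAS heap
        long[] result = measure(standInHeap, 100);
        System.out.println(result[0] + " ns for " + result[1] + " bytes");
    }
}
```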

Cheers,

-- Richard

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Marshall Schor <ms...@schor.com>.

As a side comment, in previous benchmarking I've done on other systems, I've
found that using memory-mapped I/O (part of Java NIO) can make a big difference.

Also, when we put in gzip we expected it to speed things up, but it actually
slowed things down quite a bit.
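For readers unfamiliar with memory-mapped I/O in Java NIO, a minimal read sketch follows. The class and helper names are hypothetical, and whether mapping actually helps depends on file sizes and access patterns; nothing here was measured. Note that a single mapping is limited to about 2 GB.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedRead {

    /** Maps the whole file read-only into memory and copies its content into a byte array. */
    public static byte[] readMapped(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] data = new byte[mapped.remaining()];
            mapped.get(data);
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the example: writes the bytes to a fresh temporary file. */
    public static Path writeTemp(byte[] data) {
        try {
            Path file = Files.createTempFile("mapped-read", ".bin");
            Files.write(file, data);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTemp(new byte[] {1, 2, 3, 4, 5});
        System.out.println(readMapped(file).length + " bytes read via mapping");
    }
}
```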

-Marshall


Re: CAS serialization performance: XMI vs. Java serialization

Posted by Thilo Goetz <tw...@gmx.de>.

On 15/08/12 11:09, Richard Eckart de Castilho wrote:
> On 15.08.2012 at 11:00, Thilo Goetz wrote:
> 
>> However, as I recall, there was a way you could serialize the CAS
>> without the type system if you were sure you didn't need it.  Isn't that
>> the difference between the CASCompleteSerializer and the
>> NotSoCompleteSerializer (making that up here)?  On the way back, you can
>> deserialize into an existing CAS that has the right type system.
> 
> I tried the CASCompleteSerializer (in contrast to the CASSerializer) because I am not sure what
> "the right type system" means. AFAIK, when the type system is configured, types internally get assigned
> numeric IDs which are then used in the heap. I wasn't sure whether these could change between JVM
> runs, even though the type system is technically the same.

If you serialize many CASes from the same UIMA pipeline, you only need
to serialize the type system once.  However, you do need to have a
serialized binary version of that type system.  The assignment of codes
to types and features is not deterministic and may vary between JVMs.

> 
>> Your times above, do they include time needed to do the compression?
>> I'm surprised binary serialization is not even twice as fast.  Or is
>> this gated by the disk I/O?
> 
> It currently includes gzip compression and is limited by disk I/O, since that's the scenario I am faced with.
> Out of curiosity, I was planning to run the same test writing to a ByteArrayOutputStream to see how much time
> the actual encoding takes. I was also surprised that it wasn't faster and in particular that the file size
> wasn't much smaller.

The XMI compresses really well because it's mostly air ;-)  The binary
serialization is actually pretty wasteful, particularly for small CASes.
This is because all data types other than strings are encoded as
integers and always take up 32 bits.  I don't know how well compression
handles that kind of scenario.  I also don't know how strings are
serialized in the binary serialization.  Is there a conversion to UTF-8?
If not, it gets serialized as UTF-16, which also is a huge waste for
English text.  So I'm not so surprised by the file sizes.  But I would
have expected a bigger time difference.
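The size argument above can be made concrete with a small sketch. It only uses standard Java charset encoders and does not inspect what the UIMA binary format actually does with strings; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes as UTF-16 (big-endian variant, so no byte-order mark is added). */
    public static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Bytes needed when every value is stored as a fixed 32-bit integer, however small it is. */
    public static int fixedWidthBytes(int valueCount) {
        return valueCount * Integer.BYTES;
    }

    public static void main(String[] args) {
        String text = "binary CAS";
        System.out.println("UTF-8:  " + utf8Bytes(text) + " bytes");
        System.out.println("UTF-16: " + utf16Bytes(text) + " bytes");
        System.out.println("3 values as 32-bit ints: " + fixedWidthBytes(3) + " bytes");
    }
}
```

For plain ASCII text, UTF-16 needs two bytes per character where UTF-8 needs one, and three boolean or byte-sized values stored as 32-bit integers take 12 bytes instead of 3.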

--Thilo

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.

On 15.08.2012 at 11:00, Thilo Goetz wrote:

> However, as I recall, there was a way you could serialize the CAS
> without the type system if you were sure you didn't need it.  Isn't that
> the difference between the CASCompleteSerializer and the
> NotSoCompleteSerializer (making that up here)?  On the way back, you can
> deserialize into an existing CAS that has the right type system.

I tried the CASCompleteSerializer (in contrast to the CASSerializer) because I am not sure what
"the right type system" means. AFAIK, when the type system is configured, types internally get assigned
numeric IDs which are then used in the heap. I wasn't sure whether these could change between JVM
runs, even though the type system is technically the same.

> Your times above, do they include time needed to do the compression?
> I'm surprised binary serialization is not even twice as fast.  Or is
> this gated by the disk I/O?

It currently includes gzip compression and is limited by disk I/O, since that's the scenario I am faced with.
Out of curiosity, I was planning to run the same test writing to a ByteArrayOutputStream to see how much time
the actual encoding takes. I was also surprised that it wasn't faster and in particular that the file size
wasn't much smaller.

-- Richard

Re: CAS serialization performance: XMI vs. Java serialization

Posted by Thilo Goetz <tw...@gmx.de>.

On 15/08/12 10:09, Richard Eckart de Castilho wrote:
> Hi,
> 
> I am looking for a way to improve loading times in an application, so I did a little experiment with binary CAS serialization to see if it was superior to XMI serialization. For serialization I used the CASCompleteSerializer to serialize the type-system and heaps into the same file using Java object serialization - at least that is what I understood it should do. To read in these files, I would deserialize the CASCompleteSerializer and initialize a CAS from it using CASImpl.reinit().
> 
> 96.400 files
> 
> plain text (uncompressed)      :                 581.865.593 Byte
> binary (serialized java, gzip) : 0:47:02.835   3.555.449.597 Byte 
> xmi (gzip)                     : 1:20:31.535   4.712.633.769 Byte
> 
> So binary serialization takes about 60% of the time that XMI serialization needs and uses about 75% of the space.
> I haven't run a reading experiment yet, but I suppose the improvement should be on a similar level, if not better.
> 
> I am also not sure yet about the drawbacks of binary serialization and in which scenarios they apply. The drawbacks I have seen so far are:
> 
> - The type system is stored redundantly in every output file.
> - The type system configured with CASImpl.reinit() may be different from the one which was used to initialize the pipeline, so CAS-based annotators relying on typeSystemInit() may not be configured with the correct types. This is a hypothesis I didn't test.
> - Serialized Java objects may become unreadable due to refactoring within the UIMA framework. However, there is yet another binary CAS serialization in UIMA which uses a DataOutputStream and may be more stable.
> 
> Did anybody ever use any form of binary CAS serialization outside Vinci/UIMA-AS?

Not sure this will help, but we originally implemented the binary
serialization to pass the CAS between Java and C++ (in-process).  Used
that way it's blindingly fast because Java and C++ use the same heap
layout for the CAS data.  We have also used it for communication via
SOAP, but I'm not sure I would do that today.  I might prefer a format
that's at least somewhat human readable.  I do not recall ever using the
binary format for serialization to disk, just because I never had that
use case.

The thing with the binary serialization is that type information is
encoded in a binary format as well.  So you need to be sure that when
you read it back in, every type and feature gets assigned the same code,
otherwise the heap is garbage.  That's why you need to be sure to use
the correct, encoded type system as well.
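A toy illustration of why the codes have to match. This is not the UIMA heap layout: the "heap" here is just an array of type codes and the type tables are plain lists, all hypothetical. It only shows that the same integers decode to different types when the reader's code assignment differs from the writer's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypeCodeMismatch {

    /** Resolves each type code in the heap against the given code-to-name table. */
    public static List<String> decode(int[] heap, List<String> typeTable) {
        List<String> names = new ArrayList<>();
        for (int code : heap) {
            names.add(typeTable.get(code));
        }
        return names;
    }

    public static void main(String[] args) {
        // Code assignment in the JVM that wrote the heap.
        List<String> writerTable = Arrays.asList("Token", "Sentence", "Lemma");
        // Same types, but assigned different codes in the JVM that reads it.
        List<String> readerTable = Arrays.asList("Sentence", "Lemma", "Token");

        int[] heap = {0, 1, 0, 2};
        System.out.println("writer view: " + decode(heap, writerTable));
        System.out.println("reader view: " + decode(heap, readerTable));
    }
}
```

Storing the writer's table alongside the heap, or deserializing into a CAS that already carries exactly that table, avoids the mismatch.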

However, as I recall, there was a way you could serialize the CAS
without the type system if you were sure you didn't need it.  Isn't that
the difference between the CASCompleteSerializer and the
NotSoCompleteSerializer (making that up here)?  On the way back, you can
deserialize into an existing CAS that has the right type system.

Your times above, do they include time needed to do the compression?
I'm surprised binary serialization is not even twice as fast.  Or is
this gated by the disk I/O?

--Thilo
