
Posted to dev@uima.apache.org by "Greg Holmberg (JIRA)" <de...@uima.apache.org> on 2011/07/14 22:32:59 UTC

[jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

    [ https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516 ] 

Greg Holmberg commented on UIMA-2128:
-------------------------------------

I'm not sure where in the code it should be implemented (which seems to be Jörn's point), but another technology option is EXI, a binary encoding for XML.  See http://www.w3.org/XML/EXI 

I've experimented with this to encode XMI between a UIMA process and a database on a separate machine (both ends Java).  I used a commercial implementation, Efficient XML from AgileDelta, and did some throughput measurements, comparing it to gzipped XMI.  I found that EXI produces somewhat smaller data than gzipped XML text (maybe 10% or 20% smaller, if I remember correctly).  The biggest benefit of EXI was the reduction in CPU time required to read and write.  It was quite a bit faster than gzip at both generating and parsing the XML.  This is probably because it reads and writes directly through the ContentHandler, whereas with gzipped text, you first have to write the text and then compress it.  Also, I suppose it's just more efficient to parse a binary format than to step through characters looking for certain tokens.

In my case, the improved throughput and reduced CPU usage were most important on the receiving end (i.e. the database), since it is the central bottleneck in my system.  As the number of UIMA senders increases, the database reaches its limit for handling more messages (no more CPU or NIC capacity on that machine).  So it was important to me to make XMI parsing on that machine as efficient as possible in order to minimize my hardware costs.

In my case, all annotators are local, but a similar bottleneck could arise in UIMA AS if you have a remote annotator (service).  Then it becomes important to make that UIMA processor as efficient as possible in both CPU and network bandwidth usage.  Gzip will help a lot compared to plain text, but EXI is even better, especially at reducing the CPU cost of generating and parsing XML, and to a lesser degree at reducing network bandwidth.

Some open-source implementations of EXI are listed here: http://en.wikipedia.org/wiki/Efficient_XML_Interchange

I also tried the open-source Java implementation, EXIficient.  In early 2010 it implemented the standard correctly, but it was immature, slow (really, really slow!), and used a lot of memory.  That was more than a year ago, though, so it may have improved since.  I talked to the developers (from Siemens) about their use of the GPL license, and they were not interested in changing to an Apache-compatible license, so that may be an issue for use in UIMA.

I have not tried the other open-source Java implementation, OpenEXI.  It does use the Apache license, though.  There's some discussion of gzipped XML text versus EXI here: http://openexi.sourceforge.net/#whynotgzip 

There's also an open-source C implementation, called EXIP.  I don't know anything about it.


> Support to for gzipped XMI files
> --------------------------------
>
>                 Key: UIMA-2128
>                 URL: https://issues.apache.org/jira/browse/UIMA-2128
>             Project: UIMA
>          Issue Type: Wish
>          Components: CasEditor
>            Reporter: Richard Eckart de Castilho
>
> Since XMI files tend to grow rather rapidly, it would be great if the CAS Editor supported reading and writing gzipped XMI files (.xmi.gz).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

Posted by Jörn Kottmann <ko...@gmail.com>.

Isn't gzip infamous for being slow? HBase is using LZO for this reason.
Apache Thrift is also using something else.

I can recommend using UIMA-AS and letting it process data stored in HBase.
We run a couple of UIMA-AS services on the same machines that host Hadoop
and got good results. Anyway, since we are using OpenNLP, the bottleneck
we hit is CPU power.

The first optimization I would do is to get data locality with UIMA-AS;
then the network bottleneck vanishes.

What kind of analysis do you run? Is it also CPU intensive?

Jörn


Re: [jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/15/11 5:45 PM, Richard Eckart de Castilho wrote:
> I don't see why a pluggable zip should be necessary. Java supports ZIP (JAR) out of the box using the classes in java.util.zip. If the type system is not persisted together with the XMI, then a GZIP (Java Native) or BZIP2 (comes with Apache Ant) would be ok as well. Given that a reader cannot change the type system of a CAS, carrying a serialized type system with each XMI is questionable.

Performance of the compression algorithm is very important for the group
of people who need compression to more efficiently exchange CASes.

If you store a huge number of CASes in some kind of database, you can simply
use the compression of the DB instead of compressing the CASes yourself.
We do that, for example, with HBase and LZO.

Jörn

Re: [jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

Posted by Richard Eckart de Castilho <ec...@tk.informatik.tu-darmstadt.de>.

I don't see why a pluggable zip should be necessary. Java supports ZIP (JAR) out of the box using the classes in java.util.zip. If the type system is not persisted together with the XMI, then a GZIP (Java Native) or BZIP2 (comes with Apache Ant) would be ok as well. Given that a reader cannot change the type system of a CAS, carrying a serialized type system with each XMI is questionable.
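
A minimal sketch of the java.util.zip route, using only JDK classes. The class name GzipXmiSketch and its methods are hypothetical, and the XML string stands in for serialized XMI; no UIMA API is used or implied here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip round trip for XML text held in memory.
final class GzipXmiSketch {

    // Compresses the UTF-8 bytes of the given XML text with gzip.
    static byte[] compress(String xml) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer.
        try (OutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip data and decodes it as UTF-8 text.
    static String decompress(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For .xmi.gz files, the same two stream classes could presumably wrap a FileOutputStream or FileInputStream in place of the in-memory buffers.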

Cheers,

Richard

Am 15.07.2011 um 04:32 schrieb Marshall Schor:

> there seem to be lots of zip implementations - another one for instance is
> 7-zip.  I haven't studied this issue enough to have a real opinion, but if zips
> are implemented, I wonder if it would be good to have some kind of a pluggable
> mechanism to allow for different zips for different circumstances.
> 
> -Marshall
> 

Richard Eckart de Castilho

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckartde@tk.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------

Re: [jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

Posted by Marshall Schor <ms...@schor.com>.

There seem to be lots of zip implementations; another one, for instance, is
7-zip.  I haven't studied this issue enough to have a real opinion, but if zips
are implemented, I wonder if it would be good to have some kind of pluggable
mechanism to allow for different zips for different circumstances.
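
One way such a pluggable mechanism might look, as a rough sketch only. The names XmiCodec and XmiCodecRegistry, and the idea of selecting a codec by file suffix, are assumptions made for illustration and are not part of UIMA. Only JDK classes are used; other codecs (LZO, 7-zip, ...) would need their own libraries and are not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical plug-in point: one codec per compression scheme.
interface XmiCodec {
    OutputStream wrap(OutputStream raw) throws IOException;
    InputStream unwrap(InputStream raw) throws IOException;
}

// Hypothetical registry that picks a codec from the file name suffix.
final class XmiCodecRegistry {
    // Pass-through codec used when no suffix matches (uncompressed XMI).
    static final XmiCodec PLAIN = new XmiCodec() {
        public OutputStream wrap(OutputStream raw) { return raw; }
        public InputStream unwrap(InputStream raw) { return raw; }
    };

    private final Map<String, XmiCodec> bySuffix = new LinkedHashMap<>();

    XmiCodecRegistry() {
        register(".gz", new XmiCodec() {
            public OutputStream wrap(OutputStream raw) throws IOException {
                return new GZIPOutputStream(raw);
            }
            public InputStream unwrap(InputStream raw) throws IOException {
                return new GZIPInputStream(raw);
            }
        });
        // The ".deflate" suffix is an arbitrary choice for this sketch.
        register(".deflate", new XmiCodec() {
            public OutputStream wrap(OutputStream raw) { return new DeflaterOutputStream(raw); }
            public InputStream unwrap(InputStream raw) { return new InflaterInputStream(raw); }
        });
    }

    // Additional codecs can be plugged in here.
    void register(String suffix, XmiCodec codec) {
        bySuffix.put(suffix, codec);
    }

    XmiCodec forFileName(String fileName) {
        for (Map.Entry<String, XmiCodec> entry : bySuffix.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return PLAIN;
    }

    // Encodes data with the codec selected by the file name.
    byte[] encode(String fileName, byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = forFileName(fileName).wrap(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decodes data with the codec selected by the file name.
    byte[] decode(String fileName, byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = forFileName(fileName).unwrap(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A caller would ask the registry for "doc.xmi.gz" and get the gzip codec, while "doc.xmi" falls through to the pass-through codec, so readers and writers would not need to know which compression is in use.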

-Marshall
