Posted to dev@uima.apache.org by Erik Fäßler <ch...@gmx.net> on 2012/12/12 17:27:17 UTC

Flexibility of binary CAS serialization

Hi,

I am currently looking for a good approach to store a lot of CAS data. What I want to do is annotate a lot of text with basic annotations and save that. Then I can read the CAS objects with these basic annotations and don't have to compute them over and over, because they basically never change. However, "basic" does not necessarily mean that the computation is fast - that's why I want the storage.

Now I considered binary storage because it's fast and the resulting files are not very big compared to XMI serialization. But I have the requirement that I must be able to extend the type system (add features and types) without rendering the stored CAS objects useless.

I experimented with CASCompleteSerializer, which of course does not offer this flexibility (but I still wanted to see how it works). Then I hoped that, using CASSerializer, I would get the flexibility I want.

I serialize with

ByteArrayOutputStream baos = new ByteArrayOutputStream();
Serialization.serializeCAS(aJCas.getCas(), baos);

and I deserialize with

byte[] casData = ...
Serialization.deserializeCAS(aCAS, new ByteArrayInputStream(casData));

What DID work: when I add a feature to a serialized type, I can use the feature after deserialization (that was not possible with CASCompleteSerializer). But when I add a new type which was not part of the serialization, something odd happens: the AnalysisEngines seem to work fine. I can read annotations which had been serialized before, and I can add new ones and read them again, too.
However, when I want to store the final result as an XMI (I did this for usage with the annotationViewer), I get an error for the XMI serialization. The XMI serialization is done by

FileOutputStream out = new FileOutputStream(outFile);
XmiCasSerializer.serialize(aCas, out);
out.close();

which worked always fine. The error is

Caused by: java.lang.IndexOutOfBoundsException: Index: 59, Size: 52
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.uima.cas.impl.StringHeap.getStringForCode(StringHeap.java:150)
	at org.apache.uima.cas.impl.CASImpl.getStringForCode(CASImpl.java:2139)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFeatures(XmiCasSerializer.java:892)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:753)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
	at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1567)
	at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1638)
	at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1585)
	at de.julielab.jules.consumer.CasToXmiConsumer.writeXmi(CasToXmiConsumer.java:338)
	at de.julielab.jules.consumer.CasToXmiConsumer.processCas(CasToXmiConsumer.java:288)
	at org.apache.uima.analysis_engine.impl.compatibility.CasConsumerAdapter.process(CasConsumerAdapter.java:99)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:375)
	... 4 more

Is this behaviour expected or did I just miss something? I don't really need the XMI serialization in my use case but I'm not too confident in the whole storage procedure when such an error happens.

Thanks for any hints,

Erik

Re: Flexibility of binary CAS serialization

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.
Am 12.12.2012 um 17:27 schrieb Erik Fäßler <ch...@gmx.net>:

> I experimented with CASCompleteSerializer which of course does not offer this flexibility (but I still wanted to see like it works). Now I was hoping, when I used CASSerializer, I would perhaps get the flexibility I want.

The CASSerializer can only be used if the CAS metadata remains exactly the same. If you do not change the type system often, you could consider doing

- deserialize using the CASCompleteSerializer
- serialize as XMI
- deserialize XMI into a CAS with the new type system
- serialize using the CASCompleteSerializer

AFAIK, another downside of the binary serialization is that even annotations that have been removed from the indexes (deleted) are persisted. These would get dropped as part of the XMI detour.
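A rough sketch of that detour in code. This is untested and the names `stored` (the CASCompleteSerializer read back from storage) and `newTsd` (the extended type system description) are assumptions, not anything from your pipeline:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.CASCompleteSerializer;
import org.apache.uima.cas.impl.CASMgr;
import org.apache.uima.cas.impl.Serialization;
import org.apache.uima.cas.impl.XmiCasDeserializer;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

public class CasMigration {

    // `stored` and `newTsd` are assumed inputs; this is a sketch of the
    // four steps above, not a drop-in implementation.
    static CAS migrate(CASCompleteSerializer stored, TypeSystemDescription newTsd)
            throws Exception {
        // 1. Deserialize using the CASCompleteSerializer (this restores the
        //    old type system together with the data)
        CAS oldCas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
        Serialization.deserializeCASComplete(stored, (CASMgr) oldCas);

        // 2. Serialize as XMI (XMI is not tied to the stored CAS layout)
        ByteArrayOutputStream xmi = new ByteArrayOutputStream();
        XmiCasSerializer.serialize(oldCas, xmi);

        // 3. Deserialize the XMI into a CAS created with the new type system
        CAS newCas = CasCreationUtils.createCas(newTsd, null, null);
        XmiCasDeserializer.deserialize(new ByteArrayInputStream(xmi.toByteArray()), newCas);

        // 4. From here on, persist with the CASCompleteSerializer again,
        //    e.g. Serialization.serializeCASComplete((CASMgr) newCas)
        return newCas;
    }
}
```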

Cheers,

-- Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckart@ukp.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
------------------------------------------------------------------- 

Re: Flexibility of binary CAS serialization

Posted by Renaud Richardet <re...@gmail.com>.
Hi Erik,

I have been working on a UIMA module (using MongoDB) that could fit your needs:

When repeatedly processing the same documents with different UIMA
pipelines, some pipeline steps (e.g. preprocessing) are duplicated
among the runs. The MongoDb module makes it possible to persist
annotated documents, resume their processing, and add new annotations
to them. MongoDb is a high-performance NoSQL document database. In the
MongoDb module, every CAS is stored as a document, along with its
annotations. UIMA annotations and their features are explicitly mapped
to MongoDb fields, using a simple and declarative language. The
mappings are used during persistence and loading from the database.
The following UIMA components are available:
• MongoCollectionReader reads CASes from a MongoDb collection.
Optionally, one can specify a (filter) query, e.g.
– {my_db_field:{$exists:true}} for the existence of a field;
– {pmid: 17} to query a specific PubMed document;
– {pmid:{$in:[12,17]}} to query a list of PubMed documents;
– {pmid:{$gt: 8, $lt: 11}} for a range of documents.
• RegexMongoCollectionReader is similar to MongoCollectionReader but
allows specifying a query with a regular expression on a specific
field.
• MongoWriter persists new UIMA CASes into MongoDb documents.
• MongoUpdateWriter persists new annotations into an existing MongoDb
document.
• MongoCollectionRemover removes selected annotation types from a
MongoDb collection.
With the above components, it is possible within a single pipeline to
read an existing collection of annotated documents, perform some
further processing, add more annotations, and store these annotations
back into the same MongoDb documents. In terms of performance, the
MongoDb module has been tested with a corpus of PubMed abstracts
(approximately 22 million documents, throughput over 2000 docs/s) and
a corpus of several million full-text papers (throughput around 200
docs/s, bound by disk IO). It is also possible to scale MongoDb
horizontally in a cluster setup, or to use SSDs to improve
performance.

Let me know if you are interested. I plan to release the code soon.


-- 
Renaud Richardet
Blue Brain Project  PhD candidate
EPFL  Station 15
CH-1015 Lausanne
phone: +41-78-675-9501
http://people.epfl.ch/renaud.richardet

Re: Flexibility of binary CAS serialization

Posted by Erik Fäßler <er...@uni-jena.de>.
Thank you both for your hints,

Jörn, this exact topic came to my mind earlier. I want to have different "annotation stages" of the same artifacts, so some kind of delta storage would make a lot of sense. But I don't have time to write such a thing on my own (I currently don't see an easy way to do it; I want to preserve the basic annotation storage so I can experiment with the components doing the "higher" annotations). Is there anything usable out of the box regarding this topic?

Thanks!

	Erik

Am 12.12.2012 um 18:28 schrieb Jörn Kottmann <ko...@gmail.com>:

> On 12/12/2012 05:27 PM, Erik Fäßler wrote:
>> I am currently looking for a good approach to store a lot of CAS data. What I want to do is annotate a lot of text with basic annotations and save that. Then I can read the CAS objects with these basic annotations and don't have to compute them over and over, because they basically never change. However, "basic" does not necessarily mean that the computation is fast - that's why I want the storage.
> 
> In my experience it's sometimes better to define a custom format to store the data in a database and not use CAS serialization.
> 
> CAS serialization has some disadvantages. To read a piece of the data in a CAS it is necessary to load the entire CAS,
> but this might not be necessary for all operations which need to be performed, e.g. text indexing, calculating statistics, etc.
> To add new annotations to an existing CAS you need to re-write the entire CAS data instead of just adding a few bytes.
> 
> Jörn


Re: Flexibility of binary CAS serialization

Posted by Jörn Kottmann <ko...@gmail.com>.
On 12/12/2012 05:27 PM, Erik Fäßler wrote:
> I am currently looking for a good approach to store a lot of CAS data. What I want to do is annotate a lot of text with basic annotations and save that. Then I can read the CAS objects with these basic annotations and don't have to compute them over and over, because they basically never change. However, "basic" does not necessarily mean that the computation is fast - that's why I want the storage.

In my experience it's sometimes better to define a custom format to 
store the data in a database and not use CAS serialization.

CAS serialization has some disadvantages. To read a piece of the data in 
a CAS it is necessary to load the entire CAS,
but this might not be necessary for all operations which need to be 
performed, e.g. text indexing, calculating statistics, etc.
To add new annotations to an existing CAS you need to re-write the 
entire CAS data instead of just adding a few bytes.
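For illustration, here is a minimal in-memory sketch of such a custom layout: one row per annotation, grouped per document and per "stage", so that adding new annotations appends a small delta instead of rewriting everything, and reading can load only the stages needed. All names are made up for this sketch; nothing here is a UIMA API.

```java
import java.util.*;

// One stored annotation: type name plus begin/end offsets.
final class AnnotationRow {
    final String type;
    final int begin;
    final int end;
    AnnotationRow(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// docId -> stage -> annotations. Writing a new stage touches only that
// stage's entry; existing stages are never rewritten.
final class DeltaStore {
    private final Map<String, Map<String, List<AnnotationRow>>> store = new HashMap<>();

    void addStage(String docId, String stage, List<AnnotationRow> rows) {
        store.computeIfAbsent(docId, k -> new HashMap<>()).put(stage, rows);
    }

    // Load only the stages a pipeline actually needs, not the full CAS.
    List<AnnotationRow> load(String docId, String... stages) {
        List<AnnotationRow> out = new ArrayList<>();
        Map<String, List<AnnotationRow>> doc = store.getOrDefault(docId, Map.of());
        for (String s : stages) {
            out.addAll(doc.getOrDefault(s, List.of()));
        }
        return out;
    }
}
```

In a real system the map would be a database table keyed by (docId, stage), but the access pattern is the same: appending a stage is a small insert, and text indexing or statistics can read single stages without deserializing a whole CAS.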

Jörn