Posted to user@uima.apache.org by Greg Holmberg <ho...@comcast.net> on 2010/01/20 23:11:16 UTC

XMI parsing?


Hi UIMA users! 



I'm looking for advice on how to transmit data from a CAS to a non-UIMA recipient.



I'd like to send data from a CAS over the network to a repository.  I can write any Java code I want to run in the repository server to receive the data and insert it into the repository indexes.  And no, the repository is not a SQL database, and there is no JDBC driver for it. 



I'm thinking the easiest data format to transmit from the CAS would be XMI.  I can just use the UIMA serialization methods to produce an XMI XML String, and then send that as a payload over whatever transport or envelope I want (RMI, HTTP, FTP, JSON, SOAP, whatever).



But then how would the repository server parse the XMI XML that it receives?  Obviously, I could just use the UIMA de-serialization to re-constitute the CAS, but that's a lot of overhead (time and memory) considering I don't actually need to run UIMA in the repository. I just want to get the data values from the XMI and insert some records/objects in the repository index.



Can I parse the XMI XML from UIMA without using UIMA? 



For example, is there an XSD file for XMI?  Or at least, for the UIMA "flavor" of XMI?  If so, I could feed the XSD file to JAXB to generate equivalent Java classes, then JAXB would parse and validate the XMI, producing Java objects.



I suppose I could also parse the XMI with the XML StAX parser built into Java 6, and just bypass the creation of Java objects (directly inserting into the repository).  More work, but might use less memory and perform better. 
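
To make the StAX option concrete, here is a minimal sketch that pulls begin/end offsets out of an XMI document using only the JDK's javax.xml.stream API, with no UIMA classes on the classpath. The SAMPLE string is hand-written to approximate the shape of UIMA's XMI output and is an assumption, not real serializer output; the class name XmiStaxSketch and the method extractSpans are hypothetical. A real receiver would probably also filter on element namespace and name instead of accepting every element that has both attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmiStaxSketch {
    // Hand-written sample that only approximates UIMA XMI (assumption).
    static final String SAMPLE =
        "<xmi:XMI xmlns:xmi=\"http://www.omg.org/XMI\""
      + " xmlns:cas=\"http:///uima/cas.ecore\""
      + " xmlns:tcas=\"http:///uima/tcas.ecore\" xmi:version=\"2.0\">"
      + "<cas:NULL xmi:id=\"0\"/>"
      + "<cas:Sofa xmi:id=\"1\" sofaNum=\"1\" sofaID=\"_InitialView\""
      + " mimeType=\"text\" sofaString=\"Hello UIMA\"/>"
      + "<tcas:DocumentAnnotation xmi:id=\"8\" sofa=\"1\""
      + " begin=\"0\" end=\"10\" language=\"en\"/>"
      + "<cas:View sofa=\"1\" members=\"8\"/>"
      + "</xmi:XMI>";

    // Streams through the document once and returns "Type:begin-end" for
    // every element that carries both a begin and an end attribute.
    static List<String> extractSpans(String xmi) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Payloads arrive over the network, so do not process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xmi));
        List<String> spans = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // A null namespace means "match on local name only".
                    String begin = reader.getAttributeValue(null, "begin");
                    String end = reader.getAttributeValue(null, "end");
                    if (begin != null && end != null) {
                        spans.add(reader.getLocalName() + ":" + begin + "-" + end);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return spans;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(extractSpans(SAMPLE));
    }
}
```

In a real receiver the loop body would write straight into the repository index instead of collecting strings, so no intermediate Java objects need to be kept around.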



Or, instead of XMI, I could walk the CAS myself, and invent some data format (JSON? SOAP? RMI?) to send to the repository.  This could be binary to lessen the data on the network and ease the unmarshalling on the other end.  Performance and network bandwidth are an issue for me, since this has to scale (there will be many clients sending CAS data to the repository). 
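
As a rough illustration of the "invent a binary format" option, the sketch below writes a record count followed by (type name, begin, end) triples with java.io.DataOutputStream, and reads them back with DataInputStream, handing each record to a callback without building per-record objects. The layout, the class name BinaryRecordSketch, and the Sink interface are all hypothetical; in the real client the arrays would be filled by walking the CAS, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class BinaryRecordSketch {
    /** Hypothetical callback standing in for the repository index. */
    interface Sink {
        void accept(String type, int begin, int end);
    }

    // Layout (assumption): int count, then per record: UTF type, int begin, int end.
    static byte[] encode(String[] types, int[] begins, int[] ends) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(types.length);
        for (int i = 0; i < types.length; i++) {
            out.writeUTF(types[i]);
            out.writeInt(begins[i]);
            out.writeInt(ends[i]);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads the payload sequentially and forwards each record to the sink.
    // Returns the number of records read.
    static int decode(byte[] payload, Sink sink) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String type = in.readUTF();
            int begin = in.readInt();
            int end = in.readInt();
            sink.accept(type, begin, end);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = encode(new String[] {"Token", "Sentence"},
                                new int[] {0, 0}, new int[] {5, 10});
        decode(payload, new Sink() {
            public void accept(String type, int begin, int end) {
                System.out.println(type + " " + begin + " " + end);
            }
        });
    }
}
```

A format like this is compact and cheap to read, but it carries no schema, so the sender and the repository have to agree on the layout and version it by hand.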



I seem to remember that the serialization of the CAS between Java and C++ uses a fast binary format.  Would that be a possibility here?  Could I read that without re-constituting the CAS in the repository? 



What are your thoughts on these options? 



Thanks, 





Greg Holmberg 


Re: XMI parsing?

Posted by Greg Holmberg <ho...@comcast.net>.

Chris-- 



I think that's definitely an option.  RMI performance is excellent (bandwidth, CPU, memory), and the cost of development is low. 



It's just too bad that the JCas doesn't implement "Serializable"!  I would either have to copy all the data from the CAS into instances of classes that do implement Serializable, or write something similar to writeObject() methods for all the basic data types in the CAS and then traverse the CAS, calling those methods.  The second is more work, but would perform better. 
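
The first approach, copying CAS data into classes that do implement Serializable, could look roughly like the sketch below. AnnotationRecord and SerializableCopySketch are hypothetical names, and the three fields are only an assumed minimal subset of what an annotation carries. The byte-array round trip stands in for what RMI does when it marshals an argument; the code that walks the CAS to create the records is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class SerializableCopySketch {
    /** Hypothetical transfer object holding values copied out of the CAS. */
    static class AnnotationRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        final String type;
        final int begin;
        final int end;

        AnnotationRecord(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Serializes the records with standard Java object serialization.
    static byte[] toBytes(List<AnnotationRecord> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        // Copy into an ArrayList so the container itself is Serializable.
        out.writeObject(new ArrayList<AnnotationRecord>(records));
        out.close();
        return bytes.toByteArray();
    }

    // Receiver side: this is where all the objects get instantiated again.
    @SuppressWarnings("unchecked")
    static List<AnnotationRecord> fromBytes(byte[] payload)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload));
        try {
            return (List<AnnotationRecord>) in.readObject();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        List<AnnotationRecord> sent = new ArrayList<AnnotationRecord>();
        sent.add(new AnnotationRecord("Token", 0, 5));
        List<AnnotationRecord> received = fromBytes(toBytes(sent));
        System.out.println(received.get(0).type + " "
                + received.get(0).begin + " " + received.get(0).end);
    }
}
```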



And then of course, I would have to instantiate all those objects in the receiver (i.e. the repository server), before I can insert their values into the index. 



What I really need is something like XML, but faster to generate and parse, and smaller to use less network bandwidth.  The RMI data format is close, it's just not easy to deal with if you don't actually want to reproduce the objects.  If only there was a binary equivalent of XML... 



Greg 


----- Original Message ----- 
From: "Chris Roeder" <ch...@ucdenver.edu> 
To: uima-user@incubator.apache.org 
Sent: Wednesday, January 20, 2010 3:38:11 PM GMT -08:00 US/Canada Pacific 
Subject: Re: XMI parsing? 

If it's Java on the repository side, creating Java objects of your 
choice on the UIMA side and then sending them over RMI is an option. 

-Chris 



Re: XMI parsing?

Posted by Chris Roeder <ch...@ucdenver.edu>.

If it's Java on the repository side, creating Java objects of your 
choice on the UIMA side and then sending them over RMI is an option.

-Chris
