Posted to user@uima.apache.org by Samudra Banerjee <sa...@gmail.com> on 2014/02/02 02:22:02 UTC
Serializing multiple JCas objects to a single file
Hi Experts,
I have a scenario where processing a Wikipedia XML dump generates a huge
number of JCas objects (~1 million), one per page. I want to serialize
these JCas objects for later use, but writing 1 million separate files
would put a heavy load on the file system. So I was wondering whether
there is a way to serialize multiple JCas objects to a single file for
later retrieval. Any idea if this can be achieved?
Thanks and Regards,
Samudra
--
*Samudra Banerjee*
First Year Graduate Student
Department of Computer Science
State University of New York
Stony Brook, NY 11790
631-496-6939
Re: Serializing multiple JCas objects to a single file
Posted by Samudra Banerjee <sa...@gmail.com>.
Thanks Massimo. Great idea indeed. However, in my current scenario I
cannot use a database, so for now I will stick with the idea of storing
them in a zip file :)
On 2/2/2014 1:48 AM, Massimo Nicosia wrote:
> An alternative could be serializing the XML content into a String and
> saving it in a database or a fast key-value store.
>
> I have used code like this:
>
> ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
> XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
> ser.serialize(cas.getCas(), new XMLSerializer(out, false).getContentHandler());
> out.close();
> String xmlContent = out.toString("UTF-8");
>
> Best,
> Massimo
Re: Serializing multiple JCas objects to a single file
Posted by Massimo Nicosia <m....@gmail.com>.
An alternative could be serializing the XML content into a String and
saving it in a database or a fast key-value store.
I have used code like this:
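// Assumes `cas` is a JCas.
import java.io.ByteArrayOutputStream;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.util.XMLSerializer;

ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
// false = no pretty-printing; the XMI is written to `out` as XML.
ser.serialize(cas.getCas(), new XMLSerializer(out, false).getContentHandler());
out.close();
// Decode explicitly as UTF-8 (XMLSerializer's default output encoding,
// unless changed) rather than relying on the platform default charset.
String xmlContent = out.toString("UTF-8");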
Best,
Massimo
Re: Serializing multiple JCas objects to a single file
Posted by Samudra Banerjee <sa...@gmail.com>.
Thanks Alexandre. This looks like a good idea. Let me try this out!
On 2/1/2014 8:30 PM, Alexandre Patry wrote:
> The JDK provides classes to read and write zip files (see
> http://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html).
> You could serialize each JCas into its own entry of a zip file.
>
> Best,
>
> Alexandre
>
Re: Serializing multiple JCas objects to a single file
Posted by Alexandre Patry <al...@keatext.com>.
The JDK provides classes to read and write zip files (see
http://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html).
You could serialize each JCas into its own entry of a zip file.
Best,
Alexandre
--
Alexandre Patry, Ph.D
Chercheur / Researcher
http://KeaText.com
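
For anyone finding this thread later, here is a minimal sketch of the zip approach Alexandre describes, using only java.util.zip. It is not from any of the posters and is not UIMA API: the class name ZipCasArchive and its write/read methods are hypothetical, and it assumes each JCas has already been serialized to an XMI string (for example with code like Massimo's) and that each page has a unique entry name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper (not part of UIMA): one serialized CAS per zip entry.
final class ZipCasArchive {

    // Writes each (entry name -> XMI text) pair as its own entry in one zip file.
    static void write(Path zipFile, Map<String, String> xmiByName) throws IOException {
        try (OutputStream file = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(file)) {
            for (Map.Entry<String, String> e : xmiByName.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
    }

    // Reads every entry back, in archive order, as (entry name -> XMI text).
    static Map<String, String> read(Path zipFile) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (InputStream file = Files.newInputStream(zipFile);
             ZipInputStream zip = new ZipInputStream(file)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry.
                while ((n = zip.read(buffer)) != -1) {
                    bytes.write(buffer, 0, n);
                }
                result.put(entry.getName(),
                        new String(bytes.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return result;
    }
}
```

The Map is only there to keep the sketch short. With around a million pages you would more likely keep the ZipOutputStream open and write each page's XMI as it is produced, rather than collecting all the strings in memory first.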