Posted to user@uima.apache.org by Samudra Banerjee <sa...@gmail.com> on 2014/02/02 02:22:02 UTC

Serializing multiple JCas objects to a single file

Hi Experts,

I have a scenario where processing a Wikipedia XML dump generates a huge 
number of JCas objects (~1 million), one per page. I want to serialize 
these JCas objects for later use, but generating 1 million different 
files will take a toll on the system. So I was wondering if there was a 
way to serialize multiple JCas objects to a single file for later 
retrieval. Any idea if this can be achieved?

Thanks and Regards,
Samudra
-- 

*Samudra Banerjee*
First Year Graduate Student
Department of Computer Science
State University of New York
Stony Brook, NY 11790
631-496-6939


Re: Serializing multiple JCas objects to a single file

Posted by Samudra Banerjee <sa...@gmail.com>.
Thanks Massimo. Great idea indeed. However, in my current scenario I 
cannot use a database, so for now I think I should stick with the idea 
of storing them in zipped format :)



Re: Serializing multiple JCas objects to a single file

Posted by Massimo Nicosia <m....@gmail.com>.
An alternative could be serializing the XML content into a String and
saving it in a database or a fast key-value store.

I have used code like this:

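            // "cas" is presumably a JCas; needs java.io.ByteArrayOutputStream,
            // org.apache.uima.cas.impl.XmiCasSerializer and org.apache.uima.util.XMLSerializer.
            ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
            XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
            // false = no pretty-printing
            ser.serialize(cas.getCas(), new XMLSerializer(out, false).getContentHandler());
            out.close();
            // Decode explicitly rather than relying on the platform default charset
            // (assuming the serializer emitted UTF-8, the usual XML default).
            String xmlContent = out.toString("UTF-8");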

Best,
Massimo
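
For illustration, the key-value idea above could look roughly like the sketch below. Everything in it is an assumption: the class name XmiKeyValueStore is hypothetical, an in-memory map stands in for a real database or key-value store, and the gzip step is an optional extra that was not part of the original suggestion. The value passed to put() would be a string such as xmlContent from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical stand-in for a key-value store holding one XMI document per page. */
public class XmiKeyValueStore {
    // An in-memory map plays the role of the real database or key-value store.
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    /** Compresses the XMI text with gzip and saves it under the given page key. */
    public void put(String pageKey, String xmlContent) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(xmlContent.getBytes(StandardCharsets.UTF_8));
        }
        // The gzip stream is closed at this point, so the byte array is complete.
        store.put(pageKey, bytes.toByteArray());
    }

    /** Returns the XMI text stored under the key, or null if the key is absent. */
    public String get(String pageKey) throws IOException {
        byte[] compressed = store.get(pageKey);
        if (compressed == null) {
            return null;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Number of documents currently stored. */
    public int size() {
        return store.size();
    }
}
```

With a real store, only the two map calls would change; the compression and decoding steps would stay the same.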


Re: Serializing multiple JCas objects to a single file

Posted by Samudra Banerjee <sa...@gmail.com>.
Thanks Alexandre. This looks like a good idea. Let me try this out!



Re: Serializing multiple JCas objects to a single file

Posted by Alexandre Patry <al...@keatext.com>.
The JDK provides classes to read and write zip files (see 
http://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html). 
You could serialize each JCas in an entry of a zip file.

Best,

Alexandre

-- 
Alexandre Patry, Ph.D
Chercheur / Researcher
http://KeaText.com
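
The zip suggestion above can be sketched with the JDK's java.util.zip classes alone. This is a minimal sketch under stated assumptions, not a tested UIMA recipe: the class name CasZipArchive and the entry names are hypothetical, and each CAS is assumed to have already been serialized to an XMI string (for example with the XmiCasSerializer code earlier in the thread).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: stores one serialized CAS per entry of a single zip file. */
public class CasZipArchive {

    /** Writes each (entry name, XMI text) pair as one entry of the zip file. */
    public static void writeAll(Path zipFile, Map<String, String> xmiByName) throws IOException {
        try (OutputStream fileOut = Files.newOutputStream(zipFile);
             ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
            for (Map.Entry<String, String> e : xmiByName.entrySet()) {
                zipOut.putNextEntry(new ZipEntry(e.getKey()));
                // Assumption: with UIMA on the classpath, the CAS could presumably be
                // serialized straight into zipOut here instead of going via a String.
                zipOut.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zipOut.closeEntry();
            }
        }
    }

    /** Reads every entry back, keyed by entry name, in archive order. */
    public static Map<String, String> readAll(Path zipFile) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (InputStream fileIn = Files.newInputStream(zipFile);
             ZipInputStream zipIn = new ZipInputStream(fileIn)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the file.
                while ((n = zipIn.read(buffer)) != -1) {
                    bytes.write(buffer, 0, n);
                }
                result.put(entry.getName(),
                        new String(bytes.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return result;
    }
}
```

For a collection on the order of a million pages, readAll as written would hold everything in memory; in practice each entry would more likely be deserialized and processed inside the loop, one at a time.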