Posted to user@uima.apache.org by swirl <sw...@yahoo.com> on 2013/10/07 03:19:12 UTC

Designing collection readers: Reading multiple XML files containing multiple CASes

Hi,
I am wondering if anyone has a better idea.

Requirements:
a. I have a pipeline that needs to process a bunch of XML files.
b. The XML files could be on disk or at a remote location (available
via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml).
c. Each XML file contains multiple sections; each section's content should
be parsed to produce a separate CAS.
d. I need to be able to parse XML of different schemas, although each
pipeline run only has to handle one specific XML schema. That is, I do not
need to handle different XML schemas within a single pipeline run.
e. Given the above, I need to be able to construct a new collection reader
and parser based on the specific needs of each application.
f. For example, I can specify that the XML files are in a disk folder and
that parser A should decode their specific schema. In another pipeline, I
can give the collection reader a list of URLs from which to retrieve some
remote XML files and parse them using parser B.

Here is what I have so far:
a. I am using ClearTK's UriCollectionReader to insert the URIs of files
from local disk folders and remote locations into the CAS. So far so good.
b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS
and parses the file according to XML schema A.
c. But the above only produces 1 CAS per XML file, so requirement c. is not
fulfilled. I need to produce multiple CASes from a single XML file. How do I
do this?

Thanks in advance.

Re: Designing collection readers: Reading multiple XML files containing multiple CASes

Posted by Richard Eckart de Castilho <re...@apache.org>.

The CpeBuilder from uimaFIT disassembles the top-level AAE you put
into it and turns the AEs inside into separate CPE-level components.
This is to allow AEs to be run in parallel while CCs are run
single-threaded.

If you want to run a pipeline with a CAS Multiplier in the CPE, then
you need to wrap it in an additional AAE.

-- Richard

On 10.10.2013, at 10:21, Swirl <lr...@gmail.com> wrote:

> Is CAS Multiplier usable in CPE?
> According to the documentation, I need to wrap it in an Aggregate AE with

Re: Designing collection readers: Reading multiple XML files containing multiple CASes

Posted by Swirl <lr...@gmail.com>.

> For part c:
>
> I imagine an algorithm that can scan the main XML file and find the
> "sections". For each section it finds, it can produce a CAS and initialize
> that CAS with the section's information.
>
> If this algorithm lives inside an analysis component, then it can use the
> "CAS Multiplier" to produce the additional CASes, one for each segment.
>
> See
> http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.cm
>
> Is that what you're looking for, or is that off-base?
>
> -Marshall

Yes, this is what I want.

I tried using a CAS Multiplier.
For the most part it worked (e.g. when used with
SimplePipeline.runPipeline or CpePipeline.runPipeline).

But when I tried to use it in a CollectionProcessingEngine, it only produced
1 CAS instead of the several CASes that should have been produced from 1
input document.

Here are my steps:
a. Create the CR description "readerDesc" to read in a text file.
b. Create the AnalysisEngineDescription "simpleTextSegmenterDesc" for
SimpleTextSegmenter.class, and the AnalysisEngineDescription
"casConsumerWriterDesc" to write each CAS into an XMI file.
c. AggregateBuilder aggregateBuilder = new AggregateBuilder();
aggregateBuilder.add(simpleTextSegmenterDesc);
aggregateBuilder.add(casConsumerWriterDesc);
AnalysisEngineDescription aaeDesc =
aggregateBuilder.createAggregateDescription();
aaeDesc.getAnalysisEngineMetaData()
.getOperationalProperties().setOutputsNewCASes(false);
d. CpeBuilder builder = new CpeBuilder();
builder.setReader(readerDesc);
builder.setAnalysisEngine(aaeDesc);
e. CollectionProcessingEngine cpe =
builder.createCpe(statusCallbackListener);
f. cpe.process();

I only got 1 XMI file instead of the several that I expected.

Is a CAS Multiplier usable in a CPE?
According to the documentation, I need to wrap it in an Aggregate AE with

Re: Designing collection readers: Reading multiple XML files containing multiple CASes

Posted by Marshall Schor <ms...@schor.com>.

For part c:

I imagine an algorithm that can scan the main XML file and find the "sections". 
For each section it finds, it can produce a CAS and initialize that CAS with the
section's information.

If this algorithm lives inside an analysis component, then it can use the "CAS
Multiplier" to produce the additional CASes, one for each segment.
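
As a rough illustration of the scanning idea, here is a minimal sketch in
plain Java with no UIMA dependency. The class name SectionSplitter and the
"section" element name are assumptions made up for this example. It follows
the same process / hasNext / next shape that a CAS Multiplier uses; in a
real multiplier, next() would fill a new empty CAS with the section text
instead of returning a String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper: splits one XML document into section texts. */
class SectionSplitter {
    // Sections found by process() that next() has not handed out yet.
    private final Deque<String> pending = new ArrayDeque<>();
    private final String sectionTag;

    SectionSplitter(String sectionTag) {
        this.sectionTag = sectionTag;
    }

    /** Scans the XML and queues the text of every matching element. */
    void process(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(sectionTag);
            for (int i = 0; i < nodes.getLength(); i++) {
                pending.add(nodes.item(i).getTextContent().trim());
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    /** True while there are sections left to turn into output documents. */
    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Returns the next section; a CAS Multiplier would fill a new CAS here. */
    String next() {
        return pending.remove();
    }
}
```

Because the whole file is parsed up front, this variant suits files that fit
comfortably in memory; a schema-specific parser would only need to change
how the sections are located.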

See
http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.cm

Is that what you're looking for, or is that off-base?

-Marshall

Re: Designing collection readers: Reading multiple XML files containing multiple CASes

Posted by Richard Eckart de Castilho <re...@apache.org>.

In the readers of the DKPro Core collection, in most cases, a reader is responsible for a particular format, not for a kind of data source (e.g. a URI). If a format has multiple documents in the same file, then we extract a part of the data, fill the CAS, but keep the stream to that file open so that the next time we can continue where we left off.
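
The "keep the stream open and continue where we left off" pattern could look
roughly like the following sketch. It is plain Java using the JDK's StAX
parser, not DKPro Core code; the class name MultiDocumentReader and the
element name passed as docTag are assumptions for illustration, and it only
handles elements that contain text and no child elements.

```java
import java.io.InputStream;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical reader: lazily yields one document text per call. */
class MultiDocumentReader implements Iterator<String>, AutoCloseable {
    private final Iterator<Supplier<InputStream>> sources;
    private final String docTag;
    private InputStream in;       // stream of the file currently being read
    private XMLStreamReader xml;  // parser positioned where we left off
    private String buffered;      // next document, if already located

    MultiDocumentReader(Iterator<Supplier<InputStream>> sources, String docTag) {
        this.sources = sources;
        this.docTag = docTag;
    }

    @Override
    public boolean hasNext() {
        if (buffered == null) {
            buffered = advance();
        }
        return buffered != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = buffered;
        buffered = null;
        return result;
    }

    /** Moves to the next matching element, opening new files as needed. */
    private String advance() {
        try {
            while (true) {
                if (xml == null) {
                    if (!sources.hasNext()) {
                        return null; // every file has been consumed
                    }
                    in = sources.next().get();
                    xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
                }
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT
                            && docTag.equals(xml.getLocalName())) {
                        // Text-only elements are assumed here.
                        return xml.getElementText().trim();
                    }
                }
                close(); // current file is exhausted; try the next one
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
    }

    /** Closes the parser and the underlying stream of the current file. */
    @Override
    public void close() {
        try {
            if (xml != null) xml.close();
            if (in != null) in.close();
        } catch (Exception e) {
            throw new IllegalStateException("Could not close input", e);
        } finally {
            xml = null;
            in = null;
        }
    }
}
```

In a collection reader, hasNext() and next() would map onto the reader's own
"has more documents" and "fill the next CAS" methods, so only one document
needs to be held in memory at a time.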

We tend to handle the data source abstraction via Spring resource resolvers. If we want to read from some place other than the file system or classpath, then we can plug an alternative resolver into a reader, e.g. for HDFS or CIFS file systems.
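
To make the resolver idea concrete without pulling in Spring, here is a small
stand-in sketch in plain Java. The SourceResolver interface and its two
factory methods are hypothetical names invented for this example, not part
of Spring, DKPro Core or UIMA. The point is only that the reader asks a
pluggable resolver for a stream and does not care where the bytes come from.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical abstraction: turns a location string into an open stream. */
interface SourceResolver {
    InputStream open(String location);

    /** Resolves plain paths against the local file system. */
    static SourceResolver fileSystem() {
        return location -> {
            try {
                return Files.newInputStream(Paths.get(location));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    /** Resolves any URL scheme the JVM knows, e.g. http, https or file. */
    static SourceResolver url() {
        return location -> {
            try {
                return URI.create(location).toURL().openStream();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }
}
```

A format-specific reader would take a SourceResolver as a constructor or
configuration parameter, so switching from a disk folder to a list of remote
URLs means swapping the resolver and leaving the parsing code untouched.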

Cheers,

-- Richard

On 07.10.2013, at 15:59, Thilo Goetz <tw...@gmx.de> wrote:

> I just want to point out that there is an alternative.  I never use collection readers and cas consumers myself.  Instead, I do the reading of the input and the aggregation of the output outside the framework, where I have more control over things.  Just my opinion though.
> 
> See http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.application.using_aes
> on how to do that.
> 
> --Thilo

Re: Designing collection readers: Reading multiple XML files containing multiple CASes

Posted by Thilo Goetz <tw...@gmx.de>.

I just want to point out that there is an alternative.  I never use
collection readers and CAS consumers myself.  Instead, I do the reading
of the input and the aggregation of the output outside the framework,
where I have more control over things.  Just my opinion though.

See 
http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.application.using_aes
on how to do that.

--Thilo

Re: Designing collection readers: Reading multiple XML files containing multiple CASes

Posted by Jens Grivolla <j+...@grivolla.net>.

It sounds to me like it would be much easier to just have a custom 
collection reader that outputs one CAS per document (i.e. multiple CASes 
per input file), rather than having a CR that outputs one CAS per file 
(with just metadata) plus an additional AE to generate the "real" CASes 
from there.

Do you have a specific reason for not simply writing a Collection Reader 
that does what you want?

Bye,
Jens
