Posted to user@uima.apache.org by Kannan Chellappa <kc...@kana.com> on 2007/11/12 08:06:26 UTC

CPM batch processing

I want to process my document collection using CPM and I want to use the
batch feature.

The documentation says that the following method in
CollectionProcessingManager

    void process(CollectionReader aCollectionReader, int aBatchSize)
        throws ResourceInitializationException

breaks the processing into batches of size determined by the aBatchSize
parameter. Each CasConsumer will be notified at the end of the batch.

When I tried this method in my application, processing stopped after
the first batch of documents. I was hoping that execution would
continue to the next batch of documents after each batch completed.

I tried the following as a test.

I downloaded the uimaj-2.2.0 binaries onto my computer and used
SimpleRunCPM from the examples to perform my test.

I modified SimpleRunCPM.java in org.apache.uima.examples.cpe, changing
the batch size to 4 (instead of 10), and then ran it with the following
command-line arguments:

C:\uimaj-2.2.0-incubating\apache-uima\examples\descriptors\collection_reader\FileSystemCollectionReader.xml

C:\uimaj-2.2.0-incubating\apache-uima\examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml

C:\uimaj-2.2.0-incubating\apache-uima\examples\descriptors\cas_consumer\XmiWriterCasConsumer.xml

I modified FileSystemCollectionReader.xml so that the default input
directory is C:\uimaj-2.2.0-incubating\apache-uima\examples\data

The input folder has 8 text files, but processing completes after 4
documents.

Is this the expected behavior? If not, is there anything I need to change
in the code to get multiple batches to work?

Thanks in advance for any help.

-kannan


Re: CPM batch processing

Posted by Marshall Schor <ms...@schor.com>.
Hi Kannan -

I think you have come across a "partially implemented" feature, which
has never been completed.

One work-around is to implement batching yourself in your Cas
Consumer(s): pass in a batch-size parameter in the Cas Consumer
descriptor, have each consumer that needs batching count the number of
documents it has processed, and do the end-of-batch processing once
the count reaches the batch size.
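
The counting logic described above could be sketched roughly as follows.
This is only an illustration in plain Java with no UIMA dependency: the
class BatchCounter and its methods documentProcessed() and flush() are
hypothetical names, not part of the UIMA API. The assumption is that a
Cas Consumer would read the batch size from its own configuration
parameter, call documentProcessed() once per CAS from its processCas
method, and call flush() when the collection is complete so that a
final, partial batch is not lost.

```java
/**
 * Hypothetical helper (not part of UIMA): counts processed documents
 * and reports when a batch of the configured size has been completed.
 */
final class BatchCounter {
    private final int batchSize;
    private int inCurrentBatch = 0;
    private int completedBatches = 0;

    BatchCounter(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Call once per document; returns true if this document completes a batch. */
    boolean documentProcessed() {
        inCurrentBatch++;
        if (inCurrentBatch == batchSize) {
            inCurrentBatch = 0;
            completedBatches++;
            return true;
        }
        return false;
    }

    /** Call at end of collection; returns true if a partial batch was pending. */
    boolean flush() {
        if (inCurrentBatch > 0) {
            inCurrentBatch = 0;
            completedBatches++;
            return true;
        }
        return false;
    }

    int getCompletedBatches() {
        return completedBatches;
    }

    public static void main(String[] args) {
        // Mirrors the test in this thread: 8 documents, batch size 4.
        BatchCounter counter = new BatchCounter(4);
        for (int doc = 1; doc <= 8; doc++) {
            if (counter.documentProcessed()) {
                // The consumer's own end-of-batch work would go here.
                System.out.println("End of batch after document " + doc);
            }
        }
        if (counter.flush()) {
            System.out.println("Final partial batch flushed");
        }
        // Prints:
        // End of batch after document 4
        // End of batch after document 8
    }
}
```

Each consumer instance would hold its own counter, so the batch
boundaries are local to that instance.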

If your Cas Consumers are scaled out by being replicated, be aware that
each instance will not "see" every CAS that is flowing through the system.
You can specify whether a Cas Consumer may be replicated using the
following XML in its descriptor:

    <operationalProperties>
      <multipleDeploymentAllowed>true|false</multipleDeploymentAllowed>
    </operationalProperties>

See section 2.4.1.9 in this part of the reference manual:
http://incubator.apache.org/uima/downloads/releaseDocs/2.2.0-incubating/docs/html/references/references.html#ugr.ref.xml.component_descriptor.aes.primitive

-Marshall
