Posted to user@uima.apache.org by Oliver Christ <oc...@EBSCO.COM> on 2014/05/29 15:11:07 UTC

Collection Readers and File Format Filtering

Hi,

From my (still very limited) UIMA experience, it seems that collection readers address both how to retrieve documents from some location and how to import (or filter) those documents into the CAS.

Filtering (i.e. file format-specific processing) can be seen as independent of where the data is retrieved from. I'm wondering whether there's a "UIMA way" to separate the two aspects, i.e. a model consisting of two components: one which abstracts storage and retrieval, and a second which addresses file format filtering.

Thanks!

Cheers, Oli


Re: Collection Readers and File Format Filtering

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hello Oli,

I know of two strategies:

1) READER+AE: use a reader to control where the data is retrieved from. The reader reads the raw data format, e.g. a PDF file. A subsequent analysis engine then converts the raw data into what is actually to be processed, e.g. extracting the text from the PDF. I think that ClearTK [1] is moving in this direction nowadays.
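
To illustrate approach 1), here is a minimal uimaFIT-flavoured sketch (my own, not ClearTK code): the reader only fetches bytes and stores them in a "raw" view, and a separate AE performs the conversion into a "text" view. The class names, the view names and the extractText() placeholder are made up for the example; real code would plug in an actual PDF library there.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Iterator;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CASException;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.component.JCasCollectionReader_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.cas.ByteArray;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

/** Reader: retrieval only. Stores the unconverted bytes in a "raw" view. */
public class RawFileReader extends JCasCollectionReader_ImplBase {
  @ConfigurationParameter(name = "sourceDirectory")
  private File sourceDirectory;

  private Iterator<File> files;
  private int done, total;

  @Override
  public void initialize(UimaContext ctx) throws ResourceInitializationException {
    super.initialize(ctx);
    File[] all = sourceDirectory.listFiles();
    total = all.length;
    files = Arrays.asList(all).iterator();
  }

  @Override
  public boolean hasNext() { return files.hasNext(); }

  @Override
  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(done, total, Progress.ENTITIES) };
  }

  @Override
  public void getNext(JCas jcas) throws IOException, CollectionException {
    try {
      byte[] bytes = Files.readAllBytes(files.next().toPath());
      JCas raw = jcas.createView("raw");
      ByteArray data = new ByteArray(raw, bytes.length);
      data.copyFromArray(bytes, 0, 0, bytes.length);
      raw.setSofaDataArray(data, "application/pdf");
      done++;
    } catch (CASException e) {
      throw new CollectionException(e);
    }
  }
}

/** AE: format conversion only. Reads the "raw" view, writes a "text" view.
 *  Needs to be declared as a multi-view (sofa-aware) component. */
class PdfToTextExtractor extends JCasAnnotator_ImplBase {
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    try {
      JCas raw = jcas.getView("raw");
      JCas text = jcas.createView("text");
      text.setDocumentText(extractText(raw.getSofaDataStream()));
    } catch (CASException | IOException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }

  // Placeholder: substitute a call to a real PDF text extraction library here.
  private String extractText(InputStream in) throws IOException {
    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
  }
}

Downstream components then simply operate on the "text" view, without caring whether the raw data was PDF, HTML, or anything else.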

2) READER+PLUGIN: use a reader to perform the data conversion. The reader may be configured with a strategy that controls where the data is obtained from. DKPro Core [2] takes that direction. Most readers can be configured with a custom Spring ResourcePatternResolver, e.g. to access files from HDFS (afaik a corresponding ResourcePatternResolver is included in Spring for Apache Hadoop [3]). I also did a proof-of-concept ResourcePatternResolver for Samba shares once.
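
And a rough sketch of approach 2), again hypothetical rather than actual DKPro Core code: the reader performs the conversion itself, while the question of where the data lives is hidden behind Spring's ResourcePatternResolver. The class and parameter names are invented for the example; only the Spring and UIMA types are real.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.component.JCasCollectionReader_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.io.support.ResourcePatternResolver;

/** Reader: conversion lives here; *where* the data comes from is delegated
 *  to a pluggable ResourcePatternResolver. */
public class ResolvingTextReader extends JCasCollectionReader_ImplBase {
  // e.g. "file:corpus/**/*.txt" or "classpath*:corpus/**/*.txt"
  @ConfigurationParameter(name = "sourceLocation")
  private String sourceLocation;

  // Swap in an HDFS- or Samba-aware resolver to change the storage backend.
  private ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();

  private Resource[] resources;
  private int next;

  @Override
  public void initialize(UimaContext ctx) throws ResourceInitializationException {
    super.initialize(ctx);
    try {
      resources = resolver.getResources(sourceLocation);
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public boolean hasNext() { return next < resources.length; }

  @Override
  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(next, resources.length, Progress.ENTITIES) };
  }

  @Override
  public void getNext(JCas jcas) throws IOException, CollectionException {
    try (InputStream in = resources[next++].getInputStream()) {
      // In this model the format-specific conversion happens inside the reader.
      jcas.setDocumentText(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
  }
}

The nice property here is that the resolver is an ordinary object: the same reader works against the local file system, the classpath, or any backend for which a resolver implementation exists.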

I guess it boils down to whether you consider it important to have the raw data in the CAS. Some people may see that as a benefit, others may consider it a waste of memory.

In the olden times, there was a thing called CasInitializer [4], which appears to have been a plugin that a reader could use to extract information from the raw data and fill it into the CAS. That sounds like approach 2) mentioned above. However, the CasInitializer has been deprecated for quite some time now, and its Javadoc says to use different views instead (which sounds like approach 1). Maybe somebody else can provide some detail as to why the CasInitializer was deprecated - I never used it, but I always thought it sounded like quite a useful concept.

Cheers,

-- Richard

[1] http://cleartk.googlecode.com
[2] https://code.google.com/p/dkpro-core-asl/
[3] http://projects.spring.io/spring-hadoop/
[4] http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/api/org/apache/uima/collection/CasInitializer.html

P.S.: none of the mentioned projects are ASF projects. I am affiliated with the DKPro Core project.
