Posted to dev@creadur.apache.org by Robert Burrell Donkin <ro...@blueyonder.co.uk> on 2013/08/05 16:11:09 UTC

[RAT] Pipelines...

Essentially, Rat is simple.

A source (perhaps a file system or a compressed archive) is walked,
producing documents. Each document (perhaps a file in a file system, or
a resource in an archive) flows through a pipeline - a series of
processing steps, each enriching it with various meta-data. An end point
collates the data.
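
A minimal sketch of that flow, to make the three roles (walker, steps, end point) concrete. Every name here (PipelineFlowSketch, Doc, walk, run) is hypothetical and invented for illustration; none of it is taken from the Rat code base, and an in-memory map stands in for a real file system or archive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical names throughout; these types are not part of Rat.
final class PipelineFlowSketch {

    /** A document produced by walking a source, plus the meta-data gathered so far. */
    static final class Doc {
        final String name;
        final String content;
        final Map<String, String> metaData = new LinkedHashMap<>();

        Doc(String name, String content) {
            this.name = name;
            this.content = content;
        }
    }

    /** Walks a source; the map is an assumed stand-in for a file system or archive. */
    static List<Doc> walk(Map<String, String> source) {
        List<Doc> docs = new ArrayList<>();
        for (Map.Entry<String, String> entry : source.entrySet()) {
            docs.add(new Doc(entry.getKey(), entry.getValue()));
        }
        return docs;
    }

    /** Sends each document through every step in order, then on to the end point. */
    static void run(List<Doc> docs, List<Consumer<Doc>> steps, Consumer<Doc> endPoint) {
        for (Doc doc : docs) {
            for (Consumer<Doc> step : steps) {
                step.accept(doc); // each step enriches the document's meta-data
            }
            endPoint.accept(doc); // the end point collates the result
        }
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("A.java", "Licensed to the Apache Software Foundation");
        source.put("notes.txt", "todo");

        List<Consumer<Doc>> steps = new ArrayList<>();
        steps.add(d -> d.metaData.put("type", d.name.endsWith(".java") ? "source" : "other"));
        steps.add(d -> d.metaData.put("licensed", String.valueOf(d.content.contains("Licensed to"))));

        List<String> report = new ArrayList<>();
        run(walk(source), steps, d -> report.add(d.name + " " + d.metaData));
        report.forEach(System.out::println);
    }
}
```

The steps here are plain functions over one shared document type, which is roughly the "potentially flexible" wiring; nothing in the sketch constrains their order or purpose.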

It seems to me that the current code fails to express this.

...

At the moment, IDocumentAnalyser[1] is implemented by most steps in the 
pipeline (and other stuff too), wired together in a potentially flexible 
fashion. This now seems over-engineered to me.

I think a concrete Pipeline would be more obvious, with controlled 
extension points at each step of the processing.
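
To illustrate what "concrete, with controlled extension points" might look like, here is a rough sketch. It is only an assumption about the shape of such a design: the class ConcretePipelineSketch, the two stages, and the TypeGuesser and HeaderMatcher interfaces are all hypothetical, and they are not Rat's actual stages or API. The point is that the stage order is fixed in one place, and extensions can only plug into a typed slot belonging to one stage.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: stage names and interfaces are invented for illustration.
final class ConcretePipelineSketch {

    /** Extension point for stage one: returns a type label, or null for "no opinion". */
    interface TypeGuesser {
        String guess(String name);
    }

    /** Extension point for stage two: returns a license label, or null for "no match". */
    interface HeaderMatcher {
        String match(String content);
    }

    private final List<TypeGuesser> guessers = new ArrayList<>();
    private final List<HeaderMatcher> matchers = new ArrayList<>();

    ConcretePipelineSketch addTypeGuesser(TypeGuesser guesser) {
        guessers.add(guesser);
        return this;
    }

    ConcretePipelineSketch addHeaderMatcher(HeaderMatcher matcher) {
        matchers.add(matcher);
        return this;
    }

    /** The stage order is fixed here; callers cannot reorder or insert stages. */
    Map<String, String> process(String name, String content) {
        Map<String, String> metaData = new LinkedHashMap<>();
        metaData.put("type", guessType(name));
        metaData.put("license", matchHeader(content));
        return metaData;
    }

    private String guessType(String name) {
        for (TypeGuesser guesser : guessers) {
            String type = guesser.guess(name);
            if (type != null) {
                return type; // first extension with an opinion wins
            }
        }
        return "unknown";
    }

    private String matchHeader(String content) {
        for (HeaderMatcher matcher : matchers) {
            String license = matcher.match(content);
            if (license != null) {
                return license;
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        ConcretePipelineSketch pipeline = new ConcretePipelineSketch()
            .addTypeGuesser(n -> n.endsWith(".java") ? "source" : null)
            .addHeaderMatcher(c -> c.contains("Apache License") ? "apache" : null);
        System.out.println(pipeline.process("A.java", "Apache License, Version 2.0"));
        System.out.println(pipeline.process("README", "no header"));
    }
}
```

The trade-off is less flexibility in exchange for a pipeline whose shape can be read directly from one method.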

Opinions...?
Objections...?

Robert
[1] 
http://svn.apache.org/viewvc/creadur/rat/trunk/apache-rat-core/src/main/java/org/apache/rat/document/IDocumentAnalyser.java?view=markup

Re: [RAT] Pipelines...

Posted by Robert Burrell Donkin <rd...@apache.org>.

On 08/05/13 15:47, Marshall Schor wrote:

<snip>

> It may be overkill ( :-) ), however, the Apache UIMA project has this very idea
> of enabling assembly of components in a pipeline, and passing a thing (called
> the CAS - Common Annotation Structure/System) to each "annotator" component,
> which may add arbitrary metadata info to the CAS.
>
> For intro, see the getting started parts of the documentation at uima.apache.org.

Quite possibly overkill but interesting :-)

Thanks for the link, Marshall, and glad to see UIMA seems to be going 
strong :-)

Robert

Re: [RAT] Pipelines...

Posted by Marshall Schor <ms...@schor.com>.

On 8/5/2013 10:11 AM, Robert Burrell Donkin wrote:
> Essentially, Rat is simple.
>
> A source (perhaps a file system or a compressed archive) is walked, producing
> documents. Each document (perhaps a file in a file system, or a resource in
> an archive) flows through a pipeline - a series of processing steps, each
> enriching it with various meta-data. An end point collates the data.
>
> It seems to me that the current code fails to express this
>
> ...
>
> At the moment, IDocumentAnalyser[1] is implemented by most steps in the
> pipeline (and other stuff too), wired together in a potentially flexible
> fashion. This now seems over-engineered to me.
>
> I think a concrete Pipeline would be more obvious, with controlled extension
> points at each step of the processing.
>
> Opinions...?
> Objections...?

Hi,

It may be overkill ( :-) ), however, the Apache UIMA project has this very idea
of enabling assembly of components in a pipeline, and passing a thing (called
the CAS - Common Annotation Structure/System) to each "annotator" component,
which may add arbitrary metadata info to the CAS.

For an intro, see the getting started parts of the documentation at uima.apache.org.

-Marshall Schor

>
> Robert
> [1]
> http://svn.apache.org/viewvc/creadur/rat/trunk/apache-rat-core/src/main/java/org/apache/rat/document/IDocumentAnalyser.java?view=markup
>