Posted to user@uima.apache.org by Mathaeus Dejori <de...@gmail.com> on 2013/08/07 00:10:06 UTC

Processing a List of Strings with UIMA Addons components

Hi,

I'd like to use UIMA AS to annotate a large list of text segments. Instead
of passing each text segment individually to the AnalysisEngine I'd like to
pass the entire list at once.

As far as I understand, I can use cas.setSofaDataArray() to pass a list of
Strings and get back Annotations that refer to particular segments. However,
in doing so I won't be able to use any of the existing Annotators (e.g.
Concept Mapper), as their process(cas, spec) function expects
cas.getDocumentText().

Is there a design pattern for UIMA to consume a list of strings, pass
individual elements to specific Annotators and combine all the results at
the end?

Thanks, Mathaeus

Re: AW: Processing a List of Strings with UIMA Addons components

Posted by Marshall Schor <ms...@schor.com>.

On 8/7/2013 2:33 AM, Armin.Wegner@bka.bund.de wrote:
> Dear Marshall,
>
> Consider an input text from which only some parts should be processed. After
> processing, the text should be there in one piece again. Let X denote parts of
> no interest and let A denote parts to analyse further. XAX is split up into X,
> A, and X. There is nothing to do for the X segments. A has to be put into the
> pipeline. I only know how to use the CAS Multiplier if every segment has to be
> processed. But in this case some segments have to be left out. Is there a way
> to bypass the pipeline for the X segments? How to do the splitting and
> combining?
>
> Cheers,
> Armin
There are lots of ways to do this.  If the splitter annotator is written as a
CAS Multiplier that splits the text into X, A, and X and sends these along, it
can put an "extra" bit of control info into each CAS.  A custom flow controller
could then use that info as a flag to either route the CAS through the
processing pipeline or bypass it.

After all split-out parts are done, the splitter annotator could send a "final"
CAS which would have a flag that the flow controller would use to bypass
processing, but that same flag would serve to signal the "recombiner" annotator
that the parts processing was finished, and it should recombine things.
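
The flag-based routing described above can be sketched in plain Java. This is
only a model and does not use the UIMA API: Segment, FlagRouting, and the
"process" and "last" flags are hypothetical stand-ins. In a real aggregate the
routing decision would live in a custom flow controller (for example one
extending CasFlowController_ImplBase) and the flags would be features stored in
each CAS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a child CAS emitted by the splitter.
final class Segment {
    final String text;
    final boolean process; // control flag set by the splitter: true for A, false for X
    final boolean last;    // marks the "final" CAS that tells the recombiner to finish
    String result;

    Segment(String text, boolean process, boolean last) {
        this.text = text;
        this.process = process;
        this.last = last;
        this.result = text; // unprocessed segments keep their text unchanged
    }
}

final class FlagRouting {
    // Stand-in for the processing pipeline; here it only upper-cases the text.
    static void pipeline(Segment s) {
        s.result = s.text.toUpperCase(Locale.ROOT);
    }

    // Stand-in for the flow controller plus the recombiner.
    static String run(List<Segment> segments) {
        StringBuilder combined = new StringBuilder();
        for (Segment s : segments) {
            if (s.last) {
                break;                 // final CAS: bypass processing, recombine now
            }
            if (s.process) {
                pipeline(s);           // A segments are routed through the pipeline
            }
            combined.append(s.result); // X segments arrive unchanged
        }
        return combined.toString();
    }

    static String demo() {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("foo ", false, false)); // X
        segments.add(new Segment("bar", true, false));   // A
        segments.add(new Segment(" baz", false, false)); // X
        segments.add(new Segment("", false, true));      // final marker
        return run(segments);
    }
}
```

Calling FlagRouting.demo() recombines the three segments with only the middle
one changed by the pipeline.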

------------------

Another way: Inside the splitter annotator, you can instantiate a brand-new,
completely independent UIMA pipeline.  Then within that annotator, it can do all
the work of splitting, sending something through that sub-pipeline, and
retrieving the results back into the original CAS in whatever way makes sense.

Because the sub-pipeline is independent, it can even have a different type
system.  You would write whatever transformation / copying code is needed
(there's a CasCopier class that can help to copy things between CASes).
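
One detail of the sub-pipeline approach is worth illustrating: if only a region
of the original text is sent through the inner pipeline, the offsets it reports
are relative to that region and presumably need shifting when results are copied
back. The plain-Java sketch below models just that step. Span, innerPipeline,
and annotateRegion are hypothetical names, the keyword search is a placeholder
for real analysis, and CasCopier is not used here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotation: a labelled span with begin/end offsets.
final class Span {
    final int begin;
    final int end;
    final String label;

    Span(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }
}

final class SubPipelineDemo {
    // Stand-in for the independent inner pipeline: marks every "UIMA" occurrence.
    // Offsets are relative to the text it receives.
    static List<Span> innerPipeline(String text) {
        List<Span> out = new ArrayList<>();
        int i = text.indexOf("UIMA");
        while (i >= 0) {
            out.add(new Span(i, i + 4, "Keyword"));
            i = text.indexOf("UIMA", i + 4);
        }
        return out;
    }

    // Stand-in for the outer annotator: runs the inner pipeline on the region
    // [begin, end) and copies the results back, shifted to document offsets.
    static List<Span> annotateRegion(String document, int begin, int end) {
        List<Span> copied = new ArrayList<>();
        for (Span s : innerPipeline(document.substring(begin, end))) {
            copied.add(new Span(s.begin + begin, s.end + begin, s.label));
        }
        return copied;
    }

    static List<Span> demo() {
        String document = "skip UIMA here | use UIMA now";
        int begin = document.indexOf('|') + 2; // analyse only the part after "| "
        return annotateRegion(document, begin, document.length());
    }
}
```

Only the second "UIMA" is annotated, and its span refers to positions in the
full document rather than in the extracted region.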

HTH. -Marshall

Re: Processing a List of Strings with UIMA Addons components

Posted by Richard Eckart de Castilho <ri...@gmail.com>.

You could write a custom flow controller which checks whether the CAS represents an A or an X segment and, depending on that, forwards the CAS either to the processing components or directly to the end of the pipeline.

-- Richard


AW: Processing a List of Strings with UIMA Addons components

Posted by Ar...@bka.bund.de.

Dear Marshall,

Consider an input text from which only some parts should be processed. After processing, the text should be there in one piece again. Let X denote parts of no interest and let A denote parts to analyse further. XAX is split up into X, A, and X. There is nothing to do for the X segments. A has to be put into the pipeline. I only know how to use the CAS Multiplier if every segment has to be processed. But in this case some segments have to be left out. Is there a way to bypass the pipeline for the X segments? How to do the splitting and combining?

Cheers,
Armin



Re: Processing a List of Strings with UIMA Addons components

Posted by Burn Lewis <bu...@gmail.com>.

With a custom flow controller you can avoid the need for a CasMultiplier as
the final component ... it could be just an annotator that accumulates the
results from each of the child CASes and puts them in the input CAS when it
arrives, and the flow controller could be designed to send the input CAS
straight to the final component. So in a 3-component aggregate of CM+AE+CC
(CAS Multiplier, Analysis Engine, CAS Consumer), the input CAS would skip the
AE, and the child CASes would be dropped after the AE+CC, so only the
filled-out input CAS would exit.
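
A rough plain-Java model of this CM+AE+CC arrangement follows. It does not use
the UIMA API. Cas, SkipFlowDemo, and the step names are hypothetical, and the
sketch assumes the input CAS reaches the final component only after all of its
children have passed through it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a CAS: some text plus a list of results.
final class Cas {
    final String text;
    final boolean child;
    final List<String> results = new ArrayList<>();

    Cas(String text, boolean child) {
        this.text = text;
        this.child = child;
    }
}

final class SkipFlowDemo {
    private final List<String> buffer = new ArrayList<>();

    // CM: one child CAS per line of the input CAS.
    static List<Cas> multiplier(Cas parent) {
        List<Cas> children = new ArrayList<>();
        for (String line : parent.text.split("\n")) {
            children.add(new Cas(line, true));
        }
        return children;
    }

    // AE: placeholder analysis that records the upper-cased text.
    static void annotator(Cas cas) {
        cas.results.add(cas.text.toUpperCase(Locale.ROOT));
    }

    // CC: buffers child results; fills the input CAS when it arrives.
    void collector(Cas cas) {
        if (cas.child) {
            buffer.addAll(cas.results);
        } else {
            cas.results.addAll(buffer);
            buffer.clear();
        }
    }

    // Flow controller: children visit AE then CC, the input CAS goes straight to CC.
    static List<String> route(Cas cas) {
        return cas.child ? Arrays.asList("AE", "CC") : Arrays.asList("CC");
    }

    void follow(Cas cas) {
        for (String step : route(cas)) {
            if (step.equals("AE")) {
                annotator(cas);
            } else {
                collector(cas);
            }
        }
    }

    // Returns the results held by the only CAS that exits: the input CAS.
    static List<String> demo() {
        SkipFlowDemo flow = new SkipFlowDemo();
        Cas parent = new Cas("alpha\nbeta", false);
        for (Cas child : multiplier(parent)) {
            flow.follow(child); // child is dropped after CC
        }
        flow.follow(parent);    // input CAS skips the AE
        return parent.results;
    }
}
```

The design choice here is that the final component keeps state between CASes,
so it has to be reset once the input CAS has been filled.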

~Burn



Re: Processing a List of Strings with UIMA Addons components

Posted by Marshall Schor <ms...@schor.com>.

On 8/23/2013 11:11 AM, harshal patni wrote:
> Hello Marshall,
> Thank you for the suggestion! This works for us! As per your suggestion, we
> have now created an Aggregate Analysis Engine that contains a CAS Multiplier
> (Splitter), our original aggregate engine, and a CAS Merger (to merge the
> results into one CAS at the end).
>
> But the final merged CAS contains the child CASes (created in the splitter)
> and the parent CAS as well. Is this expected? Any idea why?
This is under the control of the "flow controller" being used in the aggregate. 
If you haven't written your own (where you can explicitly control what happens),
then you're probably using one of the pre-built ones, whose behavior is
documented here:

http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.cm.cm_and_fc

I've copied a bit of this below:


      7.3.2. CAS Multipliers and Flow Control

CAS Multipliers are only supported in the context of Fixed Flow or custom Flow
Control. If you use the built-in "Fixed Flow" for your Aggregate Analysis
Engine, you can position the CAS Multiplier anywhere in that flow. Processing
then works as follows: When a CAS is input to the Aggregate AE, that CAS is
routed to the components in the order specified by the Fixed Flow, until that
CAS reaches a CAS Multiplier.

Upon reaching a CAS Multiplier, if that CAS Multiplier produces new output
CASes, then each output CAS from that CAS Multiplier will continue through the
flow, starting at the node immediately after the CAS Multiplier in the Fixed
Flow. No further processing will be done on the original input CAS after it has
reached a CAS Multiplier -- it will /not/ continue in the flow.

If the CAS Multiplier does /not/ produce any output CASes for a given input CAS,
then that input CAS /will/ continue in the flow. This behavior is appropriate,
for example, for a CAS Multiplier that may segment an input CAS into pieces but
only does so if the input CAS is larger than a certain size.
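
The Fixed Flow rules quoted above can be modelled in a few lines of plain Java.
This is a sketch of the routing rule only, not UIMA code: the three-node flow
(tokenizer, splitter, tagger), the size threshold, and the trace format are all
hypothetical. It records which node each CAS visits and says nothing about
which CASes finally leave the aggregate.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedFlowModel {
    static final int MAX_SIZE = 5; // hypothetical threshold for segmenting

    // Hypothetical CAS Multiplier: splits on spaces, but only for large inputs.
    static List<String> split(String text) {
        List<String> children = new ArrayList<>();
        if (text.length() > MAX_SIZE) {
            for (String part : text.split(" ")) {
                children.add(part);
            }
        }
        return children;
    }

    // Fixed flow: tokenizer -> splitter (CAS Multiplier) -> tagger.
    // Returns a trace of "node:text" entries in visiting order.
    static List<String> run(String input) {
        List<String> trace = new ArrayList<>();
        trace.add("tokenizer:" + input);
        trace.add("splitter:" + input);
        List<String> children = split(input);
        if (children.isEmpty()) {
            // No output CASes were produced: the input CAS continues in the flow.
            trace.add("tagger:" + input);
        } else {
            // Output CASes continue at the node after the multiplier;
            // the input CAS does not continue.
            for (String child : children) {
                trace.add("tagger:" + child);
            }
        }
        return trace;
    }
}
```

With these assumptions, run("ab cd ef") sends three children to the tagger and
the parent stops at the splitter, while run("ab") is too short to be split and
so reaches the tagger itself.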


---------

Does this help?

-Marshall

>
> We used CAS splitter and merger for a synchronous UIMA pipeline as well.
> That does not give us the parent CAS in the final result (Merged CAS). Why
> the difference?
>
> Harshal

Re: Processing a List of Strings with UIMA Addons components

Posted by harshal patni <pa...@gmail.com>.

Hello Marshall,

Thank you for the suggestion! This works for us! As per your suggestion, we
have now created an Aggregate Analysis Engine that contains a CAS Multiplier
(Splitter), our original aggregate engine, and a CAS Merger (to merge the
results into one CAS at the end).

But the final merged CAS contains the child CASes (created in the splitter)
and the parent CAS as well. Is this expected? Any idea why?

We used CAS splitter and merger for a synchronous UIMA pipeline as well.
That does not give us the parent CAS in the final result (Merged CAS). Why
the difference?

Harshal






Re: Processing a List of Strings with UIMA Addons components

Posted by Marshall Schor <ms...@schor.com>.

On 8/6/2013 6:10 PM, Mathaeus Dejori wrote:
> Hi,
>
> I'd like to use UIMA AS to annotate a large list of text segments. Instead
> of passing each text segment individually to the AnalysisEngine I'd like to
> pass the entire list at once.
>
> As far as I understand I can use the cas.setSofaDataArray() to pass a list
> of Strings and get back Annotations that refer to particular segments.
> However, in doing so I won't be able to use any of the existing Annotators
> (e.g. Concept Mapper) as their process(cas, spec) function expects the
> cas.getDocumentText().
>
> Is there a design pattern for uima to consume a list of strings, pass
> individual elements to specific Annotators and combine all the results at
> the end?
If what you are trying to do is to take an input CAS which has a bunch of
"strings" and send each one through a pipeline, the normal UIMA design pattern
for that is to use a CAS Multiplier at the start, which gets as input the CAS
with all the strings, puts each one into another CAS, and sends it through the
pipeline.  If the combining you want to do is to combine all the results into
another CAS, then you can use another CAS Multiplier at the end, which receives
the individual string CASes, accumulates results until all the parts are
done, and then outputs a "result" CAS with the combined result.

See http://uima.apache.org/d/uimaj-2.4.1/tutorials_and_users_guides.html#ugr.tug.cm
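
The split-then-merge pattern can be sketched in plain Java as follows. This is
a model, not UIMA code: Part, Merger, and SplitMergeDemo are hypothetical
names, and the idea that each part carries its index and the total count (so
the merger knows when everything has arrived) is an assumption added for the
sketch. In UIMA the two ends would typically be CAS Multipliers, for example
classes extending JCasMultiplier_ImplBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical child CAS: one string plus its position in the original list.
final class Part {
    final int index;
    final int total;
    final String text;

    Part(int index, int total, String text) {
        this.index = index;
        this.total = total;
        this.text = text;
    }
}

// Stand-in for the merging component at the end of the pipeline.
final class Merger {
    private final TreeMap<Integer, String> results = new TreeMap<>();

    // Accumulates one result; returns the combined result once all parts arrived.
    Optional<List<String>> receive(Part part, String result) {
        results.put(part.index, result); // keyed by index, so arrival order is free
        if (results.size() < part.total) {
            return Optional.empty();
        }
        List<String> combined = new ArrayList<>(results.values());
        results.clear();
        return Optional.of(combined);
    }
}

final class SplitMergeDemo {
    // Stand-in for the splitting component at the start of the pipeline.
    static List<Part> split(List<String> strings) {
        List<Part> parts = new ArrayList<>();
        for (int i = 0; i < strings.size(); i++) {
            parts.add(new Part(i, strings.size(), strings.get(i)));
        }
        return parts;
    }

    // Placeholder analysis: annotate each string with its length.
    static String analyze(Part part) {
        return part.text + "=" + part.text.length();
    }

    static List<String> demo() {
        Merger merger = new Merger();
        List<String> combined = new ArrayList<>();
        for (Part part : split(Arrays.asList("ab", "cde"))) {
            Optional<List<String>> done = merger.receive(part, analyze(part));
            if (done.isPresent()) {
                combined = done.get(); // the "result" CAS in this model
            }
        }
        return combined;
    }
}
```

Keying the accumulated results by index rather than by arrival order keeps the
combined result stable even if the parts were to come back out of order.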

-Marshall