Posted to user@uima.apache.org by Petr Baudis <pa...@ucw.cz> on 2015/09/06 16:11:24 UTC

CAS merger/multiplier N:M mapping

  Hi!

  I'm currently struggling to perform a complex flow transformation with
UIMA.  I have multiple (N) CASes with some fulltext search results.
I chop these search results into sentences and would like to pick the top
M sentences from the collected search results and build CASes from them
for further analysis.  So, I'd like to copy subsets (of both document text
and annotations) of N input CASes to M output CASes.  I don't
know how to do this technically.  I tried two non-workable ideas so far:

  (i) Keep around references to the respective views of input CASes
and use them as CasCopier sources when the time comes to produce
the new CASes.  Turns out the input CASes are (unsurprisingly) recycled
and the references I kept around at process() time aren't valid when
next() is called much later.

  (ii) Use an internal "intermediary" CAS instance in process() to which
I append my sentences, then use it as a source of output CASes.  Turns
out (surprisingly) that I can't append to a sofa documenttext ("Data for
Sofa feature setLocalSofaData() has already been set." - not sure about
the reason for this restriction).

  I think the only choice except downright unmaintainable hacks (like
programmatically generated M views) is to just give up on preserving my
annotations and carry over just the sentence texts.  Am I missing
something?


  (I'm somewhat tempted to cut my losses short (much too late) and
abandon UIMA flow control altogether, using only simple pipelines and
having custom glue code to connect these together, as it seems like
getting the flow to work in interesting cases is a huge time sink and in
retrospect, it could never pay off any abstract advantage of easier
distributed processing (where you probably end up having to chop up the
pipeline manually anyway).  I would probably never recommend new UIMA
users to strive for a single pipeline with CAS multipliers/mergers and
begin to consider these features an evolutionary dead end rather than
advantageous.  Not sure if there even *are* any other real users using
advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
on this!)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: CAS merger/multiplier N:M mapping

Posted by Eddie Epstein <ea...@gmail.com>.
Petr,


> > >   (I'm somewhat tempted to cut my losses short (much too late) and
> > > abandon UIMA flow control altogether, using only simple pipelines and
> > > having custom glue code to connect these together, as it seems like
> > > getting the flow to work in interesting cases is a huge time sink and in
> > > retrospect, it could never pay off any abstract advantage of easier
> > > distributed processing (where you probably end up having to chop up the
> > > pipeline manually anyway).  I would probably never recommend new UIMA
> > > users to strive for a single pipeline with CAS multipliers/mergers and
> > > begin to consider these features an evolutionary dead end rather than
> > > advantageous.  Not sure if there even *are* any other real users using
> > > advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> > > on this!)
> > >
> > >
> > Definitely the advantage to encapsulating analytics in standard UIMA
> > components is easy scalability via the vertical and horizontal scale out
> > options offered by UIMA-AS and DUCC. Flexibility in chopping up a
> > pipeline into services as needed is another advantage.
>
>   But as far as I understand, you need to explicitly define and deploy
> AEs that are to be run on different machines anyway.  So I'm not sure if
> the extra value is really that large in the end?
>
>
Well, yes. But with DUCC only the definition needs to be done explicitly;
the deployment and replicated scale-out of all components are done
automatically.

Eddie

Re: CAS merger/multiplier N:M mapping

Posted by Petr Baudis <pa...@ucw.cz>.
  Hi!

On Sun, Sep 06, 2015 at 10:58:44AM -0400, Eddie Epstein wrote:
> On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis <pa...@ucw.cz> wrote:
> >   (ii) Use an internal "intermediary" CAS instance in process() to which
> > I append my sentences, then use it as a source of output CASes.  Turns
> > out (surprisingly) that I can't append to a sofa documenttext ("Data for
> > Sofa feature setLocalSofaData() has already been set." - not sure about
> > the reason for this restriction).
> >
> 
> The Sofa data for a view is immutable, otherwise existing annotations
> could become invalid.

  But in my case, I'd only append to the end, so this concern is moot.

  It's rather easy anyway to make your annotations go invalid if you use
CasCopier a bit.

> >   I think the only choice except downright unmaintainable hacks (like
> > programmatically generated M views) is to just give up on preserving my
> > annotations and carry over just the sentence texts.  Am I missing
> > something?
> >
> 
> Creating a new view in the intermediate CAS for each of the N input CASes
> would work. A new output CAS Sofa would then be composed of data from
> multiple views, with the annotation begin and end offsets adjusted as the
> annotations are added to the new output CAS.

  I guess that .getViewIterator() would make this not so frustrating,
so I'll try this route, thanks for the tip!

> One problem there is that the intermediate CAS would continue to grow
> in size, so there would need to be some point when it could be reset.

  Indeed, well, when you output all M CASes is a good point.
I assume .release() would accomplish this.

> >   (I'm somewhat tempted to cut my losses short (much too late) and
> > abandon UIMA flow control altogether, using only simple pipelines and
> > having custom glue code to connect these together, as it seems like
> > getting the flow to work in interesting cases is a huge time sink and in
> > retrospect, it could never pay off any abstract advantage of easier
> > distributed processing (where you probably end up having to chop up the
> > pipeline manually anyway).  I would probably never recommend new UIMA
> > users to strive for a single pipeline with CAS multipliers/mergers and
> > begin to consider these features an evolutionary dead end rather than
> > advantageous.  Not sure if there even *are* any other real users using
> > advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> > on this!)
> >
> >
> Definitely the advantage to encapsulating analytics in standard UIMA
> components is easy scalability via the vertical and horizontal scale out
> options offered by UIMA-AS and DUCC. Flexibility in chopping up a
> pipeline into services as needed is another advantage.

  But as far as I understand, you need to explicitly define and deploy
AEs that are to be run on different machines anyway.  So I'm not sure if
the extra value is really that large in the end?

> The previously mentioned GALE multimodal application also converted
> sequences of N input CASes to M output CASes. In that case the input
> CASes represented two minutes' worth of speech-to-text transcription of
> broadcast news, and each output CAS represented a single news story.
> The story-CASes then went through a pipeline that identified the story and
> updated a pre-existing summarization for each story.

  Interesting (and good to hear), thanks!

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: CAS merger/multiplier N:M mapping

Posted by Eddie Epstein <ea...@gmail.com>.
Hi Petr

On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis <pa...@ucw.cz> wrote:

>   Hi!
>
>   I'm currently struggling to perform a complex flow transformation with
> UIMA.  I have multiple (N) CASes with some fulltext search results.
> I chop these search results into sentences and would like to pick the top
> M sentences from the collected search results and build CASes from them
> for further analysis.  So, I'd like to copy subsets (of both document text
> and annotations) of N input CASes to M output CASes.  I don't
> know how to do this technically.  I tried two non-workable ideas so far:
>
>   (i) Keep around references to the respective views of input CASes
> and use them as CasCopier sources when the time comes to produce
> the new CASes.  Turns out the input CASes are (unsurprisingly) recycled
> and the references I kept around at process() time aren't valid when
> next() is called much later.
>
>   (ii) Use an internal "intermediary" CAS instance in process() to which
> I append my sentences, then use it as a source of output CASes.  Turns
> out (surprisingly) that I can't append to a sofa documenttext ("Data for
> Sofa feature setLocalSofaData() has already been set." - not sure about
> the reason for this restriction).
>

The Sofa data for a view is immutable, otherwise existing annotations
could become invalid.
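
One way to live with this restriction is to collect the sentence texts
in an ordinary buffer first, remember where each one will land, and set
the document text only once at the end. The following is a minimal
sketch in plain Java with no UIMA dependency. The class and method names
(SofaTextBuffer, append, freeze) are hypothetical and only model the
idea; in a real annotator the frozen string would be handed once to the
view's setDocumentText().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: accumulates text first, exposes it as one fixed string later. */
class SofaTextBuffer {
    private final StringBuilder text = new StringBuilder();
    private final List<int[]> spans = new ArrayList<>();
    private boolean frozen = false;

    /** Appends one sentence and returns its {begin, end} offsets in the final text. */
    int[] append(String sentence) {
        if (frozen) {
            // Models the "has already been set" behaviour: no changes after freezing.
            throw new IllegalStateException("text has already been set");
        }
        if (text.length() > 0) {
            text.append('\n'); // separator between sentences
        }
        int begin = text.length();
        text.append(sentence);
        int[] span = {begin, text.length()};
        spans.add(span);
        return span;
    }

    /** Freezes the buffer and returns the complete document text. */
    String freeze() {
        frozen = true;
        return text.toString();
    }

    /** Offsets of all appended sentences, in insertion order. */
    List<int[]> spans() {
        return spans;
    }
}
```

The recorded offsets can then be used to create sentence annotations
after the text has been set, so nothing ever points into text that
changes later.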


>
>   I think the only choice except downright unmaintainable hacks (like
> programmatically generated M views) is to just give up on preserving my
> annotations and carry over just the sentence texts.  Am I missing
> something?
>

Creating a new view in the intermediate CAS for each of the N input CASes
would work. A new output CAS Sofa would then be composed of data from
multiple views, with the annotation begin and end offsets adjusted as the
annotations are added to the new output CAS.
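
The offset adjustment can be sketched roughly as follows, again in plain
Java without UIMA types. Span, Pick and Merged are hypothetical
stand-ins for an annotation, a sentence picked from one view, and the
new output Sofa; the newline separator is an arbitrary choice. Each
sentence is shifted by the difference between its start in the output
text and its start in the source view, and annotations inside it are
shifted by the same amount.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for view text and annotations; these are not UIMA types. */
class OffsetRemapSketch {
    /** A typed span over some text, like an annotation's begin/end offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** One picked sentence: the source view's text, the sentence span, and spans within it. */
    static final class Pick {
        final String sourceText;
        final Span sentence;
        final List<Span> inner;
        Pick(String sourceText, Span sentence, List<Span> inner) {
            this.sourceText = sourceText;
            this.sentence = sentence;
            this.inner = inner;
        }
    }

    /** The merged text plus all spans expressed in the merged text's coordinates. */
    static final class Merged {
        final String text;
        final List<Span> spans;
        Merged(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Merged merge(List<Pick> picks) {
        StringBuilder out = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (Pick p : picks) {
            if (out.length() > 0) {
                out.append('\n');
            }
            // Shift = sentence start in the output minus sentence start in the source.
            int shift = out.length() - p.sentence.begin;
            out.append(p.sourceText, p.sentence.begin, p.sentence.end);
            spans.add(shifted(p.sentence, shift));
            for (Span s : p.inner) {
                // Keep only spans that lie fully inside the picked sentence.
                if (s.begin >= p.sentence.begin && s.end <= p.sentence.end) {
                    spans.add(shifted(s, shift));
                }
            }
        }
        return new Merged(out.toString(), spans);
    }

    private static Span shifted(Span s, int shift) {
        return new Span(s.type, s.begin + shift, s.end + shift);
    }
}
```

Annotations that cross a sentence boundary are simply dropped in this
sketch; a real component would have to decide whether to clip or skip
them.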

One problem there is that the intermediate CAS would continue to grow
in size, so there would need to be some point when it could be reset.
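
A rough model of that lifecycle, as a sketch only: data is copied out of
each input as it arrives, the top M items are selected once the last
input has been seen, and the buffer is cleared after the last output has
been handed over. The class below is plain Java and hypothetical
throughout; the method names merely echo the process()/next() calls
discussed in this thread, and the isLastInput flag is an assumption
about how the component learns that no more inputs will follow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of a merger/multiplier: N process() calls in, up to M results out. */
class TopSentenceBuffer {
    static final class Scored {
        final String text;
        final double score;
        Scored(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    private final int m;
    private final List<Scored> buffer = new ArrayList<>();
    private final ArrayDeque<Scored> ready = new ArrayDeque<>();

    TopSentenceBuffer(int m) {
        this.m = m;
    }

    /** Copies the sentences out of the current input; no reference to the input is kept. */
    void process(List<Scored> sentences, boolean isLastInput) {
        for (Scored s : sentences) {
            buffer.add(new Scored(s.text, s.score));
        }
        if (isLastInput) {
            List<Scored> ranked = new ArrayList<>(buffer);
            ranked.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
            ready.addAll(ranked.subList(0, Math.min(m, ranked.size())));
            if (ready.isEmpty()) {
                buffer.clear(); // nothing to emit, so reset right away
            }
        }
    }

    boolean hasNext() {
        return !ready.isEmpty();
    }

    /** Hands out the next result; the buffer is reset once the last one is gone. */
    Scored next() {
        Scored out = ready.removeFirst();
        if (ready.isEmpty()) {
            buffer.clear(); // reset point: all outputs have been produced
        }
        return out;
    }

    int bufferedCount() {
        return buffer.size();
    }
}
```

Keeping the buffer alive until the final next() mirrors the case where
the intermediate CAS is still needed as a copy source while outputs are
being built.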


>
>   (I'm somewhat tempted to cut my losses short (much too late) and
> abandon UIMA flow control altogether, using only simple pipelines and
> having custom glue code to connect these together, as it seems like
> getting the flow to work in interesting cases is a huge time sink and in
> retrospect, it could never pay off any abstract advantage of easier
> distributed processing (where you probably end up having to chop up the
> pipeline manually anyway).  I would probably never recommend new UIMA
> users to strive for a single pipeline with CAS multipliers/mergers and
> begin to consider these features an evolutionary dead end rather than
> advantageous.  Not sure if there even *are* any other real users using
> advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> on this!)
>
>
Definitely the advantage to encapsulating analytics in standard UIMA
components is easy scalability via the vertical and horizontal scale out
options offered by UIMA-AS and DUCC. Flexibility in chopping up a
pipeline into services as needed is another advantage.

The previously mentioned GALE multimodal application also converted
sequences of N input CASes to M output CASes. In that case the input
CASes represented two minutes' worth of speech-to-text transcription of
broadcast news, and each output CAS represented a single news story.
The story-CASes then went through a pipeline that identified the story and
updated a pre-existing summarization for each story.

Eddie
