Posted to user@uima.apache.org by Nicolas Hernandez <ni...@gmail.com> on 2013/11/14 18:19:09 UTC

Good practice for using and saving a resource built by previous annotators

Dear All

Let's say I want to count the occurrences of each word in a document
collection and then use these counters (possibly in the same workflow).
I am in a situation where I have one CAS per document and I want to
scale out the workflow.

To scale out the workflow, I use a resource to store the counter for
each word. The resource is accessed in write mode by several
instances of an annotator which process distinct CASes in parallel.
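
For illustration, a minimal sketch of what such a counting resource
could look like with uimaFIT, assuming everything runs in a single JVM
(WordCountResource, WordCounterAnnotator and the Token type are
placeholder names, not existing components):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

import org.apache.uima.resource.Resource_ImplBase;

// Placeholder shared resource: one thread-safe counter per word, so that
// several annotator instances can update it in parallel.
public class WordCountResource extends Resource_ImplBase {
  private final ConcurrentMap<String, LongAdder> counts = new ConcurrentHashMap<>();

  public void increment(String word) {
    counts.computeIfAbsent(word, k -> new LongAdder()).increment();
  }

  public long getCount(String word) {
    LongAdder adder = counts.get(word);
    return adder == null ? 0 : adder.sum();
  }
}

and a counting annotator that writes to it:

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ExternalResource;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

// Placeholder annotator; Token stands for whatever token type the
// pipeline's type system provides.
public class WordCounterAnnotator extends JCasAnnotator_ImplBase {
  public static final String RES_COUNTS = "counts";

  @ExternalResource(key = RES_COUNTS)
  private WordCountResource counts;

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    for (Token token : JCasUtil.select(jcas, Token.class)) {
      counts.increment(token.getCoveredText());
    }
  }
}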

Here are my questions:
* I believe I cannot be sure that, when a subsequent annotator in the
same workflow uses the resource, the resource is no longer being
modified (by counter annotators that are still processing the
remaining CASes). Right? In other words, I have no way to run (or to
delay the run of) an annotator depending on the state of a resource?
* So I may use two workflows: one to build the resource, the other one
to use it. But how can I export/save the resource? I cannot access
the resource in the collectionProcessComplete method of an AE, can I?

The solution I imagine was inspired by the use of a CAS multiplier to
merge CASes. It is to use two workflows, one of them dedicated to
building the resource. In that workflow, I define an annotator (without
scaling out, so effectively a CAS consumer). In that annotator, I check
the SourceDocumentInformation feature structure in the CAS to see if its
lastSegment feature is set to true; in that case I can export the
resource. I know this is not a guarantee that all CASes have been
processed. I may also keep a special counter resource in that
annotator to count the processed CASes and export the desired
resource once all CASes have been processed. In that case, I would
need a way to communicate to the "exporter" annotator the number of
CASes that will be processed... but this is not the main problem.
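
A rough sketch of that lastSegment check, assuming the collection
reader fills in SourceDocumentInformation the way the UIMA example
readers do (exportCounts() is a hypothetical save method on the
counting resource sketched above):

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.examples.SourceDocumentInformation;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ExternalResource;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

// Non-scaled "exporter" annotator: exports the resource when a CAS with
// lastSegment == true comes through. As said above, this does not
// guarantee that every other CAS has already been processed.
public class CountExporter extends JCasAnnotator_ImplBase {

  @ExternalResource(key = "counts")
  private WordCountResource counts;  // placeholder resource

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    for (SourceDocumentInformation sdi : JCasUtil.select(jcas, SourceDocumentInformation.class)) {
      if (sdi.getLastSegment()) {
        counts.exportCounts();       // hypothetical export/save method
      }
    }
  }
}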

After writing that, I realize that to do it in a single workflow, I
could have written a CAS multiplier that holds each CAS until all have
been processed, and then emits again as many CASes as were held...

These solutions are very complex...

Any suggestions...? A uimaFIT trick? =)

Thanks for your ideas

/Nicolas

Re: Good practice for using and saving a resource built by previous annotators

Posted by Richard Eckart de Castilho <re...@apache.org>.
If I understood UIMA-AS right, when scaling out across multiple machines, each
machine has its own instance of the shared resource. So it would either be
necessary to manually synchronize the resource across the cluster, to aggregate the
counts from all machines manually in the end, or to make sure that at least all CASes
are routed through one counting instance in the end.
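
(To illustrate the "aggregate the counts manually" option: if each
UIMA-AS instance writes out its own count map, e.g. one serialized file
per instance, merging them afterwards is plain Java; the names below
are only illustrative.)

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountMerger {

  // Sum the per-instance word counts into a single map.
  public static Map<String, Long> merge(List<Map<String, Long>> perInstanceCounts) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> counts : perInstanceCounts) {
      counts.forEach((word, n) -> total.merge(word, n, Long::sum));
    }
    return total;
  }
}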

-- Richard

On 14.11.2013, at 21:23, Nicolas Hernandez <ni...@gmail.com> wrote:

> Indeed, you can access a shared resource in the
> collectionProcessComplete method. I wonder why I thought I could not.
> 
> So I was talking about shared resources and using UIMA-AS to scale out.
> 
> Thanks Richard for your answer

Re: Good practice for using and saving a resource built by previous annotators

Posted by Nicolas Hernandez <ni...@gmail.com>.
Indeed, you can access a shared resource in the
collectionProcessComplete method. I wonder why I thought I could not.

So I was talking about shared resources and using UIMA-AS to scale out.

Thanks Richard for your answer

-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67

Re: Good practice for using and saving a resource built by previous annotators

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 14.11.2013, at 18:19, Nicolas Hernandez <ni...@gmail.com> wrote:

> Dear All
> 
> Let's say I want to count the occurrences of each word in a document
> collection and then use these counters (possibly in the same workflow).
> I am in a situation where I have one CAS per document and I want to
> scale out the workflow.

How do you scale it out?

> To scale out the workflow, I use a resource to store the counter for
> each word. The resource is accessed in write mode by several
> instances of an annotator which process distinct CASes in parallel.

What kind of resource do you use?

> Here are my questions:
> * I believe I cannot be sure that, when a subsequent annotator in the
> same workflow uses the resource, the resource is no longer being
> modified (by counter annotators that are still processing the
> remaining CASes). Right? In other words, I have no way to run (or to
> delay the run of) an annotator depending on the state of a resource?

You can customize the flow by writing your own flow controller.
But whether that is supported depends on how you do your scaling.
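
(For reference, a minimal custom flow controller sketch in plain UIMA;
"countingAnnotator" and "consumingAnnotator" are placeholder delegate
keys, and the routing here is just a fixed order to show the API.)

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.flow.CasFlowController_ImplBase;
import org.apache.uima.flow.CasFlow_ImplBase;
import org.apache.uima.flow.FinalStep;
import org.apache.uima.flow.Flow;
import org.apache.uima.flow.SimpleStep;
import org.apache.uima.flow.Step;

// Routes every CAS through the delegates in a fixed order.
public class FixedOrderFlowController extends CasFlowController_ImplBase {

  @Override
  public Flow computeFlow(CAS cas) throws AnalysisEngineProcessException {
    return new FixedOrderFlow();
  }

  static class FixedOrderFlow extends CasFlow_ImplBase {
    private final String[] keys = { "countingAnnotator", "consumingAnnotator" };
    private int next = 0;

    @Override
    public Step next() {
      if (next < keys.length) {
        return new SimpleStep(keys[next++]);
      }
      return new FinalStep();
    }
  }
}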

> * So I may use two workflows: one to build the resource, the other one
> to use it. But how can I export/save the resource? I cannot access
> the resource in the collectionProcessComplete method of an AE, can I?

I would personally use the two workflows. Why do you believe that you cannot
access the resource in collectionProcessComplete?

> The solution I imagine was inspired by the use of a CAS multiplier to
> merge CASes. It is to use two workflows, one of them dedicated to
> building the resource. In that workflow, I define an annotator (without
> scaling out, so effectively a CAS consumer). In that annotator, I check
> the SourceDocumentInformation feature structure in the CAS to see if its
> lastSegment feature is set to true; in that case I can export the
> resource. I know this is not a guarantee that all CASes have been
> processed. I may also keep a special counter resource in that
> annotator to count the processed CASes and export the desired
> resource once all CASes have been processed. In that case, I would
> need a way to communicate to the "exporter" annotator the number of
> CASes that will be processed... but this is not the main problem.
> 
> After writing that, I realize that to do it in a single workflow, I
> could have written a CAS multiplier that holds each CAS until all have
> been processed, and then emits again as many CASes as were held...
> 
> These solutions are very complex...
> 
> Any suggestions...? A uimaFIT trick? =)

Well, to do small-scale scaling using a CPE, I'd do this:

- build an aggregate which generates the word counts
- use a custom shared resource to do the counting
- in collectionProcessComplete, call some synchronized "save" method on the resource
- if "save" is called a second time, it does nothing

- build an aggregate which uses the word counts

Run both workflows, one after the other, using the CpePipeline of uimaFIT.
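
(A rough sketch of that recipe with uimaFIT 2.x; WordCountResource,
MyReader and the annotator classes are placeholders, and save() is
assumed to persist the counts to a file that the second workflow loads
again when it initializes.)

// In the shared counting resource: save exactly once, no matter how
// many annotator instances call it.
private boolean saved = false;

public synchronized void save() {
  if (saved) {
    return;                          // second and later calls do nothing
  }
  // ... write the counts to a file ...
  saved = true;
}

// In the counting annotator:
@Override
public void collectionProcessComplete() throws AnalysisEngineProcessException {
  counts.save();
}

// Running the two workflows one after the other:
ExternalResourceDescription countsRes =
    ExternalResourceFactory.createExternalResourceDescription(WordCountResource.class);

CpePipeline.runPipeline(
    CollectionReaderFactory.createReaderDescription(MyReader.class),
    AnalysisEngineFactory.createEngineDescription(WordCounterAnnotator.class,
        "counts", countsRes));

CpePipeline.runPipeline(
    CollectionReaderFactory.createReaderDescription(MyReader.class),
    AnalysisEngineFactory.createEngineDescription(WordCountUserAnnotator.class,
        "counts", countsRes));

Since each runPipeline call builds its own component instances, the two
workflows do not share the in-memory resource object; the first one has
to persist the counts and the second one has to load them.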

-- Richard