Posted to user@oozie.apache.org by Serega Sheypak <se...@gmail.com> on 2013/07/16 08:12:59 UTC

Don't understand how concurrency and throttle work. Need to materialize coord actions one by one

Hi, I have a huge amount of data partitioned by hour:
my/data/archive/yyyy/MM/dd/HH
The problem is that this data can't be processed in parallel.

For example, before I can process

my/data/archive/2013/07/16/01 I first need to process
my/data/archive/2013/07/16/00.

I've written a coordinator with these settings:
    <controls>
        <timeout>15</timeout>
        <concurrency>1</concurrency>
        <throttle>0</throttle>
    </controls>
I assumed that I would have one running materialization at a time and that the
next materialization wouldn't be created until the first one finished,
but that's not what happens.
The coordinator starts running and creates materializations with status
"READY". Then these materializations die because the first
materialization hasn't prepared the data for the next one.

I want my coordinator to materialize actions one by one. When the first one
is finished, then the next one can become READY and then RUNNING.

What am I doing wrong?

Re: Don't understand how concurrency and throttle work. Need to materialize coord actions one by one

Posted by Serega Sheypak <se...@gmail.com>.

I launched the scripts manually so that my analytical coordinator could start
from the past. Only the first two materializations have the derived data they
need to run. The third materialization is in WAITING state because the derived
dataset from the second materialization is missing.
Thanks, I'll try.
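
The dependency described above, where each hourly action waits for the derived
dataset written by the previous hour, could be declared roughly as follows.
This is only a sketch under stated assumptions: the coordinator name, the
my/data/derived path, the data-in and data-out names, and the ${startTime},
${endTime}, ${initialInstance}, ${nameNode} and ${workflowAppPath} properties
are all hypothetical and not taken from the original coordinator.
${initialInstance} is assumed to be at least one hour earlier than
${startTime}, so that the previous-hour instance exists for the first action.

```xml
<!-- Hypothetical coordinator: every name, path and property here is an assumption. -->
<coordinator-app name="hourly-archive-coord" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <timeout>15</timeout>
        <concurrency>1</concurrency>
        <throttle>1</throttle>
    </controls>
    <datasets>
        <!-- Derived data, one instance per hour (path is illustrative). -->
        <dataset name="derived" frequency="${coord:hours(1)}"
                 initial-instance="${initialInstance}" timezone="UTC">
            <uri-template>${nameNode}/my/data/derived/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- The action stays in WAITING until the previous hour's derived data exists. -->
        <data-in name="previousDerived" dataset="derived">
            <instance>${coord:current(-1)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <!-- The workflow is expected to write the derived data for the current hour. -->
        <data-out name="currentDerived" dataset="derived">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

With a declaration like this, an action whose previous-hour dataset is missing
stays in WAITING rather than becoming READY, which matches the behavior of the
third materialization described above.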


2013/7/16 Mona Chitnis <ch...@yahoo-inc.com>

> Hi,
>
> Concurrency = how many actions can be 'RUNNING' in parallel. Default value
> = 1
> Throttle = how many actions can be materialized and set to WAITING to
> check for dependencies, at the same time. Default value = 12
>
> So for your case, both values should be 1. You don't need to set
> concurrency explicitly since default is 1, but throttle you can change
> from 0 to 1.
>
> Question,
>
> If second action depended on output from first action, how did second
> action become 'READY' before the first action was done and succeeded?

Re: Don't understand how concurrency and throttle work. Need to materialize coord actions one by one

Posted by Mona Chitnis <ch...@yahoo-inc.com>.

Hi,

Concurrency = how many actions can be 'RUNNING' in parallel. Default value
= 1
Throttle = how many actions can be materialized and set to WAITING to
check for dependencies, at the same time. Default value = 12

So for your case, both values should be 1. You don't need to set
concurrency explicitly since the default is 1, but you can change throttle
from 0 to 1.
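
A minimal sketch of the adjusted controls block, assuming the timeout value
and the rest of the coordinator stay as in the original post:

```xml
<controls>
    <!-- Kept from the original post; Oozie interprets this value in minutes. -->
    <timeout>15</timeout>
    <!-- At most one action RUNNING at a time (1 is also the default). -->
    <concurrency>1</concurrency>
    <!-- At most one action materialized and WAITING on its dependencies. -->
    <throttle>1</throttle>
</controls>
```

One thing to check with these values: if a waiting action's input does not
appear within the timeout, the action may time out instead of running, so the
timeout probably needs to be longer than one action's processing time.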

Question:

If the second action depended on output from the first action, how did the
second action become 'READY' before the first action was done and succeeded?
