Posted to user@oozie.apache.org by Per Ullberg <pe...@klarna.com> on 2016/05/03 09:15:55 UTC

Limit the number of concurrent containers for an oozie workflow

Hi,

We have an oozie workflow that imports data table by table from an RDBMS
using sqoop, one action per table. Each sqoop command uses a split-by column
and spreads the import across a number of mappers.

We fork all the actions so basically all sqoop jobs are launched at once.
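
Roughly, the workflow is shaped like this (action and table names are made up
for illustration):

    <fork name="import-all">
        <path start="import-table-a"/>
        <path start="import-table-b"/>
        <!-- ... one path per table ... -->
    </fork>

    <action name="import-table-a">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${jdbcUrl} --table TABLE_A --split-by ID -m 8</command>
        </sqoop>
        <ok to="join-imports"/>
        <error to="fail"/>
    </action>

    <!-- the other import actions look the same -->
    <join name="join-imports" to="end"/>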

The RDBMS can only accept a fixed number of connections; if that limit is
exceeded, the sqoop action fails and eventually the whole oozie workflow
fails.

We use the yarn capacity scheduler (2.6.0) and have set up a dedicated queue
for this job to cap the maximum number of concurrent containers. However,
this setup is hard to manage: all capacity scheduler settings are relative to
the cluster's total vcores, so as we add machines or otherwise tune the
cluster, the actual number of containers granted to the oozie job changes and
at times we hit the connection ceiling.
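
To give an idea, the queue looks something like this in capacity-scheduler.xml
(queue name and percentages are made up); note that capacity and
maximum-capacity are percentages of the cluster, not absolute container counts:

    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default,sqoop-import</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.sqoop-import.capacity</name>
        <!-- percent of the parent queue, so the effective container count
             changes whenever the cluster grows or shrinks -->
        <value>10</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.sqoop-import.maximum-capacity</name>
        <value>15</value>
    </property>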

So, is there another way to throttle the number of concurrent containers
for an oozie job? I guess you would have to be able to throttle both
launchers and map-reduce containers?
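
For reference, pointing both the launcher and the map-reduce job that sqoop
submits at the same queue looks roughly like this in each action's
<configuration> block (the queue name is a placeholder):

    <configuration>
        <!-- queue for the single-map launcher job that oozie starts -->
        <property>
            <name>oozie.launcher.mapred.job.queue.name</name>
            <value>sqoop-import</value>
        </property>
        <!-- queue for the map-reduce job that sqoop itself submits -->
        <property>
            <name>mapred.job.queue.name</name>
            <value>sqoop-import</value>
        </property>
    </configuration>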

best regards
/Pelle


-- 

*Per Ullberg*
Tech Lead
Odin - Uppsala

Klarna AB
Sveavägen 46, 111 34 Stockholm
Tel: +46 8 120 120 00
Reg no: 556737-0431
klarna.com

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Per Ullberg <pe...@klarna.com>.
A shared connection pool in sqoop would do the trick for me, but looking
through that codebase it does not look like there is a pluggable connection
pool API. Maybe a connection proxy would work, but that assumes sqoop actions
could wait indefinitely for a connection. I have not dug further into that
yet.

/Pelle

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Harsh J <ha...@cloudera.com>.
Given that each job is an independent application on YARN, there's no way to do
that outside of a scheduler-level config.

On Wed, 1 Jun 2016 at 14:49 Per Ullberg <pe...@klarna.com> wrote:

> If I understand that feature correctly it would only limit one sqoop job to
> a certain number of mappers. I want to cap multiple concurrent sqoop jobs
> to a total number of mappers.
>
> regards
> /Pelle
>
> On Fri, May 27, 2016 at 8:08 AM, Harsh J <ha...@cloudera.com> wrote:
>
> > Perhaps the feature of
> > https://issues.apache.org/jira/browse/MAPREDUCE-5583 is
> > what you are looking for.
> >
> > On Fri, 27 May 2016 at 00:04 Per Ullberg <pe...@klarna.com> wrote:
> >
> > > The fair scheduler would solve this issue, but we need the capacity
> > > scheduler for other reason. Would it be possible to run multiple
> > schedulers
> > > in parallel?
> > >
> > > /Pelle
> > >
> > > On Thursday, May 26, 2016, David Morel <dm...@amakuru.net> wrote:
> > >
> > > > Le 26 mai 2016 9:04 AM, "Per Ullberg" <per.ullberg@klarna.com
> > > > <javascript:;>> a écrit :
> > > > >
> > > > > The split is skewed. Just running one sqoop action will cause some
> > > > > containers to finish early and others to finish late. If we run the
> > > > actions
> > > > > concurrently, the early finishers will be idle until all containers
> > for
> > > > > that action is done and the next action can commence. By running
> the
> > > > > actions in parallel, we will finish earlier in total and also
> utilize
> > > our
> > > > > cluster resources better.
> > > >
> > > > I used the FairScheduler for exactly this scenario at my previous
> job.
> > > >
> > > > David
> > > >
> > > > > regards
> > > > > /Pelle
> > > > >
> > > > > On Thu, May 26, 2016 at 3:09 AM, Robert Kanter <
> rkanter@cloudera.com
> > > > <javascript:;>>
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > If you want to only run one of the Sqoop Actions at a time, why
> not
> > > > simply
> > > > > > remove the fork and run the Sqoop Actions sequentially?
> > > > > >
> > > > > > - Robert
> > > > > >
> > > > > > On Tue, May 3, 2016 at 12:15 AM, Per Ullberg <
> > per.ullberg@klarna.com
> > > > <javascript:;>>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We have an oozie workflow that imports data table by table
> from a
> > > > RDBMS
> > > > > > > using sqoop. One action per table. The sqoop commands use
> "split
> > by
> > > > > > column"
> > > > > > > and spread out on a number of mappers.
> > > > > > >
> > > > > > > We fork all the actions so basically all sqoop jobs are
> launched
> > at
> > > > once.
> > > > > > >
> > > > > > > The RDBMS can only accept a fixed number of connections and if
> > this
> > > > is
> > > > > > > exceeded, the sqoop action will fail and eventually the whole
> > oozie
> > > > > > > workflow will fail.
> > > > > > >
> > > > > > > We use the yarn capacity scheduler (2.6.0) and have set up a
> > > specific
> > > > > > queue
> > > > > > > for this job to throttle the maximum number of concurrent
> > > containers.
> > > > > > > However, this setup is hard to manage because all
> configurations
> > in
> > > > the
> > > > > > > capacity scheduler are relative to the max amount of vcores of
> > the
> > > > > > cluster
> > > > > > > and as we add machines or otherwise tune the cluster, the
> actual
> > > > number
> > > > > > of
> > > > > > > containers granted to the oozie job changes and at times we hit
> > the
> > > > > > > connection roof.
> > > > > > >
> > > > > > > So, is there another way to throttle the number of concurrent
> > > > containers
> > > > > > > for an oozie job? I guess you would have to be able to throttle
> > > both
> > > > > > > launchers and map-reduce containers?
> > > > > > >
> > > > > > > best regards
> > > > > > > /Pelle
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > *Per Ullberg*
> > > > > > > Tech Lead
> > > > > > > Odin - Uppsala
> > > > > > >
> > > > > > > Klarna AB
> > > > > > > Sveavägen 46, 111 34 Stockholm
> > > > > > > Tel: +46 8 120 120 00
> > > > > > > Reg no: 556737-0431
> > > > > > > klarna.com
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Per Ullberg*
> > > > > Tech Lead
> > > > > Odin - Uppsala
> > > > >
> > > > > Klarna AB
> > > > > Sveavägen 46, 111 34 Stockholm
> > > > > Tel: +46 8 120 120 00
> > > > > Reg no: 556737-0431
> > > > > klarna.com
> > > >
> > >
> > >
> > > --
> > >
> > > *Per Ullberg*
> > > Tech Lead
> > > Odin - Uppsala
> > >
> > > Klarna AB
> > > Sveavägen 46, 111 34 Stockholm
> > > Tel: +46 8 120 120 00
> > > Reg no: 556737-0431
> > > klarna.com
> > >
> >
>
>
>
> --
>
> *Per Ullberg*
> Tech Lead
> Odin - Uppsala
>
> Klarna AB
> Sveavägen 46, 111 34 Stockholm
> Tel: +46 8 120 120 00
> Reg no: 556737-0431
> klarna.com
>

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Per Ullberg <pe...@klarna.com>.
If I understand that feature correctly, it would only limit a single sqoop job
to a certain number of mappers. What I want is to cap multiple concurrent sqoop
jobs to a total number of mappers.

regards
/Pelle

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Harsh J <ha...@cloudera.com>.
Perhaps the feature of https://issues.apache.org/jira/browse/MAPREDUCE-5583 is
what you are looking for.
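
If I recall correctly, that change adds per-job caps on the number of
simultaneously running map and reduce tasks (it shipped in Hadoop 2.7, so it
would need a newer cluster than 2.6.0). In a sqoop action it would be set
roughly like this (the limit value is just an example):

    <configuration>
        <property>
            <!-- at most 4 of this job's map tasks run at once; 0 means no limit -->
            <name>mapreduce.job.running.map.limit</name>
            <value>4</value>
        </property>
    </configuration>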

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Per Ullberg <pe...@klarna.com>.
The fair scheduler would solve this issue, but we need the capacity
scheduler for other reasons. Would it be possible to run multiple schedulers
in parallel?
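
Roughly what I had in mind is a fair-scheduler.xml queue along these lines,
since the fair scheduler can cap a queue in absolute terms instead of
percentages (queue name and numbers are made up):

    <allocations>
        <queue name="sqoop-import">
            <!-- absolute cap, independent of cluster size -->
            <maxResources>40960 mb,20 vcores</maxResources>
            <maxRunningApps>10</maxRunningApps>
        </queue>
    </allocations>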

/Pelle

Re: Limit the number of concurrent containers for an oozie workflow

Posted by David Morel <dm...@amakuru.net>.
On 26 May 2016 at 9:04 AM, "Per Ullberg" <pe...@klarna.com> wrote:
>
> The split is skewed. Just running one sqoop action will cause some
> containers to finish early and others to finish late. If we run the
> actions sequentially, the early finishers will be idle until all containers
> for that action are done and the next action can commence. By running the
> actions in parallel, we will finish earlier in total and also utilize our
> cluster resources better.

I used the FairScheduler for exactly this scenario at my previous job.

David

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Per Ullberg <pe...@klarna.com>.
The split is skewed. Just running one sqoop action will cause some
containers to finish early and others to finish late. If we run the actions
sequentially, the early finishers will be idle until all containers for
that action are done and the next action can commence. By running the
actions in parallel, we will finish earlier in total and also utilize our
cluster resources better.

regards
/Pelle

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Robert Kanter <rk...@cloudera.com>.
Hi,

If you want to only run one of the Sqoop Actions at a time, why not simply
remove the fork and run the Sqoop Actions sequentially?
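
That is, drop the fork/join and chain each action's ok transition to the next
one, along these lines (action names are made up):

    <action name="import-table-a">
        <!-- sqoop body as in the forked version -->
        <ok to="import-table-b"/>
        <error to="fail"/>
    </action>
    <action name="import-table-b">
        <!-- sqoop body as in the forked version -->
        <ok to="end"/>
        <error to="fail"/>
    </action>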

- Robert
