Posted to user@oozie.apache.org by Per Ullberg <pe...@klarna.com> on 2016/06/01 09:19:01 UTC

Re: Limit the number of concurrent containers for an oozie workflow

If I understand that feature correctly, it would only limit one sqoop job to
a certain number of mappers. I want to cap multiple concurrent sqoop jobs
to a total number of mappers.
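
To make the distinction concrete, here is a minimal sketch of what a per-job cap can and cannot do. It assumes Hadoop 2.7.0 or later, where MAPREDUCE-5583 added the `mapreduce.job.running.map.limit` property (the cluster in this thread runs 2.6.0), and it splits a hypothetical connection budget statically across the forked jobs. The helper names, table names, and budget are illustrative assumptions, not part of Sqoop or Oozie, and the connection options a real import needs are left out.

```python
from typing import List

# Per-job limit on simultaneously running map tasks (from MAPREDUCE-5583).
MAP_LIMIT_PROPERTY = "mapreduce.job.running.map.limit"


def per_job_map_limit(connection_budget: int, concurrent_jobs: int) -> int:
    """Largest per-job limit whose sum over all jobs stays within the budget."""
    if concurrent_jobs < 1:
        raise ValueError("need at least one job")
    limit = connection_budget // concurrent_jobs
    if limit < 1:
        raise ValueError("budget too small to give every job one mapper")
    return limit


def sqoop_import_args(table: str, map_limit: int) -> List[str]:
    """Build Sqoop arguments; generic -D options go right after the tool name.

    Connection options (--connect and so on) are omitted in this sketch.
    """
    return [
        "import",
        "-D", f"{MAP_LIMIT_PROPERTY}={map_limit}",
        "--table", table,
    ]


if __name__ == "__main__":
    tables = ["customers", "orders", "payments"]  # hypothetical table names
    limit = per_job_map_limit(connection_budget=20, concurrent_jobs=len(tables))
    for table in tables:
        print("sqoop " + " ".join(sqoop_import_args(table, limit)))
```

The split is static: a job that finishes early leaves its share of the budget unused, because nothing redistributes it to the jobs still running. That is the gap between a per-job limit and a cap shared across jobs.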

regards
/Pelle

On Fri, May 27, 2016 at 8:08 AM, Harsh J <ha...@cloudera.com> wrote:

> Perhaps the feature of
> https://issues.apache.org/jira/browse/MAPREDUCE-5583 is
> what you are looking for.
>
> On Fri, 27 May 2016 at 00:04 Per Ullberg <pe...@klarna.com> wrote:
>
> > The fair scheduler would solve this issue, but we need the capacity
> > scheduler for other reasons. Would it be possible to run multiple
> > schedulers in parallel?
> >
> > /Pelle
> >
> > On Thursday, May 26, 2016, David Morel <dm...@amakuru.net> wrote:
> >
> > > On 26 May 2016 9:04 AM, "Per Ullberg" <per.ullberg@klarna.com> wrote:
> > > >
> > > > The split is skewed. Just running one sqoop action will cause some
> > > > containers to finish early and others to finish late. If we run the
> > > > actions sequentially, the early finishers will be idle until all
> > > > containers for that action are done and the next action can commence.
> > > > By running the actions in parallel, we will finish earlier in total
> > > > and also utilize our cluster resources better.
> > >
> > > I used the FairScheduler for exactly this scenario at my previous job.
> > >
> > > David
> > >
> > > > regards
> > > > /Pelle
> > > >
> > > > On Thu, May 26, 2016 at 3:09 AM, Robert Kanter <rkanter@cloudera.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > If you want to only run one of the Sqoop Actions at a time, why not
> > > > > simply remove the fork and run the Sqoop Actions sequentially?
> > > > >
> > > > > - Robert
> > > > >
> > > > > On Tue, May 3, 2016 at 12:15 AM, Per Ullberg <per.ullberg@klarna.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We have an oozie workflow that imports data table by table from an
> > > > > > RDBMS using sqoop, one action per table. The sqoop commands use
> > > > > > "split by column" and spread out over a number of mappers.
> > > > > >
> > > > > > We fork all the actions, so basically all sqoop jobs are launched at
> > > > > > once.
> > > > > >
> > > > > > The RDBMS can only accept a fixed number of connections, and if this
> > > > > > is exceeded, the sqoop action will fail and eventually the whole
> > > > > > oozie workflow will fail.
> > > > > >
> > > > > > We use the yarn capacity scheduler (2.6.0) and have set up a specific
> > > > > > queue for this job to throttle the maximum number of concurrent
> > > > > > containers. However, this setup is hard to manage because all
> > > > > > configurations in the capacity scheduler are relative to the total
> > > > > > number of vcores in the cluster. As we add machines or otherwise tune
> > > > > > the cluster, the actual number of containers granted to the oozie job
> > > > > > changes, and at times we hit the connection ceiling.
> > > > > >
> > > > > > So, is there another way to throttle the number of concurrent
> > > > > > containers for an oozie job? I guess you would have to be able to
> > > > > > throttle both launchers and map-reduce containers?
> > > > > >
> > > > > > best regards
> > > > > > /Pelle
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > *Per Ullberg*
> > > > > > Tech Lead
> > > > > > Odin - Uppsala
> > > > > >
> > > > > > Klarna AB
> > > > > > Sveavägen 46, 111 34 Stockholm
> > > > > > Tel: +46 8 120 120 00
> > > > > > Reg no: 556737-0431
> > > > > > klarna.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *Per Ullberg*
> > > > Tech Lead
> > > > Odin - Uppsala
> > > >
> > > > Klarna AB
> > > > Sveavägen 46, 111 34 Stockholm
> > > > Tel: +46 8 120 120 00
> > > > Reg no: 556737-0431
> > > > klarna.com
> > >
> >
> >
> > --
> >
> > *Per Ullberg*
> > Tech Lead
> > Odin - Uppsala
> >
> > Klarna AB
> > Sveavägen 46, 111 34 Stockholm
> > Tel: +46 8 120 120 00
> > Reg no: 556737-0431
> > klarna.com
> >
>



-- 

*Per Ullberg*
Tech Lead
Odin - Uppsala

Klarna AB
Sveavägen 46, 111 34 Stockholm
Tel: +46 8 120 120 00
Reg no: 556737-0431
klarna.com

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Per Ullberg <pe...@klarna.com>.
A shared connection pool in sqoop would do the trick for me, but looking
through that codebase, it does not appear to have a pluggable connection
pool API. A connection proxy might work, but that assumes sqoop actions could
wait indefinitely for a connection. I have not delved further into that
subject yet.
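
As a rough illustration of the connection proxy idea, the sketch below gates access to a fixed number of connection slots and makes callers give up after a timeout instead of waiting forever. It is not Sqoop code: `ConnectionGate`, `run_demo`, and all the numbers are hypothetical, and everything runs inside one process. A real proxy would have to be reachable over the network by mappers on different nodes, and a mapper blocked on a slot would still occupy its YARN container while it waits.

```python
import threading
import time
from contextlib import contextmanager


class ConnectionGate:
    """Caps how many callers may hold a connection slot at the same time."""

    def __init__(self, max_connections: int, wait_timeout_s: float):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._wait_timeout_s = wait_timeout_s

    @contextmanager
    def slot(self):
        # Block until a slot frees up, but give up after the timeout so a
        # caller never waits forever.
        if not self._slots.acquire(timeout=self._wait_timeout_s):
            raise TimeoutError("no connection slot became free in time")
        try:
            yield
        finally:
            self._slots.release()


def run_demo(workers: int = 8, max_connections: int = 3) -> int:
    """Start more workers than slots; return the peak number holding a slot."""
    gate = ConnectionGate(max_connections, wait_timeout_s=5.0)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        with gate.slot():
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for the time spent using a connection
            with lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print("peak concurrent slots:", run_demo())
```

The timeout is the design choice that matters here: without it, a stuck holder would stall every waiting action, which is the "wait indefinitely" assumption mentioned above.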

/Pelle

On Wed, Jun 1, 2016 at 11:31 AM, Harsh J <ha...@cloudera.com> wrote:

> Given each job is an independent application on YARN, there's no way to do
> that outside of a Scheduler level config.

Re: Limit the number of concurrent containers for an oozie workflow

Posted by Harsh J <ha...@cloudera.com>.
Given each job is an independent application on YARN, there's no way to do
that outside of a scheduler-level config.
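
For readers wondering why the scheduler-level approach was hard to manage in the original question, here is a small arithmetic sketch. It assumes a deliberately simplified model in which a queue's ceiling is a percentage of the cluster's vcores (as with the Capacity Scheduler's `yarn.scheduler.capacity.<queue-path>.maximum-capacity` setting) and every container uses the same number of vcores. It ignores memory, user limits, and the configured resource calculator. The function names and cluster sizes are hypothetical.

```python
def containers_for_queue(cluster_vcores: int, max_capacity_pct: float,
                         vcores_per_container: int = 1) -> int:
    """Containers a queue may hold when its ceiling is a percentage of the cluster."""
    queue_vcores = int(cluster_vcores * max_capacity_pct / 100)
    return queue_vcores // vcores_per_container


def pct_for_container_cap(cluster_vcores: int, container_cap: int,
                          vcores_per_container: int = 1) -> float:
    """Percentage to configure so the queue holds at most `container_cap` containers."""
    return 100.0 * container_cap * vcores_per_container / cluster_vcores


if __name__ == "__main__":
    # The same 10% ceiling admits more containers as the cluster grows.
    for cluster_vcores in (200, 240, 320):
        allowed = containers_for_queue(cluster_vcores, max_capacity_pct=10)
        print(f"{cluster_vcores} vcores at 10% -> {allowed} containers")

    # Holding the cap at 20 containers means recomputing the percentage
    # every time the cluster size changes.
    for cluster_vcores in (200, 240, 320):
        pct = pct_for_container_cap(cluster_vcores, container_cap=20)
        print(f"{cluster_vcores} vcores, cap 20 -> configure {pct:.2f}%")
```

Under this model, a percentage that matched the database's connection limit on a 200-vcore cluster would admit more containers, and therefore more connections, once the cluster grows to 320 vcores unless the percentage is lowered by hand.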

On Wed, 1 Jun 2016 at 14:49 Per Ullberg <pe...@klarna.com> wrote:

> If I understand that feature correctly it would only limit one sqoop job to
> a certain number of mappers. I want to cap multiple concurrent sqoop jobs
> to a total number of mappers.