Posted to user@oozie.apache.org by Mike Grimes <ma...@mtu.edu> on 2016/04/13 23:16:22 UTC

Running SparkSubmit on certain nodes

Hi,

Apologies if this has been answered before - I did some searching but
wasn't able to come up with anything helpful.

I'm attempting to run a PySpark job in yarn-cluster mode using Oozie on a
medium-sized cluster, where Spark is only installed on the master node.
Normally the SparkSubmit class, when used in this mode, handles
distributing work to the slave nodes via YARN.
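
For reference, the workflow action looks roughly like the sketch below
(the action name, script path, and spark-opts are placeholder values,
not the exact ones from my workflow):

<action name="pyspark-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- yarn-cluster: the Spark driver runs in a YARN container -->
        <master>yarn-cluster</master>
        <name>my-pyspark-job</name>
        <!-- for a PySpark job, the <jar> element points at the .py script -->
        <jar>${nameNode}/user/hadoop/apps/my_job.py</jar>
        <spark-opts>--num-executors 4 --executor-memory 2G</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>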

My issue stems from the fact that when running such a Spark job via
Oozie, the container launched for SparkSubmit runs on a random slave
node, which assumes that Spark (and therefore the necessary
configuration, such as spark-defaults.conf and spark-env.sh) is
installed locally, leading to issues like the one seen in
https://issues.apache.org/jira/browse/OOZIE-2482.

Is this an issue anyone else has run into before? Is there work being
done in Oozie to address this assumption, which places unnecessary
constraints on the user? I have yet to come across a method of ensuring
a specific job gets launched on a particular node, or seen any
documentation referring to this problem.

If I can provide more information, please don't hesitate to ask.

Best,

Re: Running SparkSubmit on certain nodes

Posted by Jaydeep Vishwakarma <ja...@inmobi.com>.
If you are running the job using the spark action, the launcher will be
scheduled on an arbitrary node. At the same time, the spark action uses
the Spark libraries to run the Spark job, and those libraries contain
the full set of jars essential to run any Spark job.

If you want to run the job on a specific node, I would recommend running
it with an ssh action that invokes the spark-submit command.
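
For example, a minimal sketch of such an ssh action (the host, user, and
script path below are placeholders you would replace with your own
values):

<action name="spark-submit-on-master">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <!-- the command runs on this host, where Spark is installed -->
        <host>hadoop@master-node.example.com</host>
        <command>spark-submit</command>
        <args>--master</args>
        <args>yarn-cluster</args>
        <args>/home/hadoop/apps/my_job.py</args>
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="fail"/>
</action>

The ssh action always executes its command on the host you specify, so
spark-submit is launched from that node regardless of where the Oozie
launcher container happens to run.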

Regards,
Jaydeep

Re: Running SparkSubmit on certain nodes

Posted by "Grimes, Michael" <gr...@amazon.com>.
Jaydeep,

Yes, this is using the spark action.

Re: Running SparkSubmit on certain nodes

Posted by Jaydeep Vishwakarma <ja...@inmobi.com>.
Are you using spark-action to run your spark job?
