Posted to user@predictionio.apache.org by Shane Johnson <sh...@liftiq.com> on 2018/07/29 20:18:20 UTC

PIO Spark Application - Slaves are registered but slave executors are not being used?

Hi Team,

I am working through an issue and am hoping someone can point me in the
right direction or has encountered this already. I have set up a 2-node
Spark cluster as a test, with the Master and 1 Slave. I have
password-less SSH working and am able to run the sbin/start-all.sh command
to start the Master and the Slave.

config/masters

***.**.**.12

config/slaves

***.**.**.12
***.**.**.236
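As a quick sanity check, the standalone master's web UI (port 8080 by
default) lists each registered worker and its state. A minimal check from
the master, assuming that default port:

    # both workers should appear with state ALIVE in the master UI
    curl -s http://***.**.**.12:8080 | grep -i alive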

Everything looks like it is starting as expected, but when I submit the PIO
application I notice that only the executors on the Master (***.**.**.12),
i.e. the Driver machine, are actually running the tasks. The cluster UI
shows both slaves registered and Running, but the Application UI only shows
3 executors and 15 cores, vs. the 6 executors (3 from ***.**.**.12 and 3
from ***.**.**.236) and 30 cores I am expecting.

Here is the command to submit the PIO application:
pio train -- --master spark://ip-172-31-40-12.us-west-2.compute.internal:7077
--executor-cores 5 --executor-memory 19g --num-executors 6 --driver-memory
28g
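
(For the math: 6 executors x 5 cores each is the 30 cores I expect, while
the 3 executors x 5 cores the Application UI reports is the 15 cores I
actually see.)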


*Can someone help me find ways to troubleshoot this? Why are the executors
on the non-Master server (Slave 2 in the diagram) not doing any work even
though they appear to be registered and available?*

Thank you for your help! Here are a few screenshots of what I am seeing.

[screenshots omitted from the plain-text archive]
One thing I tried was to remove the Slave on the Master machine to try to
force the Slave machine (***.**.**.236) to engage, but then I got the
following error. Again, the non-master machine's executors look to be
registered and have resources, but they don't run in the application:

*Test to force non-master machine to run tasks:*

pio train -- --master
spark://ip-172-31-40-12.us-west-2.compute.internal:7077 --executor-cores 5
--executor-memory 19g --num-executors 3 --driver-memory 48g

[WARN] [TaskSchedulerImpl] Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources
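
(Side note: this warning can also mean the workers cannot reach the driver,
not just a resource shortage. A minimal connectivity check from the slave,
assuming the hostnames and default ports above:

    # from ***.**.**.236: is the master's RPC port reachable?
    nc -vz ip-172-31-40-12.us-west-2.compute.internal 7077

Executors also connect back to the driver on a random ephemeral port, so any
connection errors should show up in the executor logs under
$SPARK_HOME/work/ on the slave.)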

*Shane Johnson | LIFT IQ*
*Founder | CEO*

*www.liftiq.com <http://www.liftiq.com/>* or *shane@liftiq.com
<sh...@liftiq.com>*
mobile: (801) 360-3350
LinkedIn <https://www.linkedin.com/in/shanewjohnson/>  |  Twitter
<https://twitter.com/SWaldenJ> |  Facebook
<https://www.facebook.com/shane.johnson.71653>

Re: PIO Spark Application - Slaves are registered but slave executors are not being used?

Posted by Donald Szeto <do...@apache.org>.
Hey Shane,

Glad that you found a solution, and thanks for posting the answer.

Regards,
Donald


Re: PIO Spark Application - Slaves are registered but slave executors are not being used?

Posted by Shane Johnson <sh...@liftiq.com>.
Hi all,

Thanks for being a sounding board as I go through the learning curve of
moving into AWS and from a single machine to a cluster. I found the article
below and realized I made a rookie mistake. For anyone else who runs into
this thinking it is a resource problem or unregistered workers, it could
very well be the firewall. I was able to get the slaves to accept jobs by
opening up all traffic, which confirmed it was a firewall issue. Now I need
to figure out how to handle the random ports that Spark uses for the jobs.

https://stackoverflow.com/questions/36126513/spark-standalone-mode-not-distributing-job-to-other-worker-node
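
(A sketch for the random-port follow-up, assuming Spark 2.x standalone and
its standard properties; the port values are illustrative, not from this
thread. Pinning the normally-random ports lets the security group allow a
narrow range instead of all traffic:

    # conf/spark-defaults.conf
    spark.driver.port        40000
    spark.blockManager.port  40010
    spark.port.maxRetries    16

Each bind retry increments the port by one, so opening roughly 40000-40026
between the nodes covers the driver and block-manager traffic here.)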



Thanks again


*Shane Johnson | LIFT IQ*
*Founder | CEO*

*www.liftiq.com <http://www.liftiq.com/>* or *shane@liftiq.com
<sh...@liftiq.com>*
mobile: (801) 360-3350
LinkedIn <https://www.linkedin.com/in/shanewjohnson/>  |  Twitter
<https://twitter.com/SWaldenJ> |  Facebook
<https://www.facebook.com/shane.johnson.71653>


