Posted to issues@spark.apache.org by "Joe Olson (JIRA)" <ji...@apache.org> on 2017/02/28 17:42:45 UTC
[jira] [Updated] (SPARK-19770) Running Example SparkPi Job on Mesos
[ https://issues.apache.org/jira/browse/SPARK-19770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joe Olson updated SPARK-19770:
------------------------------
Description:
I am trying to submit the example SparkPi job to a 9-node Spark cluster running on Mesos. My spark-submit statement:
{quote}
./bin/spark-submit \
--name "Test01:" \
--class org.apache.spark.examples.SparkPi \
--master mesos://<IP Address>:7078 \
--deploy-mode cluster \
--executor-memory 16G \
--executor-cores 1 \
--driver-cores 10 \
--driver-memory 48G \
--num-executors 1 \
file://mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar \
1000
{quote}
When I do this, the job completes successfully. I can go into the stdout file on the driver machine and see the "Pi is roughly..." output.
However, on most of the slave machines, if I go into the stderr file for that same job, I see the following exceptions:
{quote}
17/02/28 10:20:59 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to <machine name>/fe80:0:0:0:ec4:7aff:fea4:82e1%5:44791
{quote}
There appears to be an intermittent connectivity issue between some of the 9 nodes. It is not consistent across routes between machines (sometimes nodes #2 and #7 can talk to each other, sometimes they cannot).
How can I troubleshoot this? Network behavior seems normal otherwise.
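One clue worth noting: the address in the IOException, fe80:0:0:0:ec4:7aff:fea4:82e1, falls in the IPv6 link-local range (fe80::/10), and the "%5" suffix is an interface scope id. Link-local addresses are only reachable on the same physical link, which could explain intermittent failures between nodes. This can be confirmed with Python's standard-library ipaddress module (the scope id is dropped here, since it is not part of the address itself):

```python
import ipaddress

# Address taken from the stack trace, without the "%5" scope id.
addr = ipaddress.ip_address("fe80:0:0:0:ec4:7aff:fea4:82e1")

# fe80::/10 is the IPv6 link-local range, reachable only on the local link.
print(addr.is_link_local)  # True
```

If this is what the slave hostnames resolve to, executors would be advertising addresses that other nodes cannot route to.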
ALSO - sometimes in the logs I will see
{quote}
17/02/28 10:20:54 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
{quote}
However, my resource count in Mesos (via the UI) is accurate.
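A plausible cause, though only an assumption from the fe80:: address in the exception, is that each slave's hostname resolves to a link-local IPv6 interface, so executors advertise unreachable addresses to one another. A sketch of environment settings (set per node, with that node's real routable address substituted for the placeholder values) that force Spark to bind and advertise a routable IPv4 address:

```shell
# Sketch only: 192.168.1.11 and node2.example are placeholder values,
# not taken from the report above.
export SPARK_LOCAL_IP=192.168.1.11          # routable IPv4 address of this node
export SPARK_LOCAL_HOSTNAME=node2.example   # a name that resolves to that address
# Prefer the IPv4 stack in the JVM so resolution does not pick fe80:: addresses:
export SPARK_SUBMIT_OPTS="-Djava.net.preferIPv4Stack=true"
```

If this helps, the same effect can be made permanent via spark.driver.bindAddress and conf/spark-env.sh rather than per-shell exports.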
was:
I am trying to submit the example SparkPi job to a 9-node Spark cluster running on Mesos. My spark-submit statement:
./bin/spark-submit \
--name "Test01:" \
--class org.apache.spark.examples.SparkPi \
--master mesos://<IP Address>:7078 \
--deploy-mode cluster \
--executor-memory 16G \
--executor-cores 1 \
--driver-cores 10 \
--driver-memory 48G \
--num-executors 1 \
file://mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar \
1000
When I do this, the job completes successfully. I can go into the stdout file on the driver machine and see the "Pi is roughly..." output.
However, on most of the slave machines, if I go into the stderr file for that same job, I see the following exceptions:
17/02/28 10:20:59 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to <machine name>/fe80:0:0:0:ec4:7aff:fea4:82e1%5:44791
There appears to be an intermittent connectivity issue between some of the 9 nodes. It is not consistent across routes between machines (sometimes nodes #2 and #7 can talk to each other, sometimes they cannot).
How can I troubleshoot this? Network behavior seems normal otherwise.
ALSO - sometimes in the logs I'll see
17/02/28 10:20:54 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
However, my resource count in Mesos (via the UI) is accurate.
> Running Example SparkPi Job on Mesos
> ------------------------------------
>
> Key: SPARK-19770
> URL: https://issues.apache.org/jira/browse/SPARK-19770
> Project: Spark
> Issue Type: Question
> Components: Mesos, Spark Core, Spark Submit
> Affects Versions: 2.1.0
> Environment: spark-2.1.0-bin-hadoop2.3, mesos-1-1
> Reporter: Joe Olson
>
> I am trying to submit the example SparkPi job to a 9-node Spark cluster running on Mesos. My spark-submit statement:
> {quote}
> ./bin/spark-submit \
> --name "Test01:" \
> --class org.apache.spark.examples.SparkPi \
> --master mesos://<IP Address>:7078 \
> --deploy-mode cluster \
> --executor-memory 16G \
> --executor-cores 1 \
> --driver-cores 10 \
> --driver-memory 48G \
> --num-executors 1 \
> file://mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar \
> 1000
> {quote}
> When I do this, the job completes successfully. I can go into the stdout file on the driver machine and see the "Pi is roughly..." output.
> However, on most of the slave machines, if I go into the stderr file for that same job, I see the following exceptions:
> {quote}
> 17/02/28 10:20:59 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to <machine name>/fe80:0:0:0:ec4:7aff:fea4:82e1%5:44791
> {quote}
> There appears to be an intermittent connectivity issue between some of the 9 nodes. It is not consistent across routes between machines (sometimes nodes #2 and #7 can talk to each other, sometimes they cannot).
> How can I troubleshoot this? Network behavior seems normal otherwise.
> ALSO - sometimes in the logs I'll see
> {quote}
> 17/02/28 10:20:54 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
> {quote}
> However, my resource count in Mesos (via the UI) is accurate.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org