Posted to user@spark.apache.org by Michael Pisula <mi...@tngtech.com> on 2016/01/07 22:24:32 UTC

Spark job uses only one Worker

Hi there,

I ran a simple batch application on a Spark cluster on EC2. Despite having 3
worker nodes, I could not get the application processed on more than one
node, regardless of whether I submitted the application in cluster or client mode.
I also tried manually increasing the number of partitions in the code, with no
effect, and I do pass the master into the application.
I verified on the nodes themselves that only one node was active while the
job was running.
I pass enough data to make the job take about 6 minutes to process.
The job is simple enough: it reads data from two S3 files, joins records on
a shared field, filters out some records and writes the result back to S3.
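
(For illustration, a rough sketch of a job of this shape - Spark 1.x RDD API;
paths, field names and partition counts are placeholders, not the actual code:)

import org.apache.spark.{SparkConf, SparkContext}

object StaticDataAnalysisSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StaticDataAnalysis"))

    // Read both inputs and key each record by the shared field.
    val left = sc.textFile("s3n://some-bucket/input-a", 24)   // 24 is only a minPartitions hint
      .map(_.split(","))
      .map(fields => (fields(0), fields))
    val right = sc.textFile("s3n://some-bucket/input-b", 24)
      .map(_.split(","))
      .map(fields => (fields(0), fields))

    // Join on the shared key, filter, repartition (forces a shuffle that can
    // spread the work across executors) and write the result back to S3.
    left.join(right)
      .filter { case (_, (a, b)) => a.length > 1 && b.length > 1 }
      .repartition(24)
      .map { case (key, (a, b)) => (key +: (a ++ b)).mkString(",") }
      .saveAsTextFile("s3n://some-bucket/output")

    sc.stop()
  }
}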

I tried all kinds of things, but could not make it work. I did find similar
questions, but had already tried the solutions that worked in those cases.
I would be really happy about any pointers.

Cheers,
Michael



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark job uses only one Worker

Posted by Michael Pisula <mi...@tngtech.com>.
All the workers were connected, and I even saw the job being processed on
different workers, so that part was working fine.
I will fire up the cluster again tomorrow and post the results of
connecting to port 7077 and of using --total-executor-cores 4.

Thanks for the help


-- 
Michael Pisula * michael.pisula@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Spark job uses only one Worker

Posted by Michael Pisula <mi...@tngtech.com>.
So, I tried it again.
For port 7077 I get this error:
16/01/08 18:59:58 WARN rest.RestSubmissionClient: Unable to connect to
server spark://ec2-54-229-68-25.eu-west-1.compute.amazonaws.com:7077.
Warning: Master endpoint
spark://ec2-54-229-68-25.eu-west-1.compute.amazonaws.com:7077 was not a
REST server. Falling back to legacy submission gateway instead.

As for --total-executor-cores: again, no effect.

See the attached screenshot: the topmost job lists only one core, while three
workers are running with 2 cores each.
This is the command I used:
spark/bin/spark-submit --num-executors 4 --class
com.tngtech.spark.StaticDataAnalysis --master spark://host:6066
--total-executor-cores 4 --deploy-mode cluster
demo/Demo-1.0-SNAPSHOT-all.jar
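
(As far as I understand, --total-executor-cores corresponds to the
spark.cores.max property, so as a cross-check the same limit could also be set
inside the application itself; a minimal sketch, values only illustrative:)

import org.apache.spark.{SparkConf, SparkContext}

object SubmitWithCoreCap {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StaticDataAnalysis")
      .set("spark.cores.max", "4")   // cap on total cores for the app, same as --total-executor-cores 4
    val sc = new SparkContext(conf)
    // ... rest of the job ...
    sc.stop()
  }
}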

Cheers,
Michael



-- 
Michael Pisula * michael.pisula@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Spark job uses only one Worker

Posted by Igor Berman <ig...@gmail.com>.
Do you see in the master UI that the workers are connected to the master, and
that before you run your app there are 2 available cores per worker in the
master UI? I understand that there are 2 cores on each worker - the question
is whether they get registered with the master.

Regarding the port, it's very strange; please post what the problem is when
connecting to 7077.

Use --total-executor-cores 4 in your submit.

If you can, post a master UI screenshot after you have submitted your app.
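
(In addition to the master UI, a rough way to double-check from the driver side
is to list the block managers that have registered; a minimal sketch you can
paste into spark-shell, where sc is already defined:)

// Give executors a moment to register with the driver, then list them.
// Each worker that contributed an executor shows up here, plus the driver itself.
Thread.sleep(10000)
sc.getExecutorMemoryStatus.keys.foreach(addr => println(s"block manager registered at: $addr"))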



Re: Spark job uses only one Worker

Posted by Prem Sure <pr...@gmail.com>.
To narrow it down, you can try the following:
1) Is the job going to the same node every time you execute it? Enable the
spark.speculation property and keep a Thread.sleep of about 2 minutes in the
job, then see whether the work moves to a worker other than the one it
initially landed on (this is to rule out connection or setup related issues) -
see the sketch after this list.
2) What is your spark.executor.memory? Try decreasing the executor memory to a
value smaller than the data size and see whether that helps distribute the work.
3) While launching the cluster, play around with the number of slaves, starting
with 1:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
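
(A minimal sketch of what point 1 could look like - Spark 1.x, all names and
values are only illustrative, not the actual job:)

import org.apache.spark.{SparkConf, SparkContext}

object SpeculationCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SpeculationCheck")
      .set("spark.speculation", "true")     // allow slow task attempts to be re-launched elsewhere
      .set("spark.executor.memory", "1g")   // point 2: try a value below the data size
    val sc = new SparkContext(conf)

    // Each task sleeps for ~2 minutes; with speculation enabled, re-launched
    // attempts may show up on other workers in the application UI.
    sc.parallelize(1 to 8, 8).map { i =>
      Thread.sleep(2 * 60 * 1000)
      i
    }.count()

    sc.stop()
  }
}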


Re: Spark job uses only one Worker

Posted by Michael Pisula <mi...@tngtech.com>.
Hi Annabel,

I am using Spark in stand-alone mode (deployed using the EC2 scripts
packaged with Spark).

Cheers,
Michael

>

-- 
Michael Pisula * michael.pisula@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Spark job uses only one Worker

Posted by Annabel Melongo <me...@yahoo.com.INVALID>.
Michael,

I don't know what your environment is, but if it's Cloudera you should be able
to see the link to your master in Hue.

Thanks


Re: Spark job uses only one Worker

Posted by Michael Pisula <mi...@tngtech.com>.
I had tried several parameters, including --total-executor-cores, with no effect.
As for the port, I tried 7077, but if I remember correctly I got some
kind of error that suggested trying 6066, with which it worked just fine
(apart from the issue described here).

Each worker has two cores. I also tried increasing the number of cores, again
with no effect. I was able to increase the number of cores the job was using on
one worker, but it would not use any other worker (and it would not start if
the number of cores the job wanted was higher than the number available on one
worker).


-- 
Michael Pisula * michael.pisula@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Spark job uses only one Worker

Posted by Igor Berman <ig...@gmail.com>.
Read about --total-executor-cores.
I am not sure why you specify port 6066 in the master URL; usually it's 7077.
Verify in the master UI (usually port 8080) how many cores are there (this
depends on other configs, but usually workers register with the master with all
their cores).


Re: Spark job uses only one Worker

Posted by Michael Pisula <mi...@tngtech.com>.
Hi,

I start the cluster using the spark-ec2 scripts, so the cluster is in
stand-alone mode.
Here is how I submit my job:
spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master
spark://<host>:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar

Cheers,
Michael


-- 
Michael Pisula * michael.pisula@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Re: Spark job uses only one Worker

Posted by Igor Berman <ig...@gmail.com>.
Share how you submit your job,
and what kind of cluster it is (YARN, standalone).
