Posted to user@spark.apache.org by "on" <sc...@web.de> on 2016/07/25 17:21:46 UTC

Performance tuning for standalone on one host

Dear all,

I am running Spark on one host ("local[2]"), doing calculations like the
following on a socket stream:
# keep only messages whose header starts with 'test', keyed by host
mainStream = socketStream.filter(lambda msg: msg['header'].startswith('test')) \
                         .map(lambda x: (x['host'], x))
# two independent state streams over the same keyed input
s1 = mainStream.updateStateByKey(updateFirst).map(lambda x: (1, x))
s2 = mainStream.updateStateByKey(updateSecond,
                                 initialRDD=initialMachineStates).map(lambda x: (2, x))
out.join(bla2).foreachRDD(no_out)
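
For completeness, a minimal self-contained version of this job is sketched
below; the stream source, the update and output functions, and the initial
state are simplified placeholders, and here the two state streams are joined
on the host key directly:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def updateFirst(new_values, state):
    # placeholder state update: keep the most recent message per host
    return new_values[-1] if new_values else state

def updateSecond(new_values, state):
    # placeholder state update
    return new_values[-1] if new_values else state

def no_out(rdd):
    # placeholder output action
    print(rdd.count())

sc = SparkContext("local[2]", "SocketStateTest")
ssc = StreamingContext(sc, 1)            # 1 s batch interval
ssc.checkpoint("/tmp/spark-checkpoint")  # required by updateStateByKey

# placeholder parsing; the real messages are dicts with 'header' and 'host'
socketStream = ssc.socketTextStream("localhost", 9999) \
    .map(lambda line: {'header': line, 'host': 'host1'})

# placeholder initial state for updateSecond
initialMachineStates = sc.parallelize([('host1', None)])

mainStream = socketStream.filter(lambda msg: msg['header'].startswith('test')) \
                         .map(lambda x: (x['host'], x))
s1 = mainStream.updateStateByKey(updateFirst)
s2 = mainStream.updateStateByKey(updateSecond, initialRDD=initialMachineStates)
s1.join(s2).foreachRDD(no_out)

ssc.start()
ssc.awaitTermination()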

I measured that each calculation alone has a processing time of about 400 ms,
but the processing time of the code above is over 3 seconds on average.

I know there are a lot of unknown parameters here, but does anybody have
hints on how to tune this code / system? I have already changed a lot of
parameters, such as the number of executors and cores (see the example
settings below).
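
For reference, these are the kinds of settings I have been varying; the
values here are only examples, not what I finally use:

from pyspark import SparkConf

conf = (SparkConf()
        .setMaster("local[2]")                           # number of worker threads
        .set("spark.default.parallelism", "2")           # tasks used for joins/shuffles
        .set("spark.streaming.blockInterval", "200ms"))  # receiver block granularity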

Thanks in advance and best regards,
on

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Performance tuning for local mode on one host

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi On,

When you run in local mode there is only one SparkSubmit process, with the
driver and a single executor in the same JVM. How many cores do you have?

Each core allows another task to run concurrently, so with local[8] you
will have up to 8 tasks running the same code on subsets of your data.
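
For example, a minimal sketch of that (the app name is arbitrary):

from pyspark import SparkConf, SparkContext

# local[8]: one JVM running 8 worker threads; each thread executes one task
conf = SparkConf().setMaster("local[8]").setAppName("test")
sc = SparkContext(conf=conf)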

So run

cat /proc/cpuinfo | grep processor | wc -l

and determine how many logical processors (a.k.a. cores) you see.
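
Alternatively, from within Python itself:

import multiprocessing

# counts logical processors visible to the OS, same as the cpuinfo line above
print(multiprocessing.cpu_count())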


HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Performance tuning for local mode on one host

Posted by "on" <sc...@web.de>.
OK, sorry, I am running in local mode.
Just a very small setup...

(changed the subject)



Re: Performance tuning for local mode on one host

Posted by "on" <sc...@web.de>.
There are 4 cores on my system.

Running Spark with setMaster("local[2]") results in the following top output:
  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM   TIME+ COMMAND
    7 root  20   0 4748836 563400 29064 S  24.6  7.0 1:16.54 /usr/jdk1.8.0_101/bin/java -cp /conf/:/usr/spark-2.0.0-preview-bin-hadoop2.6/jars/* -Xmx1g org.apache.spark.de+
  114 root  20   0  114208  31956  7028 S  15.7  0.4 0:16.35 python -m pyspark.daemon
  117 root  20   0  114404  32116  7028 S  15.7  0.4 0:17.28 python -m pyspark.daemon
   41 root  20   0  443548  60920 10416 S   0.0  0.8 0:10.84 python /test.py
  111 root  20   0  101272  31740  9356 S   0.0  0.4 0:00.29 python -m pyspark.daemon

with a processing time of over 3 seconds running the code from my original
message. There must be a lot of overhead somewhere, as the code does nearly
nothing, i.e., there are no expensive calculations, on a socket stream
receiving one message per second.

How can I reduce this overhead?




Re: Performance tuning for standalone on one host

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

From your snippet I can see that you are running in local mode with two
cores, but that is not standalone mode.

Can you please clarify whether you start the master and slave processes.
Those are only used in standalone mode.

sbin/start-master.sh
sbin/start-slaves.sh
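
If you do run standalone, the application then points at the master URL
rather than local mode, along these lines (the host is a placeholder and
7077 is the default port):

from pyspark import SparkConf, SparkContext

# connect to a standalone cluster manager instead of running in local mode
conf = SparkConf().setMaster("spark://master-host:7077").setAppName("test")
sc = SparkContext(conf=conf)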

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


