Posted to dev@spark.apache.org by myasuka <my...@live.com> on 2014/10/01 08:00:47 UTC

Re: How to use multi thread in RDD map function ?

Thanks for your advice. I have tried your recommended configuration, but

SPARK_WORKER_CORES=32
SPARK_WORKER_INSTANCES=1

still works better. In the official Spark standalone documentation, the
parameter 'SPARK_WORKER_INSTANCES' is described as: "You can make this more
than 1 if you have very large machines and would like multiple Spark worker
processes." Maybe our 16-node cluster, with 16 cores and 24 GB of memory per
node, is not a good fit for more than one worker instance; your suggested 4
executor instances at 8 GB each would already need 32 GB per node, more than
the 24 GB we have.
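
By the way, to come back to the original question of using multiple threads
inside an RDD map function: one idea is to batch several matrix pairs into
each task with mapPartitions and multiply them concurrently on a local
thread pool. A rough, untested sketch (the pool size of 16 matches our cores
per node; BlockID is our own key class from the earlier snippet; everything
else is illustrative):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import breeze.linalg.{DenseMatrix => BDM}

pairs.mapPartitions { iter =>
  // One fixed pool per task, sized to the cores this task should use.
  val pool = Executors.newFixedThreadPool(16)
  implicit val ec = ExecutionContext.fromExecutorService(pool)

  // Submit one multiplication per pair so they all run concurrently.
  val futures = iter.map { t =>
    Future {
      val b1 = t._2._1.asInstanceOf[BDM[Double]]
      val b2 = t._2._2.asInstanceOf[BDM[Double]]
      (new BlockID(t._1.row, t._1.column), b1 * b2)
    }
  }.toList // force submission of all futures before waiting on any

  val results = futures.map(f => Await.result(f, Duration.Inf))
  pool.shutdown()
  results.iterator
}

Note this sketch collects a whole partition's results in memory before
returning them, and it only pays off if we run few enough tasks per node
that the pools do not oversubscribe the cores.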

Yi Tian wrote
> Hi, myasuka
> 
> Have you checked the JVM GC time of each executor?
> 
> I think you should increase SPARK_EXECUTOR_CORES or
> SPARK_EXECUTOR_INSTANCES until you get enough concurrency.
> 
> Here is my recommended config:
> 
> SPARK_EXECUTOR_CORES=8
> SPARK_EXECUTOR_INSTANCES=4
> SPARK_WORKER_MEMORY=8G
> 
> note: make sure you have enough memory on each node, more than
> SPARK_EXECUTOR_INSTANCES * SPARK_WORKER_MEMORY.
> 
> Best Regards,
> 
> Yi Tian
> tianyi.asiainfo@...
> 
> On Sep 29, 2014, at 21:06, myasuka <myasuka@...> wrote:
> 
>> Our cluster is a standalone cluster with 16 computing nodes, each with 16
>> cores. When I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to
>> 32, with 512 tasks altogether, this setup helps increase the concurrency.
>> But if I set SPARK_WORKER_INSTANCES to 2 and SPARK_WORKER_CORES to 16, it
>> doesn't work well.
>> 
>> Thank you for your reply.
>> 
>> 
>> Yi Tian wrote
>>> for yarn-client mode:
>>> 
>>> SPARK_EXECUTOR_CORES * SPARK_EXECUTOR_INSTANCES = 2(or 3) *
>>> TotalCoresOnYourCluster
>>> 
>>> for standalone mode:
>>> 
>>> SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES = 2(or 3) *
>>> TotalCoresOnYourCluster
>>> 
>>> 
>>> 
>>> Best Regards,
>>> 
>>> Yi Tian
>>> tianyi.asiainfo@...
>>> 
>>> On Sep 28, 2014, at 17:59, myasuka <myasuka@...> wrote:
>>> 
>>>> Hi, everyone
>>>>   I have come across a problem with increasing the concurrency. In a
>>>> program, after the shuffle write, each node should fetch 16 pairs of
>>>> matrices to do matrix multiplication, such as:
>>>> 
>>>> import breeze.linalg.{DenseMatrix => BDM}
>>>> 
>>>> pairs.map(t => {
>>>>   val b1 = t._2._1.asInstanceOf[BDM[Double]]
>>>>   val b2 = t._2._2.asInstanceOf[BDM[Double]]
>>>>   val c = (b1 * b2).asInstanceOf[BDM[Double]]
>>>>   (new BlockID(t._1.row, t._1.column), c)
>>>> })
>>>> 
>>>>   Each node has 16 cores. However, no matter whether I set 16 tasks or
>>>> more per node, CPU utilization cannot get higher than 60%, which means
>>>> not every core on the node is computing. Then I checked the logs on the
>>>> web UI; judging by the amount of shuffle read and write in each task, I
>>>> see some tasks do one matrix multiplication, some do two, while some do
>>>> none.
>>>> 
>>>>   Thus, I thought of using Java multithreading to increase the
>>>> concurrency. I wrote a Scala program that uses Java multithreading
>>>> without Spark on a single node; watching the 'top' monitor, I found
>>>> this program can drive CPU usage up to 1500% (meaning nearly every core
>>>> is computing). But I have no idea how to use Java multithreading in an
>>>> RDD transformation.
>>>> 
>>>>   Can anyone provide some example code that uses Java multithreading in
>>>> an RDD transformation, or share any idea to increase the concurrency?
>>>> 
>>>> Thanks to all
>>>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org