Posted to user@spark.apache.org by Takeshi Yamamuro <li...@gmail.com> on 2016/02/05 05:41:05 UTC

Re: sc.textFile the number of the workers to parallelize

Hi,

It seems to me these tasks are just being assigned to executors on their
preferred nodes, so how about repartitioning the RDD?

s3File.repartition(9).count
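
A minimal sketch in spark-shell, assuming s3File was created from a
hypothetical s3n:// bucket path:

scala> val s3File = sc.textFile("s3n://your-bucket/data/*")  // hypothetical path
scala> s3File.partitions.length                              // e.g. 896 input splits
scala> s3File.repartition(9).count                           // shuffle, then count

The shuffle redistributes the data across the cluster, so the follow-on
tasks are no longer tied to the two preferred (rack-local) nodes.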

On Fri, Feb 5, 2016 at 5:04 AM, Lin, Hao <Ha...@finra.org> wrote:

> Hi,
>
>
>
> I have a question about the number of workers Spark uses to parallelize
> the loading of files with sc.textFile. When I use sc.textFile to access
> multiple files in AWS S3, it seems to use only 2 workers regardless of
> how many worker nodes I have in my cluster. How does Spark configure the
> parallelism with respect to the number of cluster nodes? In the following
> case, Spark splits 896 tasks between only two nodes, 10.162.97.235 and
> 10.162.97.237, while I have 9 nodes in the cluster.
>
>
>
> thanks
>
>
>
> Example of doing a count:
>
>  scala> s3File.count
>
> 16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
>
> 16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30)
> with 896 output partitions
>
> 16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at
> <console>:30)
>
> 16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
>
> 16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
>
> 16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing
> parents
>
> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in
> memory (estimated size 3.0 KB, free 228.3 KB)
>
> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as
> bytes in memory (estimated size 1834.0 B, free 230.1 KB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
>
> 16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at
> DAGScheduler.scala:1006
>
> 16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from
> ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
>
> 16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
>
> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
>
> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)



-- 
---
Takeshi Yamamuro

Re: sc.textFile the number of the workers to parallelize

Posted by Koert Kuipers <ko...@tresata.com>.
increase minPartitions:
sc.textFile(path, minPartitions = 9)
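
A quick sketch with a hypothetical path. Note that minPartitions is only a
lower bound on the number of input splits, so the actual partition count
can be higher (in the log above it was already 896):

scala> val s3File = sc.textFile("s3n://your-bucket/data/*", minPartitions = 9)
scala> s3File.partitions.length   // at least 9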


On Thu, Feb 4, 2016 at 11:41 PM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Hi,
>
> It seems to me these tasks are just being assigned to executors on their
> preferred nodes, so how about repartitioning the RDD?
>
> s3File.repartition(9).count