You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/11/13 03:01:29 UTC

flatMap followed by mapPartitions

Hi,

I am doing a flatMap followed by mapPartitions to do some blocked
operation...flatMap is shuffling data but this shuffle is strictly
shuffling to disk and not over the network right ?

Thanks.
Deb

Re: flatMap followed by mapPartitions

Posted by Debasish Das <de...@gmail.com>.
mapPartitions tried to hold data is memory which did not work for me..

I am doing flatMap followed by groupByKey now with HashPartitioner and
number of blocks is 60 (Based on 120 cores I am running the job on)...

Now when the shuffle size < 100 GB it works fine...as flatMap shuffle goes
to 200 GB, 400 GB...I am getting:

FetchFailed(BlockManagerId(1, istgbd013.verizon.com, 44377, 0),
shuffleId=37, mapId=8, reduceId=54)

I have to shuffle because the memory on cluster is less than the shuffle
size of 400 GB..

The job runs fine if I sample and decrease my shuffle size within 100 GB..

Does groupByKey does a combiner similar to reduceByKey and aggregateByKey ?
I need a combiner operation to do some work on map side after flatMap
followed by rest of the work on reducers..

On Wed, Nov 12, 2014 at 8:35 PM, Mayur Rustagi <ma...@gmail.com>
wrote:

> flatmap would have to shuffle data only if output RDD is expected to be
> partitioned by some key.
> RDD[X].flatmap(X=>RDD[Y])
> If it has to shuffle it should be local.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das <de...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am doing a flatMap followed by mapPartitions to do some blocked
>> operation...flatMap is shuffling data but this shuffle is strictly
>> shuffling to disk and not over the network right ?
>>
>> Thanks.
>> Deb
>>
>
>

Re: flatMap followed by mapPartitions

Posted by Mayur Rustagi <ma...@gmail.com>.
flatmap would have to shuffle data only if output RDD is expected to be
partitioned by some key.
RDD[X].flatmap(X=>RDD[Y])
If it has to shuffle it should be local.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das <de...@gmail.com>
wrote:

> Hi,
>
> I am doing a flatMap followed by mapPartitions to do some blocked
> operation...flatMap is shuffling data but this shuffle is strictly
> shuffling to disk and not over the network right ?
>
> Thanks.
> Deb
>