
Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2015/03/13 19:58:54 UTC

how to print RDD by key into file with groupByKey

Hi
I have an RDD of type RDD[(String, scala.Iterable[(Long, Int)])] that I want to write to files, one file per key string.
I tried to trigger a repartition of the RDD by calling groupByKey on it. The grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])], so I flattened the values:
  Rdd.groupByKey().mapValues(x=>x.flatten)

However, when I write the result with saveAsTextFile I get only 2 files.

I was under the impression that groupByKey repartitions the data by key and saveAsTextFile makes a file per partition.
What am I doing wrong here?


Thanks
Adrian

Re: how to print RDD by key into file with groupByKey

Posted by Akhil Das <ak...@sigmoidanalytics.com>.

If you want more partitions, you have to specify the number explicitly:

Rdd.groupByKey(10).mapValues...
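
For context, here is a minimal sketch of that call in a complete program. It assumes a local SparkContext and made-up sample data; the names `GroupByKeyPartitions`, `regroup`, and the output path are hypothetical and not from the original post. Note that requesting 10 partitions gives 10 part files, which is not the same as one file per key.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on Spark versions before 1.3
import org.apache.spark.rdd.RDD

object GroupByKeyPartitions {
  // Group by key into the requested number of partitions, then flatten
  // the nested Iterable[Iterable[...]] that groupByKey produces.
  def regroup(rdd: RDD[(String, Iterable[(Long, Int)])],
              numPartitions: Int): RDD[(String, Iterable[(Long, Int)])] =
    rdd.groupByKey(numPartitions).mapValues(_.flatten)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("groupByKey-partitions").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical sample data in the shape described in the question.
    val data: Seq[(String, Iterable[(Long, Int)])] = Seq(
      ("a", Iterable((1L, 1), (2L, 2))),
      ("b", Iterable((3L, 3))),
      ("a", Iterable((4L, 4)))
    )

    val grouped = regroup(sc.parallelize(data), 10)

    // saveAsTextFile writes one part file per partition, including empty ones.
    println(grouped.partitions.length)
    grouped.saveAsTextFile("/tmp/grouped-output") // hypothetical path; must not already exist

    sc.stop()
  }
}
```

With only two distinct keys and 10 partitions, most of the part files in this sketch would be empty.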

I think if you don't specify anything, the number of partitions will be the
number of cores that you have for processing.
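
Whatever the default turns out to be, the partition count is independent of the number of distinct keys. With hash partitioning, a key is assigned to a partition by taking its hashCode modulo the partition count, so several keys can end up in the same partition and therefore the same output file. The following is a plain-Scala sketch of that rule, modeled on how Spark's HashPartitioner behaves rather than calling it; the helper `partitionFor` and the sample keys are made up for illustration.

```scala
object HashPartitionSketch {
  // Non-negative modulo of the key's hashCode: the usual hash-partitioning rule.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "b", "c", "d", "e") // hypothetical keys
    val numPartitions = 2

    // Group the keys by the partition each one would be assigned to.
    val byPartition = keys.groupBy(partitionFor(_, numPartitions))
    byPartition.toSeq.sortBy(_._1).foreach { case (p, ks) =>
      println(s"partition $p: ${ks.mkString(", ")}")
    }
  }
}
```

With two partitions, the five keys here share two partitions ("b" and "d" in partition 0; "a", "c", and "e" in partition 1), so saving such an RDD would give two files, not five.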

Thanks
Best Regards
