Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2015/03/13 19:58:54 UTC
how to print RDD by key into file with groupByKey
Hi
I have an RDD of type RDD[(String, scala.Iterable[(Long, Int)])] that I want to write out to files, one file per key string.
I tried to trigger a repartition of the RDD by doing a group by on it. The grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])], so I flattened that:
Rdd.groupByKey().mapValues(x => x.flatten)
However, when I write it out with saveAsTextFile I get only 2 files.
I was under the impression that groupByKey repartitions the data by key and that saveAsTextFile makes a file per partition.
What am I doing wrong here?
Thanks
Adrian
Re: how to print RDD by key into file with groupByKey
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
If you want more partitions, then you have to specify the number explicitly:
Rdd.groupByKey(10).mapValues...
I think if you don't specify anything, the number of partitions will be the
number of cores that you have for processing.
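To illustrate why the number of output files follows the number of partitions rather than the number of keys, here is a minimal sketch in plain Scala with no Spark dependency. It assumes hash partitioning, where a key goes to partition (hashCode mod numPartitions), which is how Spark's default HashPartitioner assigns keys. The object name, the helper names and the sample keys are made up for illustration.

```scala
// Sketch of hash partitioning in plain Scala (no Spark needed).
// HashPartitionSketch, partitionFor and partition are hypothetical names.
object HashPartitionSketch {
  // Non-negative modulo, so a negative hashCode still maps to a valid index.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Bucket records by partition index, the way a hash partitioner would.
  def partition[V](records: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    records.groupBy { case (key, _) => partitionFor(key, numPartitions) }

  def main(args: Array[String]): Unit = {
    // Four distinct keys, but only two partitions.
    val records = Seq("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "a" -> 5)
    val buckets = partition(records, 2)
    buckets.toSeq.sortBy(_._1).foreach { case (index, recs) =>
      println(s"partition $index: ${recs.map(_._1).distinct.mkString(", ")}")
    }
    // prints:
    // partition 0: b, d
    // partition 1: a, c
  }
}
```

With four keys and two partitions, each partition holds two keys, so a per-partition write yields two files. Raising the partition count makes sharing less likely, but two keys can still hash to the same partition, so it does not by itself guarantee one file per key.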
Thanks
Best Regards