Posted to user@spark.apache.org by Hu...@Dell.com on 2013/11/14 03:33:09 UTC
interesting finding per using union
Hi,
I am creating an initial JavaRDD with 32 partitions, then looping over my data and unioning each new RDD with the initial one, as follows:
JavaRDD<String> dataSetRDD = null;
JavaRDD<String> unionDataSetRDD = null;
for (..) {  // loop header elided in the original post
    if (0 == i) {
        // first pass: create the initial RDD with 32 partitions
        unionDataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
    } else {
        // later passes: build another 32-partition RDD and union it in
        dataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
        unionDataSetRDD = unionDataSetRDD.union(dataSetRDD);
    }
} // for
System.out.println("unionDataSetRDD: " + unionDataSetRDD.toDebugString());
Output:
unionDataSetRDD: UnionRDD[6] at union at DatasetServiceImpl.java:174 (128 partitions)
  UnionRDD[4] at union at DatasetServiceImpl.java:174 (96 partitions)
    UnionRDD[2] at union at DatasetServiceImpl.java:174 (64 partitions)
      ParallelCollectionRDD[0] at parallelize at DatasetServiceImpl.java:167 (32 partitions)
      ParallelCollectionRDD[1] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
    ParallelCollectionRDD[3] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
  ParallelCollectionRDD[5] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
The interesting part is that my final unionDataSetRDD ends up with 128 partitions. I thought it would keep the 32 partitions I explicitly set in parallelize.
Does the above make sense?
Thanks,
Hussam
RE: interesting finding per using union
Posted by Hu...@Dell.com.
Yes, I unioned four RDDs of 32 partitions each.
Thank you,
Hussam
From: Matei Zaharia [mailto:matei.zaharia@gmail.com]
Sent: Wednesday, November 13, 2013 10:37 PM
To: user@spark.incubator.apache.org
Subject: Re: interesting finding per using union
Union just puts the data in two RDDs together, so you get an RDD containing the elements of both, and with the partitions that would've been in both. It's not a unique set union (that would be union() then distinct()). Here you've unioned four RDDs of 32 partitions each to get 128. If you want to have fewer partitions in the final RDD, but do want to include all that data together, you can call coalesce() after unioning them.
Matei
Re: interesting finding per using union
Posted by Matei Zaharia <ma...@gmail.com>.
Union just puts the data in two RDDs together, so you get an RDD containing the elements of both, and with the partitions that would’ve been in both. It’s not a unique set union (that would be union() then distinct()). Here you’ve unioned four RDDs of 32 partitions each to get 128. If you want to have fewer partitions in the final RDD, but do want to include all that data together, you can call coalesce() after unioning them.
Matei
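For anyone reading this thread later, the partition arithmetic described above can be sketched without a cluster. The following is only a plain-Java model of the bookkeeping, not Spark code: the class `PartitionModel` and its three methods are hypothetical stand-ins that mimic, loosely, what `parallelize`, `union` and `coalesce` do to the partition count. It assumes 100 sample rows and the same four 32-partition inputs as in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of RDD partition bookkeeping. Hypothetical; does not use Spark. */
class PartitionModel {

    /** Splits data into numSlices contiguous partitions, loosely like parallelize(data, numSlices). */
    static List<List<String>> parallelize(List<String> data, int numSlices) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numSlices; p++) {
            int start = (int) ((long) p * data.size() / numSlices);
            int end = (int) ((long) (p + 1) * data.size() / numSlices);
            partitions.add(new ArrayList<>(data.subList(start, end)));
        }
        return partitions;
    }

    /** Keeps every partition of both inputs, so the partition counts add up. */
    static List<List<String>> union(List<List<String>> left, List<List<String>> right) {
        List<List<String>> result = new ArrayList<>(left);
        result.addAll(right);
        return result;
    }

    /** Merges neighbouring partitions into at most target groups, loosely like coalesce(target). */
    static List<List<String>> coalesce(List<List<String>> partitions, int target) {
        if (target < 1) {
            throw new IllegalArgumentException("target must be at least 1");
        }
        int groups = Math.min(target, partitions.size());
        List<List<String>> result = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            // Map input partition i onto one of the output groups, keeping neighbours together.
            int group = (int) ((long) i * groups / partitions.size());
            result.get(group).addAll(partitions.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add("row" + i);
        }

        // Same shape as the loop in the question: one initial input, then three unions.
        List<List<String>> unioned = parallelize(data, 32);
        System.out.println("after parallelize: " + unioned.size() + " partitions");
        for (int i = 1; i < 4; i++) {
            unioned = union(unioned, parallelize(data, 32));
            System.out.println("after union " + i + ": " + unioned.size() + " partitions");
        }

        // All elements are kept; only the number of partitions changes.
        List<List<String>> merged = coalesce(unioned, 32);
        System.out.println("after coalesce: " + merged.size() + " partitions");
    }
}
```

The model goes 32, 64, 96, 128 and then back to 32, which matches the counts in the toDebugString() output from the question. In the real program the equivalent step would presumably be a single call such as unionDataSetRDD.coalesce(32) after the loop, and union() followed by distinct() if duplicates should be dropped as well.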