Posted to dev@spark.apache.org by Alexander Pivovarov <ap...@gmail.com> on 2016/06/09 03:42:55 UTC
rdd.distinct with Partitioner
Most of the RDD methods that shuffle data take a Partitioner as a parameter, but rdd.distinct does not have such a signature.
Should I open a PR for that?
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
}
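The proposed method only threads the partitioner through the existing distinct implementation. As a rough illustration of the map(x => (x, null)).reduceByKey(...).map(_._1) trick itself, here is a plain-Python sketch of the same dedup-by-key idea (this simulates the semantics; it is not Spark code):

```python
def distinct_via_reduce_by_key(xs):
    # Mirror map(x => (x, null)): pair every element with a dummy value.
    pairs = [(x, None) for x in xs]
    # Mirror reduceByKey((x, y) => x): keep one value per distinct key.
    reduced = {}
    for key, value in pairs:
        if key not in reduced:
            reduced[key] = value
    # Mirror map(_._1): drop the dummy values, keeping only the keys.
    return list(reduced.keys())

result = distinct_via_reduce_by_key([1, 1, 2, 2, 3, 3])
```

In Spark the reduceByKey step is what performs the shuffle, which is why a Partitioner parameter would slot in there.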
Re: rdd.distinct with Partitioner
Posted by Mridul Muralidharan <mr...@gmail.com>.
The example violates the basic contract of a Partitioner: the same key must always map to the same partition.
It does make sense to take a Partitioner as a param to distinct, though it is fairly trivial to simulate that in user code as well ...
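To make that contract concrete: a Partitioner must be a pure function of the key, so every occurrence of the same key is routed to the same partition. A small plain-Python simulation (not Spark code; the helper names are made up for illustration) shows why per-partition deduplication silently breaks when that property is violated:

```python
def dedup_with_partitioner(data, num_partitions, partition_fn):
    # Simulate a shuffle: route each element to a partition, then
    # deduplicate within each partition, as reduceByKey would.
    partitions = [set() for _ in range(num_partitions)]
    for x in data:
        partitions[partition_fn(x)].add(x)
    return [x for part in partitions for x in part]

data = [1, 1, 2, 2, 3, 3]

# Lawful partitioner: a pure function of the key.
lawful = dedup_with_partitioner(data, 2, lambda x: hash(x) % 2)

# Contract-violating "partitioner": depends on call order, not the key,
# so duplicates of the same key can land in different partitions.
calls = {"n": 0}
def round_robin(_key):
    calls["n"] += 1
    return calls["n"] % 2

broken = dedup_with_partitioner(data, 2, round_robin)
```

With the lawful partitioner the duplicates collide in one partition and are removed; with the order-dependent one every duplicate survives, so the "distinct" result still contains all six elements.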
Regards
Mridul
On Wednesday, June 8, 2016, 汪洋 <ti...@icloud.com> wrote:
> Hi Alexander,
>
> I think the result is not guaranteed to be correct if an arbitrary Partitioner is
> passed in.
>
> I have created a notebook and you can check it out. (
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
> )
>
> Best regards,
>
> Yang
Re: rdd.distinct with Partitioner
Posted by 汪洋 <ti...@icloud.com>.
Frankly speaking, I think reduceByKey with a Partitioner has the same problem, and it should not be exposed to public users either, because it is hard to fully understand how the partitioner behaves without looking at the actual code.
And if there exists a basic contract for a Partitioner, maybe it should be stated explicitly in the documentation if it is not enforced by code.
However, I don't feel strongly enough to argue about this issue beyond stating my concern. It will not cause much trouble anyway once users learn the semantics. It is just a judgment call by the API designer.
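The same failure mode can be simulated for reduceByKey: if the partitioner is not a pure function of the key, partial results for one key end up in different partitions and are never combined. Again a plain-Python sketch of the semantics, not Spark code:

```python
def reduce_by_key(pairs, num_partitions, partition_fn, combine):
    # Shuffle by key, then combine values per key inside each partition.
    partitions = [{} for _ in range(num_partitions)]
    for key, value in pairs:
        bucket = partitions[partition_fn(key)]
        bucket[key] = combine(bucket[key], value) if key in bucket else value
    # Concatenate per-partition results; a lawful partitioner guarantees
    # each key appears in exactly one partition.
    return [(k, v) for bucket in partitions for k, v in bucket.items()]

pairs = [("a", 1), ("a", 2), ("b", 3)]

# Lawful: both "a" records land in the same partition, so 1 + 2 = 3.
good = reduce_by_key(pairs, 2, lambda k: hash(k) % 2, lambda x, y: x + y)

# Order-dependent "partitioner": the two "a" records are split across
# partitions, yielding two partial sums for the same key.
calls = {"n": 0}
def round_robin(_key):
    calls["n"] += 1
    return calls["n"] % 2

bad = reduce_by_key(pairs, 2, round_robin, lambda x, y: x + y)
```

The lawful run produces one entry per key; the broken run produces two separate entries for "a", which is exactly the kind of silent wrong answer the contract exists to rule out.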
> On June 9, 2016, at 12:51 PM, Alexander Pivovarov <ap...@gmail.com> wrote:
>
> reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result.
>
> Why does reduceByKey with a Partitioner exist, then?
Re: rdd.distinct with Partitioner
Posted by Alexander Pivovarov <ap...@gmail.com>.
reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result.
Why does reduceByKey with a Partitioner exist, then?
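A Partitioner parameter is useful precisely when the partitioner does honor the contract: it lets you control the partition count and key placement (for example, to co-locate related keys or avoid an extra shuffle) without affecting correctness. A plain-Python sketch of a lawful domain-specific partitioner (the names here are illustrative, not Spark API):

```python
def first_letter_partitioner(key, num_partitions):
    # Pure function of the key: every occurrence of a given key lands in
    # the same partition, so correctness is preserved while placement is
    # controlled.
    return (ord(key[0]) - ord("a")) % num_partitions

words = ["apple", "avocado", "banana", "apple"]
partitions = [[] for _ in range(2)]
for w in words:
    partitions[first_letter_partitioner(w, 2)].append(w)
```

Both copies of "apple" end up in the same partition, so a per-partition reduce (distinct or a sum) would still combine them correctly; a random partitioner gives no such guarantee.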
On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 <ti...@icloud.com> wrote:
> Hi Alexander,
>
> I think the result is not guaranteed to be correct if an arbitrary Partitioner is
> passed in.
>
> I have created a notebook and you can check it out. (
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
> )
>
> Best regards,
>
> Yang
Re: rdd.distinct with Partitioner
Posted by 汪洋 <ti...@icloud.com>.
Hi Alexander,
I think the result is not guaranteed to be correct if an arbitrary Partitioner is passed in.
I have created a notebook and you can check it out: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
Best regards,
Yang
> On June 9, 2016, at 11:42 AM, Alexander Pivovarov <ap...@gmail.com> wrote:
>
> Most of the RDD methods that shuffle data take a Partitioner as a parameter, but rdd.distinct does not have such a signature.
>
> Should I open a PR for that?
>
> /**
>  * Return a new RDD containing the distinct elements in this RDD.
>  */
> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
> }