Posted to dev@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2015/01/29 18:52:00 UTC

Re: RDD.combineBy without intermediate (k,v) pair allocation

Francois,
RDD.aggregate() does not support aggregation by key. But, indeed, that is the kind of implementation I am looking for: one that does not allocate intermediate space for storing (K, V) pairs. When working with large datasets, this type of intermediate memory allocation wreaks havoc with garbage collection, not to mention unnecessarily increasing the working memory requirement of the program.

I wonder if someone has already noticed this and there is an effort underway to optimize this. If not, I will take a shot at adding this functionality.
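To make this concrete, here is a rough sketch, assuming an existing SparkContext `sc`. The two-step version at the top is what one has to write today; the `combineBy` signature at the end is hypothetical, not an existing API:

    import org.apache.spark.SparkContext._  // pair RDD implicits in Spark 1.2

    // Today: an extra Tuple2 is allocated for every element just to give
    // PairRDDFunctions.combineByKey something keyed to work on.
    val words = sc.parallelize(Seq("apple", "avocado", "banana"))
    val counts = words
      .map(w => (w.head, w))                // intermediate (K, V) allocation
      .combineByKey(
        (w: String) => 1,                   // createCombiner
        (acc: Int, w: String) => acc + 1,   // mergeValue
        (a: Int, b: Int) => a + b)          // mergeCombiners

    // Hypothetical API I have in mind: take a key-extractor function, as
    // RDD.groupBy does, so no per-element pair needs to be materialized:
    //
    //   def combineBy[K, C](keyOf: T => K,
    //                       createCombiner: T => C,
    //                       mergeValue: (C, T) => C,
    //                       mergeCombiners: (C, C) => C): RDD[(K, C)]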

Mohit.

> On Jan 27, 2015, at 1:52 PM, francois.garillot@typesafe.com wrote:
> 
> Have you looked at the `aggregate` function in the RDD API?
> 
> If your way of extracting the “key” (identifier) and “value” (payload) parts of the RDD elements is uniform (a function), it’s unclear to me how this would be more efficient than extracting key and value and then using combine, however.
> 
> —
> FG
> 
> 
> On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi <mohitjaggi@gmail.com> wrote:
> 
> Hi All, 
> I have a use case with an RDD that is not of (K, V) pairs, on which I want to do a combineByKey() operation. I can do that by creating an intermediate RDD of (K, V) pairs and using PairRDDFunctions.combineByKey(). However, I believe it would be more efficient to avoid this intermediate RDD altogether. Is there a way to do this by passing in a function that extracts the key, as in RDD.groupBy()? [Oops, RDD.groupBy seems to create the intermediate RDD anyway; maybe a better implementation is possible for that too?]
> If not, is it worth adding to the Spark API? 
> 
> Mohit. 


Re: RDD.combineBy without intermediate (k,v) pair allocation

Posted by fr...@typesafe.com.
Sorry, I answered too fast. Please disregard my last message: I did mean `aggregate`.

You say: "RDD.aggregate() does not support aggregation by key."

What would you need aggregation by key for if you do not, to begin with, have an RDD of key-value pairs and do not want to build one? Could you share more about the kind of processing you have in mind?
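For reference, a minimal sketch of `aggregate`, assuming an existing SparkContext `sc`; it folds the entire RDD down to a single value, with no per-key grouping anywhere:

    val words = sc.parallelize(Seq("apple", "avocado", "banana"))

    // aggregate reduces the whole RDD to ONE value of the result type;
    // there is no notion of a per-key result here.
    val totalChars = words.aggregate(0)(
      (acc, w) => acc + w.length,  // seqOp: fold elements within a partition
      (a, b) => a + b)             // combOp: merge per-partition results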


—
FG


Re: RDD.combineBy without intermediate (k,v) pair allocation

Posted by fr...@typesafe.com.
Oh, I’m sorry, I meant `aggregateByKey`.

https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
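A minimal sketch of its use, assuming an existing SparkContext `sc`. Note that `aggregateByKey` is defined on pair RDDs, so it still needs the map-to-pairs step you are trying to avoid:

    import org.apache.spark.SparkContext._  // pair RDD implicits in Spark 1.2

    val words = sc.parallelize(Seq("apple", "avocado", "banana"))

    // aggregateByKey lives on PairRDDFunctions, so the intermediate
    // (K, V) mapping step is still required to reach it.
    val charsByInitial = words
      .map(w => (w.head, w))
      .aggregateByKey(0)(
        (acc, w) => acc + w.length,  // seqOp within one partition
        (a, b) => a + b)             // combOp across partitions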

—
FG
