Posted to user@spark.apache.org by ameyc <am...@gmail.com> on 2014/12/03 22:26:22 UTC

Alternatives to groupByKey

Hi,

So my Spark app needs to run a sliding window over a time-series dataset
(I'm not using Spark Streaming), and then run different types of
aggregations on a per-window basis. Right now I'm using groupByKey(), which
gives me an Iterable for each window. There are a few concerns I have with
this approach:

1. groupByKey() could potentially fail when the values for a key don't fit
in memory.
2. I'd like to run aggregations like max() and mean() on each of the groups;
it'd be nice to have RDD functionality at that point instead of plain
Iterables.
3. I can't use reduceByKey() or aggregateByKey(), as some of my aggregations
need a view of the entire window.

The only other way I could think of is partitioning my RDD into multiple RDDs,
with each RDD representing a window. Is this a sensible approach? Or is
there any other way of going about this?
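
Here's roughly what I'm doing today, sketched with made-up record and helper
names (assuming a SparkContext called sc, as in the shell):

    import org.apache.spark.SparkContext._  // pair-RDD functions (Spark 1.x)

    case class Event(ts: Long, value: Double)

    // every window start whose [start, start + size) range covers ts,
    // for windows of length `size` sliding by `step`
    def windowsFor(ts: Long, size: Long = 60L, step: Long = 10L): Seq[Long] = {
      val last = (ts / step) * step
      (last - size + step) to last by step
    }

    val events = sc.parallelize(Seq(Event(5L, 1.0), Event(12L, 3.0), Event(63L, 2.0)))

    val perWindow = events
      .flatMap(e => windowsFor(e.ts).map(w => (w, e.value)))
      .groupByKey()                                 // (windowStart, Iterable[Double])
      .mapValues(vs => (vs.max, vs.sum / vs.size))  // e.g. max and mean per window

It's the groupByKey() and the plain Iterable in the last two lines that
worry me.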






Re: Alternatives to groupByKey

Posted by Xuefeng Wu <be...@gmail.com>.
Looks good.

My concern is about foldLeftByKey: it looks like it breaks consistency with
foldLeft on RDD and aggregateByKey on PairRDD.
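
For comparison, the shapes of the three operations as I understand them
(simplified; the foldLeftByKey one is hypothetical, from the JIRA discussion):

    trait RddLike[T] {
      // fold: op must be commutative and associative, so value order never matters
      def fold(zero: T)(op: (T, T) => T): T
    }

    trait PairRddLike[K, V] {
      // aggregateByKey: separate seqOp and combOp, again order-insensitive
      def aggregateByKey[U](zero: U)(seqOp: (U, V) => U, combOp: (U, U) => U): PairRddLike[K, U]

      // a foldLeftByKey would take a single op applied left-to-right over each
      // key's sorted values; there is no combOp to merge partial results
      def foldLeftByKey[B](zero: B)(op: (B, V) => B): PairRddLike[K, B]
    }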


Yours respectfully, Xuefeng Wu 吴雪峰

> On December 4, 2014, at 7:47 AM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> foldLeftByKey



Re: Alternatives to groupByKey

Posted by Koert Kuipers <ko...@tresata.com>.
Do these requirements boil down to a need for a foldLeftByKey with sorting
of the values?

https://issues.apache.org/jira/browse/SPARK-3655
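
Until something like that lands, here is a rough sketch of how it can be
emulated with a secondary sort. repartitionAndSortWithinPartitions is new in
1.2, and the String/Long/Double types here are just for illustration:

    import scala.reflect.ClassTag
    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // partition on the key alone; sorting the (key, timestamp) composite key
    // within each partition then puts every key's values together, in time order
    class KeyPartitioner(n: Int) extends Partitioner {
      def numPartitions = n
      def getPartition(composite: Any) = {
        val (key, _) = composite.asInstanceOf[(String, Long)]
        math.abs(key.hashCode) % n
      }
    }

    def foldLeftByKey[B: ClassTag](data: RDD[(String, (Long, Double))], zero: B)
                                  (op: (B, Double) => B): RDD[(String, B)] =
      data.map { case (k, (ts, v)) => ((k, ts), v) }
        .repartitionAndSortWithinPartitions(new KeyPartitioner(data.partitions.size))
        .mapPartitions { it =>
          // values arrive per key in timestamp order; keep one running fold
          // state per key instead of materializing any key's values as a group
          var state = Map.empty[String, B]
          it.foreach { case ((k, _), v) =>
            state = state.updated(k, op(state.getOrElse(k, zero), v))
          }
          state.iterator
        }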



Re: Alternatives to groupByKey

Posted by Xuefeng Wu <be...@gmail.com>.
I have a similar requirement: take the top N values by key. Right now I use groupByKey, but in some datasets a single key can account for more than half the data.
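
What I may try instead is keeping only a bounded "top N so far" per key, so
no single key is ever grouped in full. A sketch, with N and the types just
for illustration:

    import org.apache.spark.SparkContext._

    val N = 10

    // keep at most the N largest values seen so far
    def keep(top: List[Double], v: Double): List[Double] =
      (v :: top).sorted(Ordering[Double].reverse).take(N)

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 5.0), ("a", 3.0), ("b", 2.0)))

    val topN = pairs.aggregateByKey(List.empty[Double])(
      keep,
      (a, b) => (a ++ b).sorted(Ordering[Double].reverse).take(N))

With many values per key a bounded priority queue would beat re-sorting a
small list, but the shape is the same: the state per key never grows past N.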

Yours respectfully, Xuefeng Wu 吴雪峰


Re: Alternatives to groupByKey

Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
I think it would depend on the type and amount of information you're
collecting.

If you're just trying to collect small numbers for each window, and don't
have an overwhelming number of windows, you might consider using
accumulators.  Just make one per value per time window, and for each data
point, add it to the accumulators for the time windows in which it
belongs.  We've found this approach a lot faster than anything involving a
shuffle.  This should work fine for stuff like max(), min(), and mean().
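
For example, mean per window with one (sum, count) accumulator pair per
window; this borrows the Event record and windowsFor() helper sketched in the
original post, and assumes the window start times are known up front:

    val windowStarts = Seq(0L, 10L, 20L)
    val accs = windowStarts.map { w =>
      w -> (sc.accumulator(0.0), sc.accumulator(0L))
    }.toMap

    // foreach is an action, so Spark applies each task's accumulator
    // updates exactly once, and no shuffle is involved
    events.foreach { e =>
      windowsFor(e.ts).foreach { w =>
        accs.get(w).foreach { case (sum, count) =>
          sum += e.value
          count += 1L
        }
      }
    }

    val means = accs.map { case (w, (s, c)) => w -> s.value / c.value }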

If you're collecting enough data that accumulators are impractical, I think
I would try multiple passes.  Cache your data, and for each pass, filter to
that window, and perform all your operations on the filtered RDD.  Because
of the caching, it won't be significantly slower than processing it all at
once - in fact, it will probably be a lot faster, because the shuffles are
shuffling less information.  This is similar to what you're suggesting
about partitioning your RDD, but probably simpler and easier.
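
Sketched, again with the records from the original post (medianOf is a
hypothetical stand-in for an aggregation that genuinely needs the whole
window at once):

    import org.apache.spark.rdd.RDD

    events.cache()  // pay for reading the data once; every pass after that is cheap

    // hypothetical stand-in: an exact median really does need all the values
    def medianOf(values: RDD[Double]): Double = {
      val sorted = values.sortBy(identity)
      val n = sorted.count()
      sorted.zipWithIndex().filter { case (_, i) => i == n / 2 }.map(_._1).first()
    }

    val medians = windowStarts.map { w =>
      // each pass filters down to a single window; the result is a plain RDD,
      // so restriction 3 from the original post goes away
      val inWindow = events.filter(e => e.ts >= w && e.ts < w + 60L)
      w -> medianOf(inWindow.map(_.value))
    }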

That being said, your restriction 3 seems to contradict the rest of your
request: if your aggregation needs to look at all of a window's data at once,
that seems at odds with wanting to view that data through an RDD.  Could you
explain a bit more what you mean by that?

                -Nathan




-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com