Posted to user@spark.apache.org by Daniel Imberman <da...@gmail.com> on 2016/02/24 01:49:37 UTC
Performing multiple aggregations over the same data
Hi guys,
I'm running into a speed issue: I have a dataset that needs to be
aggregated multiple times.
Initially my team set up three accumulators and ran a single
foreach loop over the data, something along the lines of:
val accum1: Accumulable[a]
val accum2: Accumulable[b]
val accum3: Accumulable[c]
data.foreach { u =>
  accum1 += u
  accum2 += u
  accum3 += u
}
I am trying to switch these accumulations into an aggregation so that I can
get a speed boost and have access to accumulators for debugging. I am
currently trying to figure out a way to aggregate these three types at
once, since running three separate aggregations is significantly slower. Does
anyone have any thoughts as to how I can do this?
Thank you
Re: Performing multiple aggregations over the same data
Posted by Nick Sabol <ni...@gmail.com>.
Yeah, sounds like you want to aggregate to a triple, like:
data.aggregate((0, 0, 0))(
  (z, n) => ???,    // seqOp: fold element n into the running triple z (z starts at the zero value)
  (a1, a2) => ???   // combOp: combine two previous aggregations
)
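To make the two placeholder functions concrete, here is a minimal sketch of the triple aggregation. Everything specific in it is an assumption, since the thread never says what `data` holds or what the three accumulators compute: the `Event` record type and the choice of count, sum, and max are purely illustrative. Plain Scala collections stand in for the RDD so the snippet runs without a cluster; two inner `Seq`s play the role of two partitions.

```scala
object TripleAggregation {
  // Hypothetical record type; the thread does not say what `data` contains.
  final case class Event(value: Int)

  // All three running results travel together in one tuple: (count, sum, max).
  type Acc = (Long, Long, Int)

  val zero: Acc = (0L, 0L, Int.MinValue)

  // seqOp: fold one element into a partition-local accumulator.
  def seqOp(acc: Acc, e: Event): Acc =
    (acc._1 + 1, acc._2 + e.value, math.max(acc._3, e.value))

  // combOp: merge the accumulators produced by two partitions.
  def combOp(a: Acc, b: Acc): Acc =
    (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))

  // Two Seqs stand in for two RDD partitions (an assumption for illustration).
  val partitions: Seq[Seq[Event]] =
    Seq(Seq(Event(3), Event(7)), Seq(Event(5)))

  // Mimics what RDD.aggregate does: seqOp within each partition, combOp across them.
  val result: Acc =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
  // result == (3L, 15L, 7)
}
```

On an actual `RDD[Event]`, the same pieces would be passed as `data.aggregate(zero)(seqOp, combOp)`, which makes one pass over the data rather than three. Note that the zero value must be neutral for `combOp`, because Spark applies it once per partition and once more when merging.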
On Tue, Feb 23, 2016 at 10:40 PM, Michał Zieliński <
zielinski.michal0@gmail.com> wrote:
> Do you mean something like this?
>
> data.agg(sum("var1"), sum("var2"), sum("var3"))
Re: Performing multiple aggregations over the same data
Posted by Michał Zieliński <zi...@gmail.com>.
Do you mean something like this?
data.agg(sum("var1"), sum("var2"), sum("var3"))
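For reference, a self-contained sketch of that DataFrame call might look like the following. It assumes Spark 2.x or later (for `SparkSession`; on the 1.6 line current when this thread was written, a `SQLContext` would fill that role), a local master, and made-up sample rows. Only the column names `var1`, `var2`, and `var3` come from the snippet above; `MultiSum` and `sums` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object MultiSum {
  // Computes all three column sums in a single aggregation job.
  def sums(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    // Made-up sample data; the real schema is not given in the thread.
    val data = Seq((1, 10, 100), (2, 20, 200), (3, 30, 300))
      .toDF("var1", "var2", "var3")

    // One agg call with three aggregate expressions yields a single result row.
    val row = data.agg(sum("var1"), sum("var2"), sum("var3")).head()

    // Summing integer columns produces long values.
    (row.getLong(0), row.getLong(1), row.getLong(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-sum")
      .getOrCreate()
    try println(sums(spark)) // the three sums are 6, 60 and 600
    finally spark.stop()
  }
}
```

This only fits if the three aggregations can be expressed as built-in column functions; arbitrary accumulator logic is closer to the RDD `aggregate` approach.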