Posted to user@spark.apache.org by Daniel Imberman <da...@gmail.com> on 2016/02/24 01:49:37 UTC

Performing multiple aggregations over the same data

Hi guys,

So I'm running into a speed issue where I have a dataset that needs to be
aggregated multiple times.

Initially my team had set up three accumulators and were running a single
foreach loop over the data. Something along the lines of

val accum1: Accumulable[a]
val accum2: Accumulable[b]
val accum3: Accumulable[c]

data.foreach { u =>
  accum1 += u
  accum2 += u
  accum3 += u
}

I am trying to switch these accumulations into an aggregation so that I can
get a speed boost and have access to accumulators for debugging. I am
currently trying to figure out a way to aggregate these three types at
once, since running 3 separate aggregations is significantly slower. Does
anyone have any thoughts as to how I can do this?

Thank you

Re: Performing multiple aggregations over the same data

Posted by Nick Sabol <ni...@gmail.com>.
Yeah, sounds like you want to aggregate to a triple, like

data.aggregate((0, 0, 0))(
  // seqOp: fold one element n into the running triple z (starts from the zero value)
  (z, n) => ???,
  // combOp: combine the triples from two previous aggregations
  (a1, a2) => ???
)
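
To make the shape of the two functions concrete, here is a minimal sketch that runs on plain Scala collections rather than on a cluster. The three statistics (sum, count, max), the names Acc, seqOp, combOp and aggregateLocally, and the sample numbers are all assumptions for illustration; they are not the accumulators from the original question. The nested Seq stands in for partitions, and aggregateLocally only mimics what RDD.aggregate does with the same two functions.

```scala
object TripleAggregate {
  // Accumulated state: (sum, count, max). Illustrative stand-ins for
  // the three values being accumulated; swap in your own types.
  type Acc = (Long, Long, Int)

  val zero: Acc = (0L, 0L, Int.MinValue)

  // seqOp: fold one element into the running triple of a partition.
  def seqOp(acc: Acc, n: Int): Acc =
    (acc._1 + n, acc._2 + 1, math.max(acc._3, n))

  // combOp: merge the triples produced by two partitions.
  def combOp(a: Acc, b: Acc): Acc =
    (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))

  // Local stand-in for RDD.aggregate: fold each "partition", then merge.
  // On an RDD[Int] the equivalent call would be data.aggregate(zero)(seqOp, combOp).
  def aggregateLocally(partitions: Seq[Seq[Int]]): Acc =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(3, 1, 4), Seq(1, 5), Seq(9, 2, 6))
    val (sum, count, max) = aggregateLocally(partitions)
    println(s"sum=$sum count=$count max=$max") // prints "sum=31 count=8 max=9"
  }
}
```

The point of the triple is that every element is visited once and all three results are updated in the same step, so the data is scanned a single time instead of three.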

On Tue, Feb 23, 2016 at 10:40 PM, Michał Zieliński <
zielinski.michal0@gmail.com> wrote:

> Do you mean something like this?
>
> data.agg(sum("var1"),sum("var2"),sum("var3"))

Re: Performing multiple aggregations over the same data

Posted by Michał Zieliński <zi...@gmail.com>.
Do you mean something like this?

data.agg(sum("var1"), sum("var2"), sum("var3"))
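
For context, a rough sketch of how that one-liner might look end to end, assuming data is a DataFrame with integer columns named var1, var2 and var3 (the thread does not say what the data actually looks like). It uses SparkSession, which exists in Spark 2.0 and later; on the 1.x releases current at the time of this thread you would build the DataFrame from a SQLContext instead. The object name MultiSum, the helper sums and the sample rows are hypothetical, and the Spark SQL dependency has to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object MultiSum {
  // Computes all three sums in a single agg call, i.e. one job over the data.
  // Summing integer columns yields Long values in Spark SQL.
  def sums(data: DataFrame): (Long, Long, Long) = {
    val row = data.agg(sum("var1"), sum("var2"), sum("var3")).first()
    (row.getLong(0), row.getLong(1), row.getLong(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-sum")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the real dataset.
    val data = Seq((1, 10, 100), (2, 20, 200), (3, 30, 300))
      .toDF("var1", "var2", "var3")

    println(sums(data))
    spark.stop()
  }
}
```

This only fits if the three aggregations can be expressed as column functions; for arbitrary accumulation logic the RDD aggregate approach is the more general route.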
