Posted to user@spark.apache.org by Andrés Ivaldi <ia...@gmail.com> on 2016/11/03 18:29:36 UTC

Aggregation Calculation

Hello, I need to perform some aggregations and a kind of Cube/RollUp
calculation.

From some tests, it looks like Cube and RollUp perform the aggregation over
all possible column combinations, but I only need some specific
combinations.

What I'm trying to build is like a data table where the first N columns are
my rows, the next M values are my columns, and the last columns are the
aggregated values, like Dimensions / Measures.

I need all the values of the N and M columns and the ones that correspond
to the aggregation function. I'll never need the rows where a previous
column has no value, i.e.

with N=2, so two columns as rows, I'll need
R1 | R2  ....
##  |  ## ....
##  |   null ....

but not
null | ## ....

as rollup does; the same approach applies to the M columns.


So the question is: what would be the best way to perform this calculation?
Using rollUp/Cube gives me a lot of values that I don't need.
Using groupBy gives me less information (I could do several groupBy calls,
but I think that is not performant).
Is there any other way to do something like that?
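
To make the difference between the candidates concrete, here is a minimal
sketch in plain Python (no Spark needed) that lists the grouping sets each
operator would compute. It assumes the standard SQL definitions, where a
rollup expands to every prefix of the column list and a cube expands to
every subset; whether a given Spark version matches this exactly should be
checked against its documentation. The function names (`rollup_sets`,
`cube_sets`, `prefix_sets`) and the column names `r1`, `r2` are
hypothetical, chosen only for illustration.

```python
from itertools import combinations


def rollup_sets(cols):
    """Grouping sets of a standard ROLLUP: every prefix, down to the empty set."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube_sets(cols):
    """Grouping sets of a standard CUBE: every subset of the columns."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


def prefix_sets(cols):
    """Non-empty prefixes only: no gaps on the left and no grand total."""
    return [tuple(cols[:i]) for i in range(len(cols), 0, -1)]


cols = ["r1", "r2"]
print("rollup:", rollup_sets(cols))
print("cube:  ", cube_sets(cols))
print("wanted:", prefix_sets(cols))
```

Under these definitions the unwanted "null | ##" row corresponds to the set
`("r2",)`, which the cube list contains and the two prefix-based lists do
not.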

Thanks.





-- 
Ing. Ivaldi Andres

Re: Aggregation Calculation

Posted by Andrés Ivaldi <ia...@gmail.com>.

OK, so I've read that rollup is just syntactic sugar for GROUPING SETS (...).
In that case I just need to use GROUPING SETS, but the examples in the
documentation use GROUPING SETS with SQL syntax and I am doing it
programmatically, so I need the Dataset API, something like ds.rollup(..)
but for grouping sets.

Does anyone know how to do it?
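
One possible workaround that stays programmatic is to generate the SQL text
from the column lists and run it against a temporary view. The sketch below
only builds the query string, so it runs without Spark. The helper name
`grouping_sets_sql`, the view name `imports`, and the measure column
`amount` are assumptions made for illustration, not part of any Spark API.

```python
def grouping_sets_sql(view, dims, measure, sets):
    """Build a GROUP BY ... GROUPING SETS query over a registered view."""
    for s in sets:
        unknown = [c for c in s if c not in dims]
        if unknown:
            raise ValueError(f"columns not in dims: {unknown}")
    dim_list = ", ".join(dims)
    # Each grouping set becomes a parenthesised column list.
    set_list = ", ".join("(" + ", ".join(s) + ")" for s in sets)
    return (
        f"SELECT {dim_list}, sum({measure}) AS total "
        f"FROM {view} "
        f"GROUP BY {dim_list} GROUPING SETS ({set_list})"
    )


query = grouping_sets_sql(
    "imports", ["year", "month"], "amount", [("year", "month"), ("year",)]
)
print(query)
```

Assuming a SparkSession named `spark` and a Dataset registered with
`ds.createOrReplaceTempView("imports")`, the resulting string could then be
passed to `spark.sql(query)`.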

thanks.



On Thu, Nov 3, 2016 at 5:17 PM, Andrés Ivaldi <ia...@gmail.com> wrote:

> I'm not sure about inline views; they will still perform aggregations
> that I don't need. I think I didn't explain it right. I've already
> filtered the values that I need; the problem is that the default
> calculation of rollUp gives me some calculations that I don't want, like
> the aggregation by only the second column.
> Suppose three columns (Dataset columns) Year, Month, Import, and I want
> the aggregation sum(Import) for every Year/Month combination and also for
> each Year, but I don't care about sum(Import) by Month alone.
>
> As a table it will look like
>
> YEAR | MONTH | Sum(Import)
> 2006 | 1     | xxxx
> 2005 | 1     | xxxx
> 2005 | 2     | xxxx
> 2006 | null  | xxxx
> 2005 | null  | xxxx
> null | null  | xxxx
> null | 1     | xxxx
> null | 2     | xxxx
>
> The last three rows are not needed. In this example I could filter after
> the rollUp, but I build the query on demand, so it will grow depending on
> the number of rows and columns, and there will be a lot of combinations
> that I don't need.
>
> thanks



-- 
Ing. Ivaldi Andres

Re: Aggregation Calculation

Posted by Andrés Ivaldi <ia...@gmail.com>.

I'm not sure about inline views; they will still perform aggregations that
I don't need. I think I didn't explain it right. I've already filtered the
values that I need; the problem is that the default calculation of rollUp
gives me some calculations that I don't want, like the aggregation by only
the second column.
Suppose three columns (Dataset columns) Year, Month, Import, and I want the
aggregation sum(Import) for every Year/Month combination and also for each
Year, but I don't care about sum(Import) by Month alone.

As a table it will look like

YEAR | MONTH | Sum(Import)
2006 | 1     | xxxx
2005 | 1     | xxxx
2005 | 2     | xxxx
2006 | null  | xxxx
2005 | null  | xxxx
null | null  | xxxx
null | 1     | xxxx
null | 2     | xxxx

The last three rows are not needed. In this example I could filter after
the rollUp, but I build the query on demand, so it will grow depending on
the number of rows and columns, and there will be a lot of combinations
that I don't need.
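
As a small illustration of the result being asked for, the sketch below
computes only the chosen grouping sets in a single pass over the rows, in
plain Python rather than Spark. The sample rows, the column name `amount`,
and the function name `aggregate` are made up for the example; columns that
are not part of a grouping set are reported as None, mirroring the nulls in
the table above.

```python
from collections import defaultdict


def aggregate(rows, dims, measure, sets):
    """Sum `measure` once per grouping set; dims outside a set become None."""
    totals = defaultdict(int)
    for row in rows:
        for s in sets:
            key = tuple(row[d] if d in s else None for d in dims)
            totals[key] += row[measure]
    return dict(totals)


# Invented sample data, only to show the shape of the output.
rows = [
    {"year": 2005, "month": 1, "amount": 10},
    {"year": 2005, "month": 2, "amount": 20},
    {"year": 2006, "month": 1, "amount": 5},
]

# Only (year, month) and (year): no month-only rows and no grand total.
result = aggregate(rows, ["year", "month"], "amount", [("year", "month"), ("year",)])

# Sort by year, placing each year's subtotal (month is None) after its months.
for key, total in sorted(
    result.items(), key=lambda kv: (kv[0][0], kv[0][1] is None, kv[0][1] or 0)
):
    print(key, total)
```

Because the list of grouping sets is an ordinary argument, it can be built
on demand from whatever rows and columns the query needs.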

thanks

On Thu, Nov 3, 2016 at 4:04 PM, Stephen Boesch <ja...@gmail.com> wrote:

> You would likely want to create inline views that perform the filtering
> *before* performing the cubes/rollup; in this way the cubes/rollups only
> operate on the pruned rows/columns.

-- 
Ing. Ivaldi Andres

Re: Aggregation Calculation

Posted by Stephen Boesch <ja...@gmail.com>.

You would likely want to create inline views that perform the filtering
*before* performing the cubes/rollup; in this way the cubes/rollups only
operate on the pruned rows/columns.
