Posted to user@spark.apache.org by Julien CHAMP <jc...@tellmeplus.com> on 2017/12/15 10:32:30 UTC

Several Aggregations on a window function

Hi Spark community members!

I want to compute several (from 1 to 10) aggregate functions using window
functions on something like 100 columns.

Instead of making several passes over the data to compute each aggregate
function, is there a way to do this efficiently?



Currently it seems that doing


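import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, max, min}

// 8035200000 ms = 93 days, assuming "date" holds epoch milliseconds
val tw =
  Window
    .partitionBy("id")
    .orderBy("date")
    .rangeBetween(-8035200000L, 0)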

and then

x
   .withColumn("agg1", max("col").over(tw))
   .withColumn("agg2", min("col").over(tw))
   .withColumn("aggX", avg("col").over(tw))


is not really efficient :/
It seems that it iterates over the whole column for each aggregation. Am I
right?

Is there a way to compute all the required operations on a column in a
single pass?
Even better, to compute all the required operations on ALL columns in a
single pass?
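One way to approach the question above, sketched under assumptions: build every window expression first and add them all in a single `select`, reusing the same window specification. The helper `withWindowAggs`, the `aggs` list and the `<column>_<agg>` naming are hypothetical, and `date` is assumed to be a numeric (epoch millisecond) column. As far as I know, Spark can evaluate window expressions that share the same partitioning and ordering in one Window operator, but it is worth confirming this with `explain()` on your Spark version.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{avg, col, max, min}

object WindowAggs {
  // Hypothetical list of (suffix, aggregate) pairs applied to every value column.
  val aggs: Seq[(String, Column => Column)] = Seq(
    ("max", c => max(c)),
    ("min", c => min(c)),
    ("avg", c => avg(c))
  )

  // Same frame as in the question; assumes "date" is a numeric column.
  val tw: WindowSpec = Window
    .partitionBy("id")
    .orderBy("date")
    .rangeBetween(-8035200000L, 0L)

  // Build all window expressions up front and add them in one select,
  // instead of chaining one withColumn call per aggregate.
  def withWindowAggs(df: DataFrame, valueCols: Seq[String]): DataFrame = {
    val windowCols: Seq[Column] = for {
      name <- valueCols
      (suffix, agg) <- aggs
    } yield agg(col(name)).over(tw).as(s"${name}_$suffix")
    df.select(col("*") +: windowCols: _*)
  }
}
```

With the DataFrame `x` from the question, `WindowAggs.withWindowAggs(x, Seq("col"))` would add `col_max`, `col_min` and `col_avg`. Calling `x.explain()` on the result shows how many Window operators the physical plan actually contains.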

Thx for your Future[Answers]

Julien





-- 


Julien CHAMP — Data Scientist


Web: www.tellmeplus.com <http://tellmeplus.com/>
Email: jchamp@tellmeplus.com
Phone: 06 89 35 01 89
LinkedIn: <https://www.linkedin.com/in/julienchamp>

TellMePlus S.A — Predictive Objects

Paris: 7 rue des Pommerots, 78400 Chatou
Montpellier: 51 impasse des églantiers, 34980 St Clément de Rivière

-- 

This email may contain confidential and/or privileged information for the 
intended recipient. If you are not the intended recipient, please contact 
the sender and delete all copies.



Re: Several Aggregations on a window function

Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi,

You can use https://twitter.github.io/algebird/ which provides
implementations of interesting Monoids and ways to combine them into tuples
(or products) of Monoids. Of course, you are not bound to use the Algebird
library, but it might help you get started.
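The idea of combining monoids into tuples can be sketched without any library. The following is a hand-rolled illustration and not Algebird's API: the `Monoid` trait, `product` and `SinglePass.stats` are hypothetical names chosen for this example. It shows how max, min and average can share a single traversal of the data.

```scala
// A minimal monoid: an identity element plus an associative combine.
trait Monoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

object Monoid {
  val maxDouble: Monoid[Double] = new Monoid[Double] {
    val zero: Double = Double.NegativeInfinity
    def plus(x: Double, y: Double): Double = math.max(x, y)
  }

  val minDouble: Monoid[Double] = new Monoid[Double] {
    val zero: Double = Double.PositiveInfinity
    def plus(x: Double, y: Double): Double = math.min(x, y)
  }

  // A (sum, count) pair, from which the average is derived at the end.
  val sumCount: Monoid[(Double, Long)] = new Monoid[(Double, Long)] {
    val zero: (Double, Long) = (0.0, 0L)
    def plus(x: (Double, Long), y: (Double, Long)): (Double, Long) =
      (x._1 + y._1, x._2 + y._2)
  }

  // The product of two monoids is a monoid on pairs, combined component-wise.
  def product[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      val zero: (A, B) = (ma.zero, mb.zero)
      def plus(x: (A, B), y: (A, B)): (A, B) =
        (ma.plus(x._1, y._1), mb.plus(x._2, y._2))
    }
}

object SinglePass {
  import Monoid._

  final case class Stats(max: Double, min: Double, avg: Double)

  // One combined monoid carrying max, min and (sum, count) together.
  private val all = product(maxDouble, product(minDouble, sumCount))

  // A single fold over the data updates every aggregate at once.
  def stats(xs: Iterable[Double]): Stats = {
    val (mx, (mn, (sum, n))) =
      xs.foldLeft(all.zero)((acc, x) => all.plus(acc, (x, (x, (x, 1L)))))
    Stats(mx, mn, if (n == 0L) Double.NaN else sum / n)
  }
}
```

Because `plus` is associative, partial results computed on separate chunks of data can be merged afterwards, which is what makes this pattern suitable for distributed aggregation.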



On Mon, Dec 18, 2017 at 7:18 PM, Julien CHAMP <jc...@tellmeplus.com> wrote:

> It seems interesting; however, Scalding seems to need to be used outside of
> Spark?



-- 
-- Anastasios Zouzias
<az...@zurich.ibm.com>

Re: Several Aggregations on a window function

Posted by Julien CHAMP <jc...@tellmeplus.com>.
It seems interesting; however, Scalding seems to need to be used outside of
Spark?


On Mon, Dec 18, 2017 at 17:15, Anastasios Zouzias <zo...@gmail.com> wrote:

> Hi Julien,
>
> I am not sure if my answer applies to the streaming part of your question.
> However, in batch processing, if you want to perform multiple aggregations
> over an RDD in a single pass, a common approach is to use multiple
> aggregators (a.k.a. tuple monoids); see the example below from Algebird:
>
> https://github.com/twitter/scalding/wiki/Aggregation-using-Algebird-Aggregators#composing-aggregators
>
> Best,
> Anastasios

Re: Several Aggregations on a window function

Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi Julien,

I am not sure if my answer applies to the streaming part of your question.
However, in batch processing, if you want to perform multiple aggregations
over an RDD in a single pass, a common approach is to use multiple
aggregators (a.k.a. tuple monoids); see the example below from Algebird:

https://github.com/twitter/scalding/wiki/Aggregation-using-Algebird-Aggregators#composing-aggregators
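For the batch case described above, Spark's own `RDD.aggregate` takes a zero value, a function that folds one element into the accumulator, and a function that merges two accumulators, so a tuple accumulator yields max, min and average in one pass. This is a minimal sketch assuming an `RDD[Double]`; `RddStats` and its members are hypothetical names. Note that it computes one result for the whole RDD, not a sliding range frame like the `tw` window in the original question.

```scala
import org.apache.spark.rdd.RDD

object RddStats {
  // Running state: (max, min, sum, count).
  type Acc = (Double, Double, Double, Long)

  val zero: Acc = (Double.NegativeInfinity, Double.PositiveInfinity, 0.0, 0L)

  // Fold one value into the running state (applied within a partition).
  def add(a: Acc, x: Double): Acc =
    (math.max(a._1, x), math.min(a._2, x), a._3 + x, a._4 + 1L)

  // Merge the states of two partitions.
  def merge(a: Acc, b: Acc): Acc =
    (math.max(a._1, b._1), math.min(a._2, b._2), a._3 + b._3, a._4 + b._4)

  // One pass over the RDD; returns (max, min, avg).
  def maxMinAvg(values: RDD[Double]): (Double, Double, Double) = {
    val (mx, mn, sum, n) = values.aggregate(zero)(add, merge)
    (mx, mn, if (n == 0L) Double.NaN else sum / n)
  }
}
```

For per-key results (for example one result per `id`), the same `add` and `merge` functions could be passed to `aggregateByKey` on a pair RDD.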

Best,
Anastasios

On Mon, Dec 18, 2017 at 10:38 AM, Julien CHAMP <jc...@tellmeplus.com>
wrote:

> I've looked at several solutions, but I can't find an efficient way to
> compute many window functions (optimized computation or efficient
> parallelism).
> Am I the only one interested in this?
>
>
> Regards,
>
> Julien

Re: Several Aggregations on a window function

Posted by Julien CHAMP <jc...@tellmeplus.com>.
I've looked at several solutions, but I can't find an efficient way to
compute many window functions (optimized computation or efficient
parallelism).
Am I the only one interested in this?


Regards,

Julien

On Fri, Dec 15, 2017 at 21:34, Julien CHAMP <jc...@tellmeplus.com> wrote:

> Maybe I should consider something like Impala?

Re: Several Aggregations on a window function

Posted by Julien CHAMP <jc...@tellmeplus.com>.
Maybe I should consider something like Impala?
