You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Julien CHAMP <jc...@tellmeplus.com> on 2017/07/04 15:45:54 UTC

Window function / streaming

Hi there !

Let me explain my problem to see if you have a good solution to help me :)

Let's imagine that I have all my data in a DB or a file, that I load in a
dataframe DF with the following columns :
*id | timestamp(ms) | value*
A | 1000000 |  100
A | 1000010 |  50
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500
C | 1000000 |  200
C | 1000010 |  500

The timestamp is a *long value*, so as to be able to express date in ms
from 0000-01-01 to today !

I want to compute operations such as min, max, average on the *value column*,
for a given window function, and grouped by id ( Bonus :  if possible for
only some timestamps... )

For example if I have 3 tuples :

id | timestamp(ms) | value
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500

I would like to be able to compute the min value for windows of time = 20.
This would result in such a DF :

id | timestamp(ms) | value | min___value
B | 1000000 |  100 | 100
B | 1000010 |  50  | 50
B | 1000020 |  200 | 50
B | 2500000 |  500 | 500

This seems the perfect use case for window function in spark  ( cf :
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
 )
I can use :

Window.orderBy("timestamp").partitionBy("id").rangeBetween(-20,0)
df.withColumn("min___value", min(df.col("value")).over(tw))

This leads to the perfect answer !

However, there is a big bug with window functions as reported here (
https://issues.apache.org/jira/browse/SPARK-19451 ) when working with Long
values !!! So I can't use this....

So my question is ( of course ) how can I resolve my problem ?
If I use spark streaming I will face the same issue ?

I'll be glad to discuss this problem with you, feel free to answer :)

Regards,

Julien




-- 


Julien CHAMP — Data Scientist


*Web : **www.tellmeplus.com* <http://tellmeplus.com/> — *Email :
**jchamp@tellmeplus.com
<jc...@tellmeplus.com>*

*Phone ** : **06 89 35 01 89 <0689350189> * — *LinkedIn* :  *here*
<https://www.linkedin.com/in/julienchamp>

TellMePlus S.A — Predictive Objects

*Paris* : 7 rue des Pommerots, 78400 Chatou
*Montpellier* : 51 impasse des églantiers, 34980 St Clément de Rivière

-- 

Ce message peut contenir des informations confidentielles ou couvertes par 
le secret professionnel, à l’intention de son destinataire. Si vous n’en 
êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer 
toute copie.
This email may contain confidential and/or privileged information for the 
intended recipient. If you are not the intended recipient, please contact 
the sender and delete all copies.


-- 
 <http://www.tellmeplus.com/assets/emailing/banner.html>