You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Julien CHAMP <jc...@tellmeplus.com> on 2017/07/04 15:45:54 UTC
Window function / streaming
Hi there !
Let me explain my problem to see if you have a good solution to help me :)
Let's imagine that I have all my data in a DB or a file, that I load in a
dataframe DF with the following columns :
*id | timestamp(ms) | value*
A | 1000000 | 100
A | 1000010 | 50
B | 1000000 | 100
B | 1000010 | 50
B | 1000020 | 200
B | 2500000 | 500
C | 1000000 | 200
C | 1000010 | 500
The timestamp is a *long value*, so as to be able to express date in ms
from 0000-01-01 to today !
I want to compute operations such as min, max, average on the *value column*,
for a given window function, and grouped by id ( Bonus : if possible for
only some timestamps... )
For example if I have 3 tuples :
id | timestamp(ms) | value
B | 1000000 | 100
B | 1000010 | 50
B | 1000020 | 200
B | 2500000 | 500
I would like to be able to compute the min value for windows of time = 20.
This would result in such a DF :
id | timestamp(ms) | value | min___value
B | 1000000 | 100 | 100
B | 1000010 | 50 | 50
B | 1000020 | 200 | 50
B | 2500000 | 500 | 500
This seems the perfect use case for window function in spark ( cf :
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
)
I can use :
Window.orderBy("timestamp").partitionBy("id").rangeBetween(-20,0)
df.withColumn("min___value", min(df.col("value")).over(tw))
This leads to the perfect answer !
However, there is a big bug with window functions as reported here (
https://issues.apache.org/jira/browse/SPARK-19451 ) when working with Long
values !!! So I can't use this....
So my question is ( of course ) how can I resolve my problem ?
If I use spark streaming I will face the same issue ?
I'll be glad to discuss this problem with you, feel free to answer :)
Regards,
Julien
--
Julien CHAMP — Data Scientist
*Web : **www.tellmeplus.com* <http://tellmeplus.com/> — *Email :
**jchamp@tellmeplus.com
<jc...@tellmeplus.com>*
*Phone ** : **06 89 35 01 89 <0689350189> * — *LinkedIn* : *here*
<https://www.linkedin.com/in/julienchamp>
TellMePlus S.A — Predictive Objects
*Paris* : 7 rue des Pommerots, 78400 Chatou
*Montpellier* : 51 impasse des églantiers, 34980 St Clément de Rivière
--
Ce message peut contenir des informations confidentielles ou couvertes par
le secret professionnel, à l’intention de son destinataire. Si vous n’en
êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer
toute copie.
This email may contain confidential and/or privileged information for the
intended recipient. If you are not the intended recipient, please contact
the sender and delete all copies.
--
<http://www.tellmeplus.com/assets/emailing/banner.html>