Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/09/05 12:42:00 UTC

[jira] [Resolved] (SPARK-28975) How do I overwrite a piece of data and recalculate it?

     [ https://issues.apache.org/jira/browse/SPARK-28975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-28975.
----------------------------------
    Resolution: Invalid

Please ask questions on the mailing list before filing an issue here.

> How do I overwrite a piece of data and recalculate it?
> ------------------------------------------------------
>
>                 Key: SPARK-28975
>                 URL: https://issues.apache.org/jira/browse/SPARK-28975
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: ruiliang
>            Priority: Blocker
>              Labels: spark
>         Attachments: image-2019-09-05-01-14-26-620.png
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> [Translated from Chinese]
> I need real-time statistics of today's total order amount and order count. The pushed data contains duplicate order IDs, so I need to deduplicate by order ID and order time: keep the order with the latest order time, overwrite the earlier order data, and then recompute.
> I saw this feature in the documentation:
>
> {{// Without watermark using guid column}}
> {{streamingDf.dropDuplicates("guid")}}
>
> But this only prevents duplicate rows from being added; it does not overwrite the old data. I want the new data to overwrite the old data and then recompute aggregates such as sum(money), but I could not find an API for this.
> I also considered the StructuredSessionization example, which maintains state per ID. In that example, is it possible to compute the total online time across all IDs? I want a computation like state: GroupState[SessionInfo] -> sum(durationMs). Is there any other solution? Thank you.
>  
> !image-2019-09-05-01-14-26-620.png!
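[Editorial note] The keep-latest-then-aggregate semantics the question describes can be sketched in plain Scala, independent of Spark. The Order case class, its fields (orderId, orderTime, amount), and the KeepLatest object are illustrative assumptions, not part of any Spark API:

```scala
// Sketch of the desired semantics: deduplicate by order ID,
// keeping the record with the latest order time, then aggregate.
// All names here are hypothetical, chosen to mirror the question.
case class Order(orderId: String, orderTime: Long, amount: Double)

object KeepLatest {
  // For each orderId, keep only the record with the greatest orderTime.
  def dedupeKeepLatest(orders: Seq[Order]): Seq[Order] =
    orders
      .groupBy(_.orderId)
      .values
      .map(_.maxBy(_.orderTime)) // latest record wins
      .toSeq

  // Recompute the total amount over the deduplicated records.
  def totalAmount(orders: Seq[Order]): Double =
    dedupeKeepLatest(orders).map(_.amount).sum
}
```

In a streaming job, the equivalent per-key "latest record wins" state would have to be maintained explicitly (e.g. with Spark's stateful operators), since dropDuplicates only suppresses later duplicates and never replaces an already-emitted row.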



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org