Posted to issues@spark.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2019/06/27 22:08:00 UTC
[jira] [Comment Edited] (SPARK-28192) Data Source - State - Write side

    [ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874538#comment-16874538 ] 

Jungtaek Lim edited comment on SPARK-28192 at 6/27/19 10:07 PM:
----------------------------------------------------------------

I realized that the new DSv2 (maybe the old DSv2 too?) requires the DataFrame to be partitioned correctly before it is handed to the sink. That does not hold for the state writer's input, and unfortunately there's no storage coordinating this. The writer should repartition by key itself, which was possible with DSv1 (since it provides the DataFrame to write) but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]

[~rdblue] [~cloud_fan] What would be the best way to address this? Would I need to wrap this in some method that handles the repartition before the data is added to the sink?
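
To make the concern concrete, here is a minimal sketch in plain Python (not Spark code) of why rows have to be shuffled by key before a per-partition state writer sees them. All names (`NUM_PARTITIONS`, `partition_for`, `repartition_by_key`, `misplaced_rows`) are hypothetical, and CRC32 is only a stand-in for Spark's hash partitioning, which uses a different hash function.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical; stands in for the number of state store partitions


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition id. CRC32 is an assumption used for illustration;
    Spark's own hash partitioning does not use CRC32."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition_by_key(input_partitions, num_partitions=NUM_PARTITIONS):
    """Shuffle (key, value) rows so each row lands in the partition its key hashes to."""
    out = [[] for _ in range(num_partitions)]
    for rows in input_partitions:
        for key, value in rows:
            out[partition_for(key, num_partitions)].append((key, value))
    return out


def misplaced_rows(partitions, num_partitions=NUM_PARTITIONS):
    """List (partition id, key) pairs for rows that sit in the wrong partition."""
    return [
        (pid, key)
        for pid, rows in enumerate(partitions)
        for key, _ in rows
        if partition_for(key, num_partitions) != pid
    ]


# Input as a writer might receive it: key "a" appears in two different partitions,
# so at least one of those rows cannot be in the partition its key hashes to.
incoming = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
shuffled = repartition_by_key(incoming)
```

After the shuffle, `misplaced_rows(shuffled)` is empty, while the raw `incoming` layout contains at least one misplaced row. With DSv1, where the writer is given the DataFrame, the corresponding step would presumably be a `Dataset.repartition(numPartitions, partitionExprs)` call on the key columns; the open question above is where that step can live with DSv2.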


was (Author: kabhwan):
I realized that the new DSv2 (maybe the old DSv2 too?) requires the DataFrame to be partitioned correctly before it is handed to the sink. That does not hold for the state writer's input, as there's no storage coordinating this. The writer should repartition by key itself, which was possible with DSv1 (since it provides the DataFrame to write) but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]

[~rdblue] [~cloud_fan] What would be the best way to address this? Would I need to wrap this in some method that handles the repartition before the data is added to the sink?

> Data Source - State - Write side
> --------------------------------
>
>                 Key: SPARK-28192
>                 URL: https://issues.apache.org/jira/browse/SPARK-28192
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> This issue tracks the efforts on addressing batch write on state data source.
> It could include "state repartition" if that doesn't require a huge effort with the new DSv2, but it can also be moved out to a separate issue.


