Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/08/26 17:00:00 UTC

[jira] [Commented] (SPARK-35611) Introduce the strategy on mismatched offset for start offset timestamp on Kafka data source

    [ https://issues.apache.org/jira/browse/SPARK-35611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405367#comment-17405367 ] 

Apache Spark commented on SPARK-35611:
--------------------------------------

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/33854

> Introduce the strategy on mismatched offset for start offset timestamp on Kafka data source
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35611
>                 URL: https://issues.apache.org/jira/browse/SPARK-35611
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.2.0
>
>
> 1. Rationale
> We encountered a real-world case where Spark fails the query if some of the partitions don't have a matching offset for the given timestamp.
> This is intended behavior, to avoid producing unintended output in cases like the following:
> * timestamp 2 is specified as the timestamp-offset, but some of the partitions don't have a matching record yet
> * a record with timestamp 1 arrives "later", in the following micro-batch
> which is possible since Kafka allows producers to specify the timestamp on a record.
> Here the unintended output we mentioned is the risk of reading the record with timestamp 1 in the next micro-batch despite the option specifying timestamp 2.
> But in many cases end users simply assume the timestamp increases monotonically, and the current behavior blocks these cases from making progress.
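>
> For illustration, a minimal sketch of a query that can hit this failure. The broker address, topic name, and timestamp value here are hypothetical:
> {code:scala}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker1:9092")
>   .option("subscribe", "events")
>   // Start each partition from the first offset whose record timestamp is
>   // greater than or equal to the given value (milliseconds since epoch).
>   .option("startingOffsetsByTimestamp",
>     """{"events": {"0": 1622505600000, "1": 1622505600000}}""")
>   .load()
> {code}
> If, say, partition 1 of "events" has no record with timestamp >= 1622505600000 yet, the query currently fails while resolving the starting offsets.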
> 2. Proposal
> For cases where the timestamp is expected to increase monotonically, it's safe to treat the offset as the latest one (technically, the offset of the latest record + 1) if there's no record matching the timestamp.
> This would be quite helpful for the case where there's skew between partitions and some partitions only have older records.
> * AS-IS: Spark simply fails the query, and end users have to deal with workarounds requiring manual steps.
> * TO-BE: Spark assigns the latest offset for these partitions, so that it can read newer records from them in later micro-batches.
> To retain the existing behavior while also supporting the proposed "TO-BE" behavior, we'd like to introduce a strategy option for mismatched offsets when the start offset is given as a timestamp.
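>
> As a sketch of the proposed behavior, the same query could opt in via the new option. The option name and values below reflect what is targeted for 3.2.0 (default "error" keeps the current behavior; "latest" enables the "TO-BE" behavior); treat the exact spelling as an assumption until the final docs land:
> {code:scala}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker1:9092")
>   .option("subscribe", "events")
>   .option("startingOffsetsByTimestamp",
>     """{"events": {"0": 1622505600000, "1": 1622505600000}}""")
>   // "error"  : fail the query on a mismatched offset (existing behavior, default).
>   // "latest" : assign the latest offset to partitions with no matching record,
>   //            so newer records are picked up in later micro-batches.
>   .option("startingOffsetsByTimestampStrategy", "latest")
>   .load()
> {code}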



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org