Posted to user@spark.apache.org by puneetloya <pu...@gmail.com> on 2018/11/18 00:52:55 UTC

[Spark Structured Streaming]: Read Kafka offset from a timestamp

I would like to request a feature for reading data from the Kafka source based
on a timestamp, so that if an application needs to process data from a certain
time, it is able to do so. I agree that checkpoints give us continuation of
the stream processing, but what if I want to rewind past the checkpoints?
According to Spark experts, it is not advised to edit checkpoints, and finding
the right offsets for Spark to replay is tricky. Replaying from a certain
timestamp is a lot easier, at least with a decent monitoring system that shows
the time from which things started to fall apart, such as a buggy push or a
bad settings change.

The Kafka consumer API already supports this via the offsetsForTimes method,
which can easily resolve a timestamp to the right offsets:

https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/Consumer.html#offsetsForTimes
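For illustration, here is a minimal Scala sketch of an offsetsForTimes lookup
against a plain Kafka consumer. The broker address, topic name, and group id
are assumed placeholders:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.ByteArrayDeserializer

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
    props.put("group.id", "offset-lookup")            // throwaway lookup group
    props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
    props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    try {
      val topic = "events"                                       // assumed topic
      val ts = System.currentTimeMillis() - 2 * 60 * 60 * 1000L  // two hours ago

      // Ask for the earliest offset whose timestamp is >= ts, per partition.
      val query = consumer.partitionsFor(topic).asScala
        .map(p => new TopicPartition(topic, p.partition) -> java.lang.Long.valueOf(ts))
        .toMap.asJava

      // offsetsForTimes returns null for partitions with no message at/after ts.
      consumer.offsetsForTimes(query).asScala.foreach { case (tp, oat) =>
        println(s"$tp -> ${Option(oat).map(_.offset).getOrElse("none")}")
      }
    } finally {
      consumer.close()
    }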

Similar to the existing startingOffsets and endingOffsets options, the source
could support startTimestamp and endTimestamp options.
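As a purely hypothetical sketch of what the proposed API could look like on
the source (startTimestamp is the name suggested above, not an option the
Kafka source actually supports today, and `spark` is assumed to be an active
SparkSession):

    // Hypothetical: "startTimestamp" is only the option name proposed in this
    // thread; the source currently accepts startingOffsets/endingOffsets.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .option("startTimestamp", "1542499200000")           // proposed: epoch millis
      .load()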

In a SaaS environment, where data keeps flowing continuously, small tweaks
like this can help us repair our systems. Spark Structured Streaming is
already great, but features like these will keep things under control in a
live production processing environment.

Cheers,
Puneet





Re: [Spark Structured Streaming]: Read Kafka offset from a timestamp

Posted by Jungtaek Lim <ka...@gmail.com>.
It really depends on whether we use it only for starting a query (instead of
restoring from a checkpoint), or we also want to restore a previous batch from
a specific time (restoring state as well).

The former makes sense, and I'll try to see whether I can address it. The
latter doesn't look possible: we normally want to go back at least a couple of
hours, by which time hundreds of batches could have been processed and no
information for those batches is left. It looks like you're talking about the
latter, but if we agree the former is also helpful enough, I can take a look
at it.

By the way, it might also help with batch queries; actually, it sounds even
more helpful for batch queries.
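As a sketch of the workaround that exists for batch queries today: resolve the
timestamps to offsets yourself (for example with offsetsForTimes, as in the
original mail) and pass them in the JSON form that startingOffsets and
endingOffsets already accept. Topic, partitions, and offset values below are
made-up placeholders, and `spark` is an assumed SparkSession:

    // Batch read over an offset range derived from timestamps out of band.
    val batch = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", """{"events":{"0":12000,"1":11850}}""")
      .option("endingOffsets",   """{"events":{"0":15000,"1":14900}}""")
      .load()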

-Jungtaek Lim (HeartSaVioR)
