You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Michael D. Coon" <md...@yahoo.com.INVALID> on 2016/06/01 18:21:12 UTC

KStreams Rewind Offset

All,
  I think it's great that the ProcessorContext offers the partition and offset of the current record being processed; however, it offers no way for me to actually use the information. I would like to be able to rewind to a particular offset on a partition if I needed to. The consumer is also not exposed to me so I couldn't access things directly that way either. Is this in the works or would it interfere with rebalancing/auto-commits?
Mike


Re: KStreams Rewind Offset

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Michael,

Just want to clarify a few thing before I can provide some suggestions in
simplifying your pipeline with Kafka Streams: as for the "race condition
between reading the stream data and the data being removed", do you mean
that the stream records is actually mutable, hence after they are read,
their content could be modified? For example:

stream at time 0:                                       (begin), (edge1),
(edge2), ... (end)

graphDB read up to (edge2) at time 1:     (begin), (edge1), (edge2), ...
(end)

stream gets modified at time 0:                (begin), *(edge1: deleted)*,
(edge2), ... (end)


If that is the case, I would suggest you make the stream records immutable,
but append "removal" records to the end of the stream when removal happens,
which matches better with the stream processing principles, for example:

    (edge1: created), (edge2: created), (edge1: removed), ...

So a later "removed" record can be treated just as an update to a previous
record with the same key "edge1", and upon reading it the graphDB can just
cleanup the created edge accordingly.


If there are other race causes that prevent you to make immutable live
streams, please let me know and we can discuss further.


Guozhang




On Thu, Jun 2, 2016 at 6:55 AM, Michael D. Coon <md...@yahoo.com.invalid>
wrote:

> Mattias,
>    That's disappointing given that Kafka offers me the ability to rewind
> and replay data. My use case is that we are building graph data structures
> based on data indexed from a live stream. At any time, the live data
> content may be marked for deletion for any number of reasons; but during
> that marking process if a graph structure is being built, it may not
> realize the data was marked for deletion (i.e. there is a race between
> graph referencing the data and the data being removed).
>
>    We need to be able to subsequently go back and clean up the graph data
> once we realize the graph contains data that was marked for deletion. But
> we can't delete/cleanup the graph until it completes...so we thought we
> could track all data referenced by the graph being created and once it was
> complete, subsequently replay the data references and determine if any were
> marked for removal and subsequently clean up the graph. We hoped that by
> sending "start/end" indicators into a graph data reference topic, some
> KStreams flow could see the "end", recognize that the graph completed, and
> simply replay all its data references to cleanup the graph. I guess we
> could use a standard consumer and do this outside of KStreams. Not a big
> deal...was just hoping to keep things in the KStreams realm. I'm sure there
> are other ways to solve this even outside of using Kafka at all; but why do
> that? :)
> Mike
>
>
>
>     On Thursday, June 2, 2016 8:59 AM, Matthias J. Sax <
> matthias@confluent.io> wrote:
>
>
>  Hi Mike,
>
> currently, this is not possible. We are already discussing some changes
> with regard to reprocess. However, I doubt that going back to a specific
> offset of a specific partition will be supported as it would be too
> difficult to reset the internal data structures and intermediate results
> correctly (also with regard to committing)
>
> What is your exact use case? What kind of feature are you looking for?
> We are always interested to get feedback/idea from users.
>
>
> -Matthias
>
> On 06/01/2016 08:21 PM, Michael D. Coon wrote:
> > All,
> >  I think it's great that the ProcessorContext offers the partition and
> offset of the current record being processed; however, it offers no way for
> me to actually use the information. I would like to be able to rewind to a
> particular offset on a partition if I needed to. The consumer is also not
> exposed to me so I couldn't access things directly that way either. Is this
> in the works or would it interfere with rebalancing/auto-commits?
> > Mike
> >
> >
>
>
>
>



-- 
-- Guozhang

Re: KStreams Rewind Offset

Posted by "Michael D. Coon" <md...@yahoo.com.INVALID>.
Mattias,
   That's disappointing given that Kafka offers me the ability to rewind and replay data. My use case is that we are building graph data structures based on data indexed from a live stream. At any time, the live data content may be marked for deletion for any number of reasons; but during that marking process if a graph structure is being built, it may not realize the data was marked for deletion (i.e. there is a race between graph referencing the data and the data being removed). 

   We need to be able to subsequently go back and clean up the graph data once we realize the graph contains data that was marked for deletion. But we can't delete/cleanup the graph until it completes...so we thought we could track all data referenced by the graph being created and once it was complete, subsequently replay the data references and determine if any were marked for removal and subsequently clean up the graph. We hoped that by sending "start/end" indicators into a graph data reference topic, some KStreams flow could see the "end", recognize that the graph completed, and simply replay all its data references to cleanup the graph. I guess we could use a standard consumer and do this outside of KStreams. Not a big deal...was just hoping to keep things in the KStreams realm. I'm sure there are other ways to solve this even outside of using Kafka at all; but why do that? :)
Mike

 

    On Thursday, June 2, 2016 8:59 AM, Matthias J. Sax <ma...@confluent.io> wrote:
 

 Hi Mike,

currently, this is not possible. We are already discussing some changes
with regard to reprocess. However, I doubt that going back to a specific
offset of a specific partition will be supported as it would be too
difficult to reset the internal data structures and intermediate results
correctly (also with regard to committing)

What is your exact use case? What kind of feature are you looking for?
We are always interested to get feedback/idea from users.


-Matthias

On 06/01/2016 08:21 PM, Michael D. Coon wrote:
> All,
>  I think it's great that the ProcessorContext offers the partition and offset of the current record being processed; however, it offers no way for me to actually use the information. I would like to be able to rewind to a particular offset on a partition if I needed to. The consumer is also not exposed to me so I couldn't access things directly that way either. Is this in the works or would it interfere with rebalancing/auto-commits?
> Mike
> 
> 


  

Re: KStreams Rewind Offset

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Hi Mike,

currently, this is not possible. We are already discussing some changes
with regard to reprocess. However, I doubt that going back to a specific
offset of a specific partition will be supported as it would be too
difficult to reset the internal data structures and intermediate results
correctly (also with regard to committing)

What is your exact use case? What kind of feature are you looking for?
We are always interested to get feedback/idea from users.


-Matthias

On 06/01/2016 08:21 PM, Michael D. Coon wrote:
> All,
>   I think it's great that the ProcessorContext offers the partition and offset of the current record being processed; however, it offers no way for me to actually use the information. I would like to be able to rewind to a particular offset on a partition if I needed to. The consumer is also not exposed to me so I couldn't access things directly that way either. Is this in the works or would it interfere with rebalancing/auto-commits?
> Mike
> 
>