Posted to issues@spark.apache.org by "Cody Koeninger (JIRA)" <ji...@apache.org> on 2016/10/08 18:45:20 UTC

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

    [ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558506#comment-15558506 ] 

Cody Koeninger commented on SPARK-17812:
----------------------------------------

I'm willing to do this work, mostly because I've already done it, but there are some user interface issues here that need to be settled first.

You already chose the name "startingOffset" for specifying the equivalent of auto.offset.reset.  Now we're looking at actually adding starting offsets.  Furthermore, it should be possible to specify starting offsets for some partitions, while relying on the equivalent of auto.offset.reset for other unspecified ones (the existing DStream does this).

What are you expecting configuration of this to look like?  I can see a couple of options:

1. Try to cram everything into startingOffset with some horrible string-based DSL
2. Have a separate option for specifying actual starting offsets, with a name that makes it clear what it is, yet doesn't reuse "startingOffset".  As for the value, I guess JSON of some kind, e.g.  { "topicfoo" : { "0": 1234, "1": 4567 }}
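
To make option 2 concrete, here is a minimal sketch (in Python, purely for illustration) of how a JSON value of that shape could be turned into per-partition starting offsets, with unspecified partitions falling back to the auto.offset.reset-style setting. The function names and the "latest" default are assumptions for this sketch, not an existing Spark API.

```python
import json

def parse_starting_offsets(value):
    """Parse {"topic": {"partition": offset}} into {(topic, partition): offset}."""
    parsed = json.loads(value)
    offsets = {}
    for topic, partitions in parsed.items():
        for partition, offset in partitions.items():
            # JSON object keys are always strings, so partition ids need converting
            offsets[(topic, int(partition))] = int(offset)
    return offsets

def resolve_offset(topic, partition, specified, reset_policy="latest"):
    """Use the explicit offset if one was given, otherwise the reset policy."""
    return specified.get((topic, partition), reset_policy)

specified = parse_starting_offsets('{ "topicfoo" : { "0": 1234, "1": 4567 }}')
print(resolve_offset("topicfoo", 0, specified))  # explicit offset
print(resolve_offset("topicfoo", 2, specified))  # falls back to the reset policy
```

Keeping the explicit offsets separate from the reset policy is what lets some partitions be pinned while the rest use the default.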

Somewhat related: Assign also needs a way of specifying topicpartitions.
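
One possible shape for that, sketched in Python for illustration only: a JSON object mapping each topic to a list of partition ids. Both the format and the function name are hypothetical; nothing here has been agreed on.

```python
import json

def parse_assignment(value):
    """Parse {"topic": [partition, ...]} into a sorted list of (topic, partition)."""
    parsed = json.loads(value)
    return sorted(
        (topic, int(partition))
        for topic, partitions in parsed.items()
        for partition in partitions
    )

assignment = parse_assignment('{ "topicfoo" : [0, 1], "topicbar" : [3] }')
print(assignment)
```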

As for the idea of seeking back X offsets, I think it'd be better to look at offset time indexing.
If you do go with seeking back X offsets, note that the offsets -1L and -2L already have special meaning, so allowing negative numbers in an interface that specifies offsets is going to be confusing.
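
A small sketch of the ambiguity, in Python for illustration. It assumes the Kafka convention that -1 stands for "latest" and -2 for "earliest"; the function name and the decision to reject other negative values are choices made for this sketch.

```python
LATEST = -1    # sentinel: start from the latest offset
EARLIEST = -2  # sentinel: start from the earliest offset

def interpret_offset(value):
    """Describe how an offset value would be read under the sentinel convention."""
    if value == LATEST:
        return "latest"
    if value == EARLIEST:
        return "earliest"
    if value < 0:
        # Any other negative number has no defined meaning here
        raise ValueError("negative offset %d is not a known sentinel" % value)
    return "absolute offset %d" % value

# Someone writing -2 to mean "two offsets back" would silently get "earliest".
print(interpret_offset(-2))
print(interpret_offset(1234))
```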


> More granular control of starting offsets
> -----------------------------------------
>
>                 Key: SPARK-17812
>                 URL: https://issues.apache.org/jira/browse/SPARK-17812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the earliest or latest offsets available at the moment the query is started.  Sometimes this is a lot of data.  It would be nice to be able to do the following:
>  - seek back {{X}} offsets in the stream from the moment the query starts
>  - seek to user specified offsets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
