Posted to commits@samza.apache.org by "Saulius Grigaliunas (JIRA)" <ji...@apache.org> on 2015/02/13 09:45:11 UTC

[jira] [Commented] (SAMZA-461) Race when initializing offsets at job startup leads to skipped messages

    [ https://issues.apache.org/jira/browse/SAMZA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319750#comment-14319750 ] 

Saulius Grigaliunas commented on SAMZA-461:
-------------------------------------------

Any chance of getting this into 0.8.1? Without this patch, 0.8.0 is unusable for us.

> Race when initializing offsets at job startup leads to skipped messages
> -----------------------------------------------------------------------
>
>                 Key: SAMZA-461
>                 URL: https://issues.apache.org/jira/browse/SAMZA-461
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>            Reporter: Ben Kirwin
>            Assignee: Ben Kirwin
>             Fix For: 0.9.0
>
>         Attachments: 0001-Default-to-upcoming-offset-when-stream-is-empty.patch
>
>
> If the default offset is set to oldest, a Samza job should start from the very beginning of the stream:
> {code}
> systems.kafka.samza.offset.default=oldest
> {code}
> However, if the very first messages are added to the stream while the job is booting up, it's possible for those messages to be skipped entirely.
> When there are no messages in a stream, Samza reads the 'oldest' offset as null. This null value is added to the map of starting offsets in the offset manager. When the Kafka broker proxy gets the null offset, it complains:
> {code}
> It appears that we received an invalid or empty offset [...] Attempting to use Kafka's auto.offset.reset setting. This can result in data loss if processing continues.
> {code}
> If auto.offset.reset is not manually configured, it defaults to starting from the latest offset. If messages have appeared in the stream in the meantime, the job will start *after* those messages, and data is indeed lost.
> It seems like setting oldestOffset to equal upcomingOffset would solve the issue. (It's also semantically reasonable -- the upcoming offset is indeed the oldest offset that will ever be read.)
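
For illustration, the fallback proposed in the description could look roughly like the Java sketch below. The names StartingOffsets, PartitionOffsets and resolveOldest are hypothetical and are not Samza's actual API; the attached patch may well implement this differently. The only assumption taken from the description is that an empty partition reports a null oldest offset but a valid upcoming offset.

```java
// Hypothetical sketch only: these types are NOT part of Samza.
class StartingOffsets {

    /** Stand-in for per-partition offset metadata (assumed shape). */
    static final class PartitionOffsets {
        final String oldest;   // null when the partition holds no messages yet
        final String upcoming; // offset the next written message will receive

        PartitionOffsets(String oldest, String upcoming) {
            this.oldest = oldest;
            this.upcoming = upcoming;
        }
    }

    /**
     * Returns the offset to start from when the default is "oldest".
     * For an empty partition this falls back to the upcoming offset,
     * so a null is never handed on to the consumer.
     */
    static String resolveOldest(PartitionOffsets offsets) {
        return offsets.oldest != null ? offsets.oldest : offsets.upcoming;
    }

    public static void main(String[] args) {
        // Empty partition: oldest is null, so the upcoming offset is used.
        System.out.println(resolveOldest(new PartitionOffsets(null, "0")));
        // Non-empty partition: the real oldest offset wins.
        System.out.println(resolveOldest(new PartitionOffsets("5", "42")));
    }
}
```

With a fallback like this, messages written while the job is booting would sit at or after the starting offset, so they would be read rather than skipped.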



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)