Posted to dev@storm.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2015/05/28 15:49:25 UTC

[jira] [Issue Comment Deleted] (STORM-837) HdfsState ignores commits

     [ https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated STORM-837:
--------------------------------------
    Comment: was deleted

(was: HdfsState is not a MapState.  It is just a State; there are no get operations supported on it, and it is a sink that writes all input to HDFS.  The issue is with others reading the data.  The readers in this case are likely to be a batch job using a Hadoop input format to read the data.  For regular Storm it provides at-most-once or at-least-once semantics, and the HdfsBolt, which is also a sink, provides the exact same semantics, so if there is duplicate or lost data it could be for more reasons than just the bolt not syncing things correctly.  In that case, however, I would like to see the spout not ack a tuple until it has been synced to disk; that way we can truly be sure no data is lost, but that is another issue.

For Trident we expect exactly-once semantics, especially from something that comes as an official part of Storm.  The file formats the data is written out in are just a log, with no ability to overwrite out-of-date data the way a MapState can.  They also have no knowledge of ZooKeeper or of which batch ids have or have not been fully committed.  And even if they did have that knowledge, the entries have no commit ID in them to let the reader know which ones to ignore while reading (see the sketch of such txid tagging below).)
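
For illustration, here is a minimal sketch of the txid tagging described above.  The writer class is hypothetical and not part of storm-hdfs; the assumption is that each record is prefixed with the Trident batch txid so a downstream batch reader can skip records from a batch it has already fully read:

    // Hypothetical helper, not Storm API: prefix every record with the
    // Trident batch txid so an external reader can dedupe replayed batches.
    import java.io.IOException;
    import java.io.Writer;

    public class TxidTaggedWriter {
        private final Writer out;

        public TxidTaggedWriter(Writer out) {
            this.out = out;
        }

        // txid is the id Trident passes to State.beginCommit(Long) for the batch.
        public void write(long txid, String record) throws IOException {
            out.write(txid + "\t" + record + "\n");
        }
    }

A reader deduping this output would keep a record only if its txid is greater than the last txid it knows was fully committed.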

> HdfsState ignores commits
> -------------------------
>
>                 Key: STORM-837
>                 URL: https://issues.apache.org/jira/browse/STORM-837
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly-once processing.  It does this in two ways: first, by informing the state about commits so it can be sure the data is written out; and second, by having a commit id, so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with them the ids.  This means that if you use HdfsState and your worker crashes, you may both lose data and get some data twice.
> At a minimum the flush and file rotation should be tied to the commit in some way (see the sketch after this quoted description).  The commit ID should also be written out with the data so someone reading the data has a hope of deduping it themselves.
> Also, with the rotationActions it is possible for a partially written file to be leaked and never moved to its final location, because it is never rotated.  I personally think the actions are too generic for this case and need to be deprecated.
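
A minimal sketch of tying the flush to the commit, assuming the storm.trident.state.State interface of this era and an already-open HDFS output stream (the class is illustrative, not the proposed fix itself):

    // Illustrative only: sync on commit() so each Trident batch is durable
    // before its txid is recorded as committed.
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import storm.trident.state.State;

    public class CommitAwareHdfsState implements State {
        private final FSDataOutputStream out; // assumed already open
        private Long currentTxid;

        public CommitAwareHdfsState(FSDataOutputStream out) {
            this.out = out;
        }

        @Override
        public void beginCommit(Long txid) {
            // A real implementation would detect a replay here (txid equal to
            // the last committed txid) and discard any partial output for it.
            this.currentTxid = txid;
        }

        @Override
        public void commit(Long txid) {
            try {
                out.hsync(); // make the batch durable; file rotation could hook in here too
            } catch (IOException e) {
                throw new RuntimeException("sync failed for txid " + txid, e);
            }
        }
    }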



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)