Posted to issues@storm.apache.org by "Roshan Naik (JIRA)" <ji...@apache.org> on 2017/01/19 21:13:26 UTC

[jira] [Created] (STORM-2308) Support for Non-replayable Sources

Roshan Naik created STORM-2308:
----------------------------------

             Summary: Support for Non-replayable Sources
                 Key: STORM-2308
                 URL: https://issues.apache.org/jira/browse/STORM-2308
             Project: Apache Storm
          Issue Type: Sub-task
          Components: storm-core
    Affects Versions: 2.0.0
            Reporter: Roshan Naik


To recover from failures without data loss, Storm (and other streaming systems) places the responsibility for buffering events on the source system. After a crash or other failure, in-flight events can be re-fetched from the source and their processing retried on recovery. A nice benefit of this approach is that it keeps Storm’s architecture simple.

While it is desirable to avoid the complexity of creating an internal reliable buffering system, it is not necessary to restrict Spouts to accepting data only from persistent sources such as Kafka, HDFS, or databases. Some amount of data loss is acceptable in many use cases. Storm already supports such use cases by allowing ACK-ing to be disabled.

Users who can tolerate data loss benefit from spouts that accept data directly from a wider variety of sources, such as HTTP, TCP/UDP, Syslog, or Flume. In such use cases, not forcing all data through a system like Kafka improves end-to-end latency, simplifies management, and reduces the cost of the data pipeline. Users who care about not losing data can always funnel the incoming data via Kafka or another persistent store and enable ACKs.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)