Posted to issues@spark.apache.org by "Hari Shreedharan (JIRA)" <ji...@apache.org> on 2014/10/31 19:48:34 UTC

[jira] [Created] (SPARK-4174) Optionally provide notifications to Receivers when DStream has been generated

Hari Shreedharan created SPARK-4174:
---------------------------------------

             Summary: Optionally provide notifications to Receivers when DStream has been generated
                 Key: SPARK-4174
                 URL: https://issues.apache.org/jira/browse/SPARK-4174
             Project: Spark
          Issue Type: Bug
            Reporter: Hari Shreedharan


Receivers that read from message queues such as ActiveMQ and Kafka can replay messages if required. Using the HDFS WAL mechanism for such systems hurts efficiency, because we incur an unnecessary HDFS write when the data can be recovered from the queue anyway.

We can fix this by providing a notification to the receiver when the RDD is generated from the blocks. We need to consider the case where a receiver fails before the RDD is generated and has come back on a different executor by the time the RDD is generated. Either way, this is likely to cause duplicates rather than data loss, so we may be OK.

I am thinking of something along the lines of accepting a callback function that gets called when the RDD is generated. We can keep the functions locally in a map of batch id -> function and call the matching one when the RDD gets generated (we can inform the ReceiverSupervisorImpl via Akka when the driver generates the RDD). Of course, this is just an early thought; I will work on a design doc for this one.
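
To make the batch id -> function idea concrete, here is a rough sketch, written in Python only for brevity. Everything in it is an assumption for illustration: BatchCallbackRegistry, register and on_rdd_generated are hypothetical names, not Spark APIs, and the call to on_rdd_generated stands in for the notification the driver would send to the ReceiverSupervisorImpl.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout: none of these are Spark APIs.
BatchCallback = Callable[[int], None]


class BatchCallbackRegistry:
    """Receiver-side map of batch id -> callback."""

    def __init__(self) -> None:
        self._callbacks: Dict[int, BatchCallback] = {}

    def register(self, batch_id: int, callback: BatchCallback) -> None:
        # Remember what to run once the driver reports this batch's RDD.
        self._callbacks[batch_id] = callback

    def on_rdd_generated(self, batch_id: int) -> bool:
        # Stand-in for the driver's notification reaching the supervisor.
        # pop() ensures each callback fires at most once. An unknown batch id
        # (for example after the receiver restarted elsewhere) is ignored.
        callback = self._callbacks.pop(batch_id, None)
        if callback is None:
            return False
        callback(batch_id)
        return True


acked: List[int] = []
registry = BatchCallbackRegistry()
registry.register(1, acked.append)  # e.g. acknowledge batch 1 to the queue
registry.register(2, acked.append)

registry.on_rdd_generated(1)  # driver generated the RDD for batch 1
registry.on_rdd_generated(7)  # unknown batch: no callback runs
```

Because the map lives only in the receiver's memory, a receiver that restarts on another executor loses its pending callbacks. The affected messages are then never acknowledged and the queue replays them, which matches the duplicates-not-loss expectation above.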



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
