You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hari Shreedharan (JIRA)" <ji...@apache.org> on 2014/11/05 05:14:35 UTC

[jira] [Commented] (SPARK-4174) Streaming: Optionally provide notifications to Receivers when DStream has been generated

    [ https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197464#comment-14197464 ] 

Hari Shreedharan commented on SPARK-4174:
-----------------------------------------

I will write up a doc soon, but here are some initial thoughts:

We should be able to provide one callback per batch, unfortunately receivers are not really batch aware. So there are couple of options:
- Take a function to be called per block, and call all of the functions in order. This can generate as many callbacks as the blocks per batch - this may not be huge but this also means that there is a chain of methods that would get called when each block gets completed. This is most likely the simplest as far as semantics are concerned. 
- We could add a new method to the Receiver API - `onBlockComplete(List[BlockId])`. This would allow the receiver to implement Unfortunately none of the push* methods returned the block id - therefore this method would really be useful only if the receiver calls the push* methods with the block id passed in. We'd have to document this though. 

I am open to other semantics as well, but the API is a bit limiting since we don't report the block info back to the receiver as of now.

The implementation idea I have for either would bepretty similar. Before I go into that, any thoughts on the semantics?


> Streaming: Optionally provide notifications to Receivers when DStream has been generated
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-4174
>                 URL: https://issues.apache.org/jira/browse/SPARK-4174
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Receivers receiving data from Message Queues, like Active MQ, Kafka etc can replay messages if required. Using the HDFS WAL mechanism for such systems affects efficiency as we are incurring an unnecessary HDFS write when we can recover the data from the queue anyway.
> We can fix this by providing a notification to the receiver when the RDD is generated from the blocks. We need to consider the case where a receiver might fail before the RDD is generated and come back on a different executor when the RDD is generated. Either way, this is likely to cause duplicates and not data loss -- so we may be ok.
> I am thinking about something of the order of accepting a callback function which gets called when the RDD is generated. We can keep the function local in a map of batch id -> function, which gets called when the function gets generated (we can inform the ReceiverSupervisorImpl via Akka when the driver generates the RDD). Of course, just an early thought - I will work on a design doc for this one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org