Posted to issues@spark.apache.org by "Matei Zaharia (JIRA)" <ji...@apache.org> on 2014/11/06 00:45:35 UTC
[jira] [Updated] (SPARK-4040) Update spark documentation for local mode and spark-streaming.

     [ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated SPARK-4040:
---------------------------------
    Assignee: jay vyas

> Update spark documentation for local mode and spark-streaming. 
> ---------------------------------------------------------------
>
>                 Key: SPARK-4040
>                 URL: https://issues.apache.org/jira/browse/SPARK-4040
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: jay vyas
>            Assignee: jay vyas
>             Fix For: 1.2.0
>
>
> *Note: this JIRA has changed since its inception. It's not a bug, but something which can be tricky to surmise from the existing docs, so the attached patch is a doc improvement.*
> Below is the original JIRA which was filed: 
> Please note that I'm somewhat new to Spark Streaming's API and am not a Spark expert, so I've done my best to write up and reproduce this "bug". If it's not a bug, I hope an expert will help explain why and promptly close it. However, it appears it could be a bug after discussing it with [~rnowling], who is a Spark contributor.
> CC [~rnowling] [~willbenton] 
>  
> It appears that, in a DStream context, a call to {{MappedRDD.count()}} blocks progress and prevents emission of RDDs from a stream.
> {noformat}
>     tweetStream.foreachRDD((rdd, lent) => {
>       tweetStream.repartition(1)
>       // val count = rdd.count()  // DON'T DO THIS!
>       checks += 1
>       if (checks > 20) {
>         ssc.stop()
>       }
>     })
> {noformat} 
> The above code block should inevitably halt after 20 intervals of RDDs. However, if we uncomment the call to {{rdd.count()}}, we get an infinite stream which emits no RDDs, and thus our program runs forever ({{ssc.stop()}} is unreachable), because *foreachRDD doesn't receive any more entries*.
> I suspect this is actually because the foreach block never completes, because {{count()}} winds up calling {{compute}}, which ultimately just reads from the stream.
> I haven't put together a minimal reproducer or unit test yet, but I can work on doing so if more info is needed.
> I guess this could be seen as an application bug, but I think Spark might be made smarter and fail loudly when people execute blocking code in a stream processor.
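
For readers trying to reproduce the behaviour described above, here is a minimal sketch of how the same loop might look once local mode is taken into account. It rests on an assumption rather than on anything stated in this message: that the job ran with a master of "local" or "local[1]", where a receiver-based input stream occupies the only available thread and nothing is left to run the batch jobs that count() triggers. The socket source, host, port, object name and the AtomicInteger counter are hypothetical stand-ins for tweetStream and checks, and ssc.stop() is moved out of foreachRDD onto the main thread as a design choice for the sketch.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalModeStreamingSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: "local[2]" leaves one thread for the receiver and one for batch jobs.
    // With "local" or "local[1]" the receiver would take the only thread.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("LocalModeStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical stand-in for tweetStream: lines of text read from localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Counts completed batches; shared between the batch jobs and the main thread.
    val checks = new AtomicInteger(0)

    lines.foreachRDD { rdd =>
      // count() launches a Spark job for this batch, so it needs a free thread to run on.
      val count = rdd.count()
      println(s"batch ${checks.incrementAndGet()}: $count records")
    }

    ssc.start()
    // Poll from the main thread and stop once 20 batches have been processed,
    // instead of calling ssc.stop() from inside the foreachRDD body.
    while (checks.get < 20) Thread.sleep(100)
    ssc.stop()
  }
}
```

If the assumption holds, changing only the master string from "local[2]" back to "local" should bring back the stall reported above, which would make this a quick way to check whether local mode is the cause.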



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
