Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/04/17 10:11:59 UTC

[jira] [Resolved] (SPARK-6974) Possible error in TwitterPopularTags

     [ https://issues.apache.org/jira/browse/SPARK-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-6974.
------------------------------
    Resolution: Invalid

Nicola, this is probably best asked as a question on user@ first.

I'm not sure what you mean here. Yes, a DStream consists of many RDDs. Here we have a window over those RDDs, and the window slides forward as new data arrives. Each time it updates, the top tags in the sliding 60-second window are counted.

This is a direct and straightforward use of the API. I'm not sure what you mean when you say that it can change or isn't documented.
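
As a rough illustration of the windowing behaviour described above, the sketch below simulates it in plain Scala with no Spark dependency. It is not Spark's implementation and not the code from TwitterPopularTags. It assumes each inner sequence stands in for the one RDD a DStream produces per batch interval, and that the window covers the most recent `windowBatches` batches. The names `WindowedTopTags` and `topTagsPerInterval` and the sample tags are hypothetical.

```scala
// Plain-Scala simulation of top-k over a sliding window of batches.
// Hypothetical names throughout; this does not use or reproduce the Spark API.
object WindowedTopTags {

  /** For every batch interval, count the tags seen in the last
    * `windowBatches` batches and return the top `k` as (count, tag) pairs.
    * One result is produced per interval, mirroring one callback per RDD. */
  def topTagsPerInterval(
      batches: Seq[Seq[String]],
      windowBatches: Int,
      k: Int): Seq[Seq[(Int, String)]] =
    batches.indices.map { i =>
      // The window at interval i covers batches (i - windowBatches + 1) to i.
      val window =
        batches.slice(math.max(0, i - windowBatches + 1), i + 1).flatten
      window
        .groupBy(identity)
        .toSeq // convert first, so equal counts are not collapsed as Map keys
        .map { case (tag, occurrences) => (occurrences.size, tag) }
        .sortBy { case (count, tag) => (-count, tag) } // highest count first
        .take(k)
    }

  def main(args: Array[String]): Unit = {
    // Three batch intervals of sample hashtags.
    val batches = Seq(
      Seq("#spark", "#scala", "#spark"),
      Seq("#scala", "#scala"),
      Seq("#jira")
    )
    topTagsPerInterval(batches, windowBatches = 2, k = 2).zipWithIndex.foreach {
      case (top, i) => println(s"interval $i: " + top.mkString(", "))
    }
  }
}
```

In this simulation each interval yields exactly one ranked list for the current window, so the list is refreshed at every interval rather than several competing lists being produced for the same window.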

> Possible error in TwitterPopularTags
> ------------------------------------
>
>                 Key: SPARK-6974
>                 URL: https://issues.apache.org/jira/browse/SPARK-6974
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Nicola Ferraro
>            Priority: Minor
>
> Looking at the Twitter popular tags example in Spark Streaming (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala), it seems that the algorithm can have issues in some cases.
> The top-k tags are computed using the following function on a DStream:
> topCounts60.foreachRDD(rdd => { ... print ... })
> But the function passed to "foreachRDD" is called once per RDD in the DStream, so when the DStream is composed of multiple RDDs it is called multiple times, resulting in multiple top-k charts.
> This scenario is probably unlikely to happen, because a previous transformation on the DStream (reduceByKeyAndWindow) collapses all RDDs of the stream into a single one.
> The problem is that this behavior is not stated in the documentation and could change in future versions.
> Moreover, correctly computing the top-k chart in streaming seems impossible if you rely on the documentation alone, yet it is the base algorithm for many real-time dashboard use cases.
> I have also tried to get a reply on Stack Overflow (http://stackoverflow.com/questions/29539655/how-to-compute-the-top-k-words).


