Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2014/11/08 10:42:33 UTC
[jira] [Resolved] (SPARK-954) One repeated sampling, and I am not sure if it is correct.
[ https://issues.apache.org/jira/browse/SPARK-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-954.
-----------------------------
Resolution: Won't Fix
From the discussion, and from later ones about guarantees of determinism in RDDs, it sounds like this is working as intended.
> One repeated sampling, and I am not sure if it is correct.
> ----------------------------------------------------------
>
> Key: SPARK-954
> URL: https://issues.apache.org/jira/browse/SPARK-954
> Project: Spark
> Issue Type: Story
> Affects Versions: 0.7.3
> Reporter: caizhua
>
> This piece of code reads the dataset and then performs two operations on it. If I consider the RDD as a view definition, I think the result is correct. However, since the first iteration calls result_sample.count(), I was wondering whether the computation in the initialize_doc_topic_word_count(.) function should be repeated when we run the second result_sample.map(lambda (block_id, doc_prob): doc_prob).count(). Since people write Spark code as a program, not as a database view, this is sometimes confusing. For example, given that initialize_doc_topic_word_count(.) is a statistical function with runtime seeds, I am not sure whether this has an impact on the result.
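
The behaviour the reporter describes can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not Spark itself: LazyView is a hypothetical toy class standing in for an uncached RDD, and initialize_doc is a hypothetical stand-in for the reporter's initialize_doc_topic_word_count(.), whose real body is not shown in the issue. In the toy model, every action re-runs the randomized function unless the result has been materialized first.

```python
import random


class LazyView:
    """Toy model of an uncached RDD: each action re-runs fn over the data."""

    def __init__(self, data, fn):
        self.data = list(data)
        self.fn = fn
        self._materialized = None  # filled in only by cache()

    def cache(self):
        # Compute once and keep the result (eager here, for simplicity).
        self._materialized = [self.fn(x) for x in self.data]
        return self

    def collect(self):
        if self._materialized is not None:
            return list(self._materialized)
        # No stored result: recompute from the source data.
        return [self.fn(x) for x in self.data]

    def count(self):
        return len(self.collect())


calls = {"n": 0}
rng = random.Random()  # seeded at run time, like the "runtime seeds" above


def initialize_doc(block_id):
    # Hypothetical randomized initializer; counts how often it is invoked.
    calls["n"] += 1
    return (block_id, rng.random())


blocks = range(5)

# Two actions on the uncached view: the function runs once per action.
uncached = LazyView(blocks, initialize_doc)
uncached.count()
uncached.count()
uncached_calls = calls["n"]

# Two actions on the cached view: the function runs only during cache().
calls["n"] = 0
cached = LazyView(blocks, initialize_doc).cache()
first = cached.collect()
second = cached.collect()
cached_calls = calls["n"]
```

In this model the uncached view invokes the initializer for every element on each action, so two actions may observe different random draws, while the cached view reuses one set of draws. In Spark the analogous step would be calling cache() or persist() on the RDD before the first action, though Spark's caching is lazy and evicted partitions can still be recomputed, so it should not be read as a strict determinism guarantee.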
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)