Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2014/11/08 10:42:33 UTC

[jira] [Resolved] (SPARK-954) One repeated sampling, and I am not sure if it is correct.

     [ https://issues.apache.org/jira/browse/SPARK-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-954.
-----------------------------
    Resolution: Won't Fix

From this discussion, and later ones about guarantees of determinism in RDDs, it sounds like this is working as intended.

> One repeated sampling, and I am not sure if it is correct.
> ----------------------------------------------------------
>
>                 Key: SPARK-954
>                 URL: https://issues.apache.org/jira/browse/SPARK-954
>             Project: Spark
>          Issue Type: Story
>    Affects Versions: 0.7.3
>            Reporter: caizhua
>
> This piece of code reads the dataset and then performs two operations on it. If I consider the RDD as a view definition, I think the result is correct. However, since the first iteration already calls result_sample.count(), I was wondering whether the computation in the initialize_doc_topic_word_count(.) function is repeated when we run the second action, result_sample.map(lambda (block_id, doc_prob): doc_prob).count(). Since people write Spark programs as programs rather than as database views, this can be confusing. For example, if initialize_doc_topic_word_count(.) is a statistical function that uses runtime seeds, I am not sure whether this recomputation affects the result.
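The pitfall the reporter describes can be mimicked in plain Python (a hedged sketch, not Spark itself): an uncached RDD's lineage is re-executed on every action, so a transformation that draws runtime seeds can yield different values each time it is evaluated. Here the "RDD" is a thunk, the "runtime seed" is a hypothetical counter, and the names are illustrative only.

```python
import itertools

# Hypothetical stand-in for a "runtime seed": changes on every evaluation,
# the way fresh random state would inside initialize_doc_topic_word_count(.).
_seed = itertools.count()

def transform(data):
    """Lazy transformation: returns a thunk, loosely like an RDD lineage."""
    return lambda: [x * 10 + next(_seed) for x in data]

rdd_like = transform([1, 2, 3])

# Two "actions" on the uncached pipeline (like two .count() calls)
# re-run the transformation, so the seed-dependent values differ.
first_action = rdd_like()
second_action = rdd_like()

# Materializing the result once and reusing it makes later "actions"
# consistent -- which is what rdd.cache() / rdd.persist() buys in Spark.
cached = rdd_like()
third_action = cached  # reused as-is, not recomputed
```

In real Spark, calling result_sample.cache() before the first count() would materialize the sampled data so both actions observe the same values; without it, whether the lineage is recomputed is left to Spark, which is why the issue was resolved as working as intended.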



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org