Posted to reviews@spark.apache.org by ThomasLau <gi...@git.apache.org> on 2016/05/27 06:15:18 UTC

[GitHub] spark pull request: fix: a forked process random extends parent's ...

GitHub user ThomasLau opened a pull request:

    https://github.com/apache/spark/pull/13350

    fix: a forked process random extends parent's random state

    ## Add random.seed() before the forked worker runs in daemon.py
    Here is a test script:
    
    ```python
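    from random import random
    from operator import add
    
    # `sc` is the SparkContext that the pyspark shell provides.
    
    def funcx(x):
        # Print the sampled point, then return 1 if it lies inside the unit circle.
        print x[0], x[1]
        return 1 if x[0]**2 + x[1]**2 < 1 else 0
    
    def genRnd(ind):
        # Sample a point uniformly from the square [-1, 1) x [-1, 1).
        x = random() * 2 - 1
        y = random() * 2 - 1
        return (x, y)
    
    def runsp(total):
        # Monte Carlo estimate of pi on a single partition.
        ret = sc.parallelize(xrange(total), 1).map(genRnd).map(funcx).reduce(lambda x, y: x + y) / float(total) * 4
        print ret
    
    runsp(3)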
    ```
    
    Once I start the pyspark shell, no matter how many times I run "runsp(N)" afterwards, this code always prints:
    
    ```
    0.896083541418 -0.635625854075
    -0.0423532645466 -0.526910255885
    0.498518696049 -0.872983895832
    1.3333333333333333
    >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
    0.896083541418 -0.635625854075
    -0.0423532645466 -0.526910255885
    0.498518696049 -0.872983895832
    1.3333333333333333
    >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
    0.896083541418 -0.635625854075
    -0.0423532645466 -0.526910255885
    0.498518696049 -0.872983895832
    1.3333333333333333
    ```
    
    I think this is because importing pyspark.worker in daemon.py also imports the random module (via shuffle.py, which pyspark.worker imports). Each worker forked by "pid = os.fork()" inherits the parent's random state, so every forked worker produces the same sequence of random numbers.
    
    ## We need to reseed the generator with random.seed(), which solves the problem, but I think this PR may not be the proper fix.
    Thanks.
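
The behaviour described above can be reproduced outside Spark. The following is a minimal sketch, not the code in daemon.py: the helper `child_draws`, the name `parent_rng`, and the seed 12345 are hypothetical and exist only for illustration. It uses an explicit `random.Random` instance because, on Python 3.7 and later, the module-level `random` functions are reseeded automatically in a forked child, which would hide the effect. It requires a POSIX system where `os.fork()` is available.

```python
import os
import random


def child_draws(rng, reseed):
    """Fork one child, let it draw a number from rng, and return that number."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it starts with a copy of the parent's generator state.
        os.close(read_fd)
        if reseed:
            rng.seed()  # reseed from OS entropy, analogous to calling random.seed()
        os.write(write_fd, repr(rng.random()).encode("ascii"))
        os.close(write_fd)
        os._exit(0)
    # Parent process: read the child's value and wait for it to exit.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        value = float(pipe.read())
    os.waitpid(pid, 0)
    return value


# Hypothetical stand-in for the generator state held by the daemon process.
parent_rng = random.Random(12345)

# The parent never draws from parent_rng, so every child copies the same state.
inherited = [child_draws(parent_rng, reseed=False) for _ in range(3)]
reseeded = [child_draws(parent_rng, reseed=True) for _ in range(3)]

print(inherited)  # the same value three times
print(reseeded)   # values differ, since each child reseeded itself
```

Without reseeding, each child repeats the parent's next value, which matches the identical output shown in the PR description. Reseeding in the child right after the fork gives each worker its own sequence.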
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ThomasLau/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13350
    
----
commit 595abab29fb9dd5889885dd4cfd4676caa161601
Author: Thomas <th...@thomasmac.local>
Date:   2016-05-27T03:52:59Z

    fix: a forked process random extends parent's random state

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: fix: a forked process random extends parent's ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13350#issuecomment-222073415
  
    Hi @ThomasLau, it would be nicer if this contribution followed https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.




[GitHub] spark pull request #13350: [SPARK-15611][PySPARK] fix got the same sequence ...

Posted by ThomasLau <gi...@git.apache.org>.
Github user ThomasLau closed the pull request at:

    https://github.com/apache/spark/pull/13350




[GitHub] spark pull request: fix: a forked process random extends parent's ...

Posted by ThomasLau <gi...@git.apache.org>.
Github user ThomasLau commented on the pull request:

    https://github.com/apache/spark/pull/13350#issuecomment-222110671
  
    @HyukjinKwon Sorry about that. I have just reported this bug in JIRA as [SPARK-15611](https://issues.apache.org/jira/browse/SPARK-15611).
    
    Many thanks for your advice!




[GitHub] spark pull request: fix: a forked process random extends parent's ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13350#issuecomment-222068744
  
    Can one of the admins verify this patch?

