Posted to reviews@spark.apache.org by ThomasLau <gi...@git.apache.org> on 2016/05/27 06:15:18 UTC
[GitHub] spark pull request: fix: a forked process random extends parent's ...
GitHub user ThomasLau opened a pull request:
https://github.com/apache/spark/pull/13350
fix: a forked process random extends parent's random state
## Add random.seed() before the forked worker runs in daemon.py
Here is some test code:
```python
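from random import random
from operator import add

# Python 2 syntax; `sc` is the SparkContext provided by the pyspark shell.
def funcx(x):
    print x[0], x[1]
    # 1 if the point lies inside the unit circle, else 0
    return 1 if x[0]**2 + x[1]**2 < 1 else 0

def genRnd(ind):
    # Random point in the square [-1, 1) x [-1, 1); `ind` is unused.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return (x, y)

def runsp(total):
    # Monte Carlo estimate of pi on a single partition.
    ret = sc.parallelize(xrange(total), 1).map(genRnd).map(funcx).reduce(lambda x, y: x + y) / float(total) * 4
    print ret

runsp(3)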
```
Once I start the pyspark shell, no matter how many times I run "runsp(N)" afterwards, this code always prints out
```
0.896083541418 -0.635625854075
-0.0423532645466 -0.526910255885
0.498518696049 -0.872983895832
1.3333333333333333
>>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
0.896083541418 -0.635625854075
-0.0423532645466 -0.526910255885
0.498518696049 -0.872983895832
1.3333333333333333
>>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
0.896083541418 -0.635625854075
-0.0423532645466 -0.526910255885
0.498518696049 -0.872983895832
1.3333333333333333
```
I think this is because when daemon.py imports pyspark.worker, it also imports the random module (through shuffle.py, which pyspark.worker imports). Each worker, forked by "pid = os.fork()", inherits a copy of the parent's random state, so every forked worker gets the same sequence of values from random().
## We need to reseed the generator with random.seed(), which solves the problem, but I think this PR may not be the proper fix.
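The effect and the reseeding idea can be reproduced outside Spark. The following is only a minimal sketch under stated assumptions, not the code in daemon.py: `rng`, `draw_in_child`, and the seed 12345 are hypothetical names and values chosen for illustration. It uses an explicit `random.Random` instance because newer Python 3 releases may already reseed the module-level generator in a forked child on their own, which would hide the problem. It needs a platform that supports os.fork().

```python
import os
import random

# Hypothetical stand-in for generator state that exists before any worker is forked.
rng = random.Random(12345)


def draw_in_child(reseed):
    """Fork once and return the first value the child draws from rng."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: rng is a copy of the parent's generator as it was at fork time.
        os.close(read_fd)
        if reseed:
            # No argument: seed from OS entropy (or the current time as a fallback).
            rng.seed()
        os.write(write_fd, repr(rng.random()).encode())
        os._exit(0)
    # Parent: collect the child's value; the parent's rng is never advanced.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        value = float(pipe.read())
    os.waitpid(pid, 0)
    return value


without_reseed = [draw_in_child(reseed=False) for _ in range(3)]
with_reseed = [draw_in_child(reseed=True) for _ in range(3)]
print(without_reseed)  # every child repeats the same value
print(with_reseed)     # each child draws from its own fresh state
```

Because the parent never draws from `rng` between forks, every child without a reseed starts from the same state, which matches the repeated output above. Reseeding inside the child, right after the fork, breaks that link.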
Thanks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ThomasLau/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13350.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13350
----
commit 595abab29fb9dd5889885dd4cfd4676caa161601
Author: Thomas <th...@thomasmac.local>
Date: 2016-05-27T03:52:59Z
fix: a forked process random extends parent's random state
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: fix: a forked process random extends parent's ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:
https://github.com/apache/spark/pull/13350#issuecomment-222073415
Hi @ThomasLau, it would be nicer if the contribution followed https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.
[GitHub] spark pull request #13350: [SPARK-15611][PySPARK] fix got the same sequence ...
Posted by ThomasLau <gi...@git.apache.org>.
Github user ThomasLau closed the pull request at:
https://github.com/apache/spark/pull/13350
[GitHub] spark pull request: fix: a forked process random extends parent's ...
Posted by ThomasLau <gi...@git.apache.org>.
Github user ThomasLau commented on the pull request:
https://github.com/apache/spark/pull/13350#issuecomment-222110671
@HyukjinKwon Sorry about that. I just reported this bug in JIRA as [SPARK-15611](https://issues.apache.org/jira/browse/SPARK-15611).
And many, many thanks for your advice!
[GitHub] spark pull request: fix: a forked process random extends parent's ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/13350#issuecomment-222068744
Can one of the admins verify this patch?