Posted to reviews@spark.apache.org by darabos <gi...@git.apache.org> on 2015/02/11 14:56:50 UTC

[GitHub] spark pull request: Remove outdated remark about take(n).

GitHub user darabos opened a pull request:

    https://github.com/apache/spark/pull/4533

    Remove outdated remark about take(n).

    Looking at the code, I believe this remark about `take(n)` computing partitions on the driver is no longer correct. Apologies if I'm wrong.
    
    This came up in http://stackoverflow.com/q/28436559/3318517.
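
For context, a minimal usage sketch of the call under discussion (a hypothetical example, assuming a live `SparkContext` named `sc` and a Spark dependency on the classpath):

    // take(5) only launches a job over as many partitions as it needs
    // (typically just the first one) rather than computing the whole 100-partition RDD.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    val firstFive: Array[Int] = rdd.take(5)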

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/darabos/spark patch-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4533.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4533
    
----
commit cc80f3ac1cd2a776cf51c1831dd45e2580a0e2a8
Author: Daniel Darabos <da...@gmail.com>
Date:   2015-02-11T13:55:20Z

    Remove outdated remark about take(n).

----



[GitHub] spark pull request: Remove outdated remark about take(n).

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4533#issuecomment-73925690
  
    Sounds correct. The subsequent attempts do run in parallel, so I suppose that's pretty good evidence it's parallelized. Unless anyone else speaks up, I think this sentence can be removed.



[GitHub] spark pull request: Remove outdated remark about take(n).

Posted by darabos <gi...@git.apache.org>.
Github user darabos commented on the pull request:

    https://github.com/apache/spark/pull/4533#issuecomment-73910087
  
    Oh, thanks. I never looked into how `allowLocal` works.
    
    Looks like it results in local execution if the number of affected partitions is 1 (https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L749). So `take(n)` will always start with a local execution of partition 0, and then, if it decides it needs, say, 10 more partitions, those partitions will be executed non-locally in parallel. Is that reading correct?
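
As a reading aid, here is a condensed paraphrase of that check (illustrative names only, not the actual DAGScheduler code at the line linked above): as far as I can tell, the job runs locally on the driver only when local execution is allowed, exactly one partition is requested, and the final stage has no parent stages.

    // Simplified sketch of the scheduler's local-execution decision described above
    // (hypothetical signature; see the DAGScheduler line linked in the comment for the real check).
    def shouldRunLocally(allowLocal: Boolean, numParentStages: Int, numPartitions: Int): Boolean =
      allowLocal && numParentStages == 0 && numPartitions == 1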



[GitHub] spark pull request: Remove outdated remark about take(n).

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4533#issuecomment-73884236
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: Remove outdated remark about take(n).

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4533



[GitHub] spark pull request: Remove outdated remark about take(n).

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4533#issuecomment-73907037
  
    Looking at the implementation, the driver queries one partition at a time for the number of elements it thinks it needs and continues until it is satisfied. I'd imagine this means querying just the first partition in almost all cases, though it is done serially. That seems about right to me, as opposed to preemptively querying many partitions. The job can run locally on the driver. So I don't know that the comment is categorically wrong, but it also doesn't seem to be communicating anything clearly helpful. For that reason, IMHO it could be removed. There's no "gotcha" I know of here.
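
To make that scan-and-grow strategy concrete, here is a self-contained sketch over plain in-memory sequences (a simplification with illustrative names; the real logic lives in `RDD.take` and submits Spark jobs over ranges of partitions instead):

    // Simplified model of take(num): scan partitions starting from the first,
    // growing the number of partitions scanned per round until enough rows are found.
    def simulateTake[T](partitions: Seq[Seq[T]], num: Int): List[T] = {
      val buf = scala.collection.mutable.ArrayBuffer.empty[T]
      var partsScanned = 0
      while (buf.size < num && partsScanned < partitions.length) {
        // The first round looks at a single partition; later rounds extrapolate from
        // how many rows the scanned partitions have yielded so far, overestimating a little.
        val numPartsToTry =
          if (partsScanned == 0) 1
          else if (buf.isEmpty) partitions.length - partsScanned
          else math.max(1, (1.5 * num * partsScanned / buf.size).toInt)
        val range = partsScanned until math.min(partsScanned + numPartsToTry, partitions.length)
        // In Spark this round would be a single runJob over `range`, run as parallel tasks.
        range.foreach(p => buf ++= partitions(p).take(num - buf.size))
        partsScanned += numPartsToTry
      }
      buf.toList
    }

    // Example: ten "partitions" of three elements each; take(5) scans only the first few.
    val parts = (0 until 10).map(i => Seq(i * 3, i * 3 + 1, i * 3 + 2))
    println(simulateTake(parts, 5))  // List(0, 1, 2, 3, 4)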

