You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/08/16 02:36:05 UTC

[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/1978

    [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()

    Using external sort to support sort large datasets in reduce stage.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark sort

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1978.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1978
    
----
commit 55602ee6122aa5ce5b3f52b66cb0a74a6e275fba
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-15T21:48:34Z

    use external sort in sortBy() and sortByKey()

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52385449
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53510004
  
    Cool, thanks! Going to merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52859801
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52721491
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52703837
  
    In Python 2.6/7,  heapq.merge() do not support `key` and `reverse`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16689375
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -434,6 +437,63 @@ def _recursive_merged_items(self, start):
                     os.remove(os.path.join(path, str(i)))
     
     
    +class ExternalSorter(object):
    +    """
    +    ExtenalSorter will divide the elements into chunks, sort them in
    +    memory and dump them into disks, finally merge them back.
    +
    +    The spilling will only happen when the used memory goes above
    +    the limit.
    +    """
    +    def __init__(self, memory_limit, serializer=None):
    +        self.memory_limit = memory_limit
    +        self.local_dirs = _get_local_dirs("sort")
    +        self.serializer = serializer or BatchedSerializer(PickleSerializer(), 1024)
    +
    +    def _get_path(self, n):
    --- End diff --
    
    Yeah good question. We don't have that yet, but in the future we'll have support for multiple local storage levels.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52717916
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18898/consoleFull) for   PR 1978 at commit [`eb53ca6`](https://github.com/apache/spark/commit/eb53ca6b42c7cc29398a680efcb198f7334879fb).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1978


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52717406
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18896/consoleFull) for   PR 1978 at commit [`644abaf`](https://github.com/apache/spark/commit/644abaf0e05e1ba534ae7ddcd34cfa0e0e9cc25e).
     * This patch **does not** merge cleanly!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53472233
  
    @mateiz added checking for spilled bytes in tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16445660
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -54,6 +57,13 @@ def get_used_memory():
             return 0
     
     
    +def _get_local_dirs(sub):
    +    """ Get all the directories """
    +    path = os.environ.get("SPARK_LOCAL_DIR", "/tmp")
    --- End diff --
    
    As of my new PR, this will need to be changed to "SPARK_LOCAL_DIRS" (plural).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52725086
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18906/consoleFull) for   PR 1978 at commit [`eb53ca6`](https://github.com/apache/spark/commit/eb53ca6b42c7cc29398a680efcb198f7334879fb).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53356744
  
    @mateiz I had addressed above comments, it also fix the same problem for external merger, please take another look again, thx. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52712481
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18888/consoleFull) for   PR 1978 at commit [`644abaf`](https://github.com/apache/spark/commit/644abaf0e05e1ba534ae7ddcd34cfa0e0e9cc25e).
     * This patch **does not** merge cleanly!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52702967
  
    Why do we need to use `heapq3`?  Is there a way to support this feature using the standard Python 2.7 `heapq`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52712494
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18888/consoleFull) for   PR 1978 at commit [`644abaf`](https://github.com/apache/spark/commit/644abaf0e05e1ba534ae7ddcd34cfa0e0e9cc25e).
     * This patch **fails** unit tests.
     * This patch **does not** merge cleanly!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53514954
  
    @mateiz PR #2152


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52717415
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18896/consoleFull) for   PR 1978 at commit [`644abaf`](https://github.com/apache/spark/commit/644abaf0e05e1ba534ae7ddcd34cfa0e0e9cc25e).
     * This patch **fails** unit tests.
     * This patch **does not** merge cleanly!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52385513
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18665/consoleFull) for   PR 1978 at commit [`55602ee`](https://github.com/apache/spark/commit/55602ee6122aa5ce5b3f52b66cb0a74a6e275fba).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52717234
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52718971
  
    @massie @jey @pwendell Is there a reason why it would be unsafe to run `git clean` in the pull request builder?  Would this inadvertently delete any files that it needs, such as Spark configurations?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52717934
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18898/consoleFull) for   PR 1978 at commit [`eb53ca6`](https://github.com/apache/spark/commit/eb53ca6b42c7cc29398a680efcb198f7334879fb).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ExternalSorter(object):`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53361464
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19168/consoleFull) for   PR 1978 at commit [`b125d2f`](https://github.com/apache/spark/commit/b125d2f3928373809281272ee40e779f2d1e2c73).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ExternalSorter(object):`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53467379
  
    Looks pretty good, just added one question about the test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53480301
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19228/consoleFull) for   PR 1978 at commit [`bbcd9ba`](https://github.com/apache/spark/commit/bbcd9bac5b0e7a134b20328800580f00ba1f6b32).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)`
      * `In multiclass classification, all `$2^`
      * `public final class JavaDecisionTree `
      * `class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable `
      * `class BoundedFloat(float):`
      * `class ExternalSorter(object):`
      * `class JoinedRow2 extends Row `
      * `class JoinedRow3 extends Row `
      * `class JoinedRow4 extends Row `
      * `class JoinedRow5 extends Row `
      * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
      * `abstract class MutableValue extends Serializable `
      * `final class MutableInt extends MutableValue `
      * `final class MutableFloat extends MutableValue `
      * `final class MutableBoolean extends MutableValue `
      * `final class MutableDouble extends MutableValue `
      * `final class MutableShort extends MutableValue `
      * `final class MutableLong extends MutableValue `
      * `final class MutableByte extends MutableValue `
      * `final class MutableAny extends MutableValue `
      * `final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow `
      * `case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate `
      * `case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression `
      * `case class CollectHashSetFunction(`
      * `case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression `
      * `case class CombineSetsAndCountFunction(`
      * `case class CountDistinctFunction(`
      * `case class MaxOf(left: Expression, right: Expression) extends Expression `
      * `case class NewSet(elementType: DataType) extends LeafExpression `
      * `case class AddItemToSet(item: Expression, set: Expression) extends Expression `
      * `case class CombineSets(left: Expression, right: Expression) extends BinaryExpression `
      * `case class CountSet(child: Expression) extends UnaryExpression `
      * `case class ExplainCommand(plan: LogicalPlan, extended: Boolean = false) extends Command `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53357012
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19168/consoleFull) for   PR 1978 at commit [`b125d2f`](https://github.com/apache/spark/commit/b125d2f3928373809281272ee40e779f2d1e2c73).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53472545
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19228/consoleFull) for   PR 1978 at commit [`bbcd9ba`](https://github.com/apache/spark/commit/bbcd9bac5b0e7a134b20328800580f00ba1f6b32).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16732341
  
    --- Diff: python/pyspark/tests.py ---
    @@ -117,6 +119,37 @@ def test_huge_dataset(self):
             m._cleanup()
     
     
    +class TestSorter(unittest.TestCase):
    +    def test_in_memory_sort(self):
    +        l = range(1024)
    +        random.shuffle(l)
    +        sorter = ExternalSorter(1024)
    +        self.assertEquals(sorted(l), list(sorter.sorted(l)))
    +        self.assertEquals(sorted(l, reverse=True), list(sorter.sorted(l, reverse=True)))
    +        self.assertEquals(sorted(l, key=lambda x: -x), list(sorter.sorted(l, key=lambda x: -x)))
    +        self.assertEquals(sorted(l, key=lambda x: -x, reverse=True),
    +                          list(sorter.sorted(l, key=lambda x: -x, reverse=True)))
    +
    +    def test_external_sort(self):
    +        l = range(1024)
    +        random.shuffle(l)
    +        sorter = ExternalSorter(1)
    +        self.assertEquals(sorted(l), list(sorter.sorted(l)))
    +        self.assertEquals(sorted(l, reverse=True), list(sorter.sorted(l, reverse=True)))
    +        self.assertEquals(sorted(l, key=lambda x: -x), list(sorter.sorted(l, key=lambda x: -x)))
    +        self.assertEquals(sorted(l, key=lambda x: -x, reverse=True),
    +                          list(sorter.sorted(l, key=lambda x: -x, reverse=True)))
    +
    +    def test_external_sort_in_rdd(self):
    +        conf = SparkConf().set("spark.python.worker.memory", "1m")
    +        sc = SparkContext(conf=conf)
    +        l = range(10240)
    +        random.shuffle(l)
    +        rdd = sc.parallelize(l, 10)
    +        self.assertEquals(sorted(l), rdd.sortBy(lambda x: x).collect())
    +        sc.stop()
    --- End diff --
    
    Have you tested that this actually spills any data? I guess it does because the bare Python interpreter already consumes more than 1 MB?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52864287
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19013/consoleFull) for   PR 1978 at commit [`1f075ed`](https://github.com/apache/spark/commit/1f075eda06270b8b6dd04a2ef6d31bc7563a182d).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)`
      * `class ExternalSorter(object):`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52721781
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18906/consoleFull) for   PR 1978 at commit [`eb53ca6`](https://github.com/apache/spark/commit/eb53ca6b42c7cc29398a680efcb198f7334879fb).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52712122
  
    cc @mateiz 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52702292
  
    I think we also need to add a license statement to LICENSE.txt (like we've done with CloudPickle and Py4J).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-53510087
  
    BTW can you send a PR for the randomizing change to branch-1.1? I don't think we'll add sorting in branch-1.1 since it's a new feature, but we can add that randomizing patch as a bug fix. Or do you think it won't matter much?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52860143
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19013/consoleFull) for   PR 1978 at commit [`1f075ed`](https://github.com/apache/spark/commit/1f075eda06270b8b6dd04a2ef6d31bc7563a182d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52719256
  
    I ssh'd into the worker and deleted that checkpoint directory, so maybe it will work now.
    
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52718525
  
    Test failure is due to RAT complaining about some temporary files left behind by another test:
    
    ```
    =========================================================================
    Running Apache RAT checks
    =========================================================================
    Could not find Apache license headers in the following files:
     !????? /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/checkpoint/.temp.crc
     !????? /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/checkpoint/temp
    ```
    
    There's a few ways to fix this:
    
    - Add an exclude to .rat-excludes
    - Modify the pull request builder to use `git clean` to remove these untracked files from the working tree.
    - Configure RAT to only check files tracked by git (by using `git ls-files`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16687577
  
    --- Diff: python/pyspark/tests.py ---
    @@ -117,6 +118,28 @@ def test_huge_dataset(self):
             m._cleanup()
     
     
    +class TestSorter(unittest.TestCase):
    +    def test_in_memory_sort(self):
    +        l = range(1024)
    +        random.shuffle(l)
    +        sorter = ExternalSorter(1024)
    +        self.assertEquals(sorted(l), list(sorter.sorted(l)))
    +        self.assertEquals(sorted(l, reverse=True), list(sorter.sorted(l, reverse=True)))
    +        self.assertEquals(sorted(l, key=lambda x: -x), list(sorter.sorted(l, key=lambda x: -x)))
    +        self.assertEquals(sorted(l, key=lambda x: -x, reverse=True),
    +                          list(sorter.sorted(l, key=lambda x: -x, reverse=True)))
    +
    +    def test_external_sort(self):
    +        l = range(1024)
    +        random.shuffle(l)
    +        sorter = ExternalSorter(1)
    +        self.assertEquals(sorted(l), list(sorter.sorted(l)))
    +        self.assertEquals(sorted(l, reverse=True), list(sorter.sorted(l, reverse=True)))
    +        self.assertEquals(sorted(l, key=lambda x: -x), list(sorter.sorted(l, key=lambda x: -x)))
    +        self.assertEquals(sorted(l, key=lambda x: -x, reverse=True),
    +                          list(sorter.sorted(l, key=lambda x: -x, reverse=True)))
    +
    --- End diff --
    
    Would be good to add a test that calls sortByKey on a large dataset here too, in addition to just testing the ExternalSorter separately.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16687536
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -434,6 +437,63 @@ def _recursive_merged_items(self, start):
                     os.remove(os.path.join(path, str(i)))
     
     
    +class ExternalSorter(object):
    +    """
    +    ExtenalSorter will divide the elements into chunks, sort them in
    +    memory and dump them into disks, finally merge them back.
    +
    +    The spilling will only happen when the used memory goes above
    +    the limit.
    +    """
    +    def __init__(self, memory_limit, serializer=None):
    +        self.memory_limit = memory_limit
    +        self.local_dirs = _get_local_dirs("sort")
    +        self.serializer = serializer or BatchedSerializer(PickleSerializer(), 1024)
    +
    +    def _get_path(self, n):
    --- End diff --
    
    Basically I'm worried that everyone writes to disk1 first, then everyone writes to disk2, etc, and we only use one disk at a time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16688107
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -434,6 +437,63 @@ def _recursive_merged_items(self, start):
                     os.remove(os.path.join(path, str(i)))
     
     
    +class ExternalSorter(object):
    +    """
    +    ExtenalSorter will divide the elements into chunks, sort them in
    +    memory and dump them into disks, finally merge them back.
    +
    +    The spilling will only happen when the used memory goes above
    +    the limit.
    +    """
    +    def __init__(self, memory_limit, serializer=None):
    +        self.memory_limit = memory_limit
    +        self.local_dirs = _get_local_dirs("sort")
    +        self.serializer = serializer or BatchedSerializer(PickleSerializer(), 1024)
    +
    +    def _get_path(self, n):
    --- End diff --
    
    Good catch, maybe shuffling the directories randomly in the begging would be better.
    
    PS: Could you have a configured policy to choose local disks, such as use the first one AMAP, it's will be useful when one of the local disks is SSD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1978#discussion_r16687523
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -434,6 +437,63 @@ def _recursive_merged_items(self, start):
                     os.remove(os.path.join(path, str(i)))
     
     
    +class ExternalSorter(object):
    +    """
    +    ExtenalSorter will divide the elements into chunks, sort them in
    +    memory and dump them into disks, finally merge them back.
    +
    +    The spilling will only happen when the used memory goes above
    +    the limit.
    +    """
    +    def __init__(self, memory_limit, serializer=None):
    +        self.memory_limit = memory_limit
    +        self.local_dirs = _get_local_dirs("sort")
    +        self.serializer = serializer or BatchedSerializer(PickleSerializer(), 1024)
    +
    +    def _get_path(self, n):
    --- End diff --
    
    Because there will be multiple Python worker processes running on the same node, if they all need to spill, it looks like they'll use the same directories in order here. Can you instead start each one at a random ID and then increment that to have it cycle through?
    
    I'm not sure whether this can also affect the external hashing code, but if so, it would be good to fix that too (as a separate JIRA).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3073] [PySpark] use external sort in so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1978#issuecomment-52386347
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18665/consoleFull) for   PR 1978 at commit [`55602ee`](https://github.com/apache/spark/commit/55602ee6122aa5ce5b3f52b66cb0a74a6e275fba).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class Serializer `
      * `abstract class SerializerInstance `
      * `abstract class SerializationStream `
      * `abstract class DeserializationStream `
      * `class ExternalSorter(object):`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org