Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/09/10 00:04:58 UTC

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2336

    [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python

    Aggregate the number of bytes spilled to disk during aggregation or sorting, and show them in the Web UI.
    
    ![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png)
    
    This patch is blocked by SPARK-3465 (a fix for that issue is included here).
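    
    For readers unfamiliar with the PySpark shuffle code, here is a minimal sketch of the accounting pattern, assuming the names `MemoryBytesSpilled`, `DiskBytesSpilled`, and `get_used_memory()` from `pyspark/shuffle.py`; the `spill()` helper itself is illustrative only, not the real ExternalMerger/ExternalSorter code.
    
    ```python
    import os
    import pickle
    import resource
    import tempfile
    
    # Module-level counters, mirroring the globals added to pyspark/shuffle.py
    # (there they start as 0L, since PySpark still targeted Python 2 at the time).
    MemoryBytesSpilled = 0
    DiskBytesSpilled = 0
    
    
    def get_used_memory():
        # Rough stand-in for shuffle.get_used_memory(): peak resident set size in MB
        # on Linux (the real helper prefers psutil, which reports the current RSS).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss >> 10
    
    
    def spill(chunk):
        # Illustrative spill helper: dump one in-memory chunk to disk, drop the
        # in-memory copy, and account for both disk bytes and freed memory.
        global MemoryBytesSpilled, DiskBytesSpilled
        used_before = get_used_memory()
    
        path = os.path.join(tempfile.mkdtemp(), "spill.bin")
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        del chunk[:]  # release the in-memory data once it is safely on disk
    
        DiskBytesSpilled += os.path.getsize(path)
        MemoryBytesSpilled += max(used_before - get_used_memory(), 0) << 20  # MB -> bytes
        return path
    ```
    
    At the end of a task the worker can report these two totals back to the JVM alongside the other task metrics, which is what makes it possible to surface them in the Web UI shown above.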


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2336.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2336
    
----
commit fbe902942449d8b112a4f6ca1e0be6968218213c
Author: Davies Liu <da...@gmail.com>
Date:   2014-09-09T21:54:51Z

    show spilled bytes in Python in web ui

----




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55193472
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/36/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55219182
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/62/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55192697
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/35/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55219139
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/62/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55219154
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/63/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17452140
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -486,15 +496,18 @@ def sorted(self, iterator, key=None, reverse=False):
                 if len(chunk) < batch:
                     break
     
    -            if get_used_memory() > self.memory_limit:
    +            used_memory = get_used_memory()
    +            if used_memory > self.memory_limit:
                     # sort them inplace will save memory
                     current_chunk.sort(key=key, reverse=reverse)
                     path = self._get_path(len(chunks))
                     with open(path, 'w') as f:
                         self.serializer.dump_stream(current_chunk, f)
    -                self._spilled_bytes += os.path.getsize(path)
                     chunks.append(self.serializer.load_stream(open(path)))
                     current_chunk = []
    +                gc.collect()
    --- End diff --
    
    Why do we garbage collect here?




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17452035
  
    --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
    @@ -242,7 +242,8 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {
                 t.taskMetrics)
     
               // Overwrite task metrics
    -          t.taskMetrics = Some(taskMetrics)
    +          // FIXME: deepcopy the metrics, or they will be the same object in local mode
    +          t.taskMetrics = Some(scala.util.Marshal.load[TaskMetrics](scala.util.Marshal.dump(taskMetrics)))
    --- End diff --
    
    Do we want to do something similar to what you did in #2338 here, i.e. do it only if this is local mode?




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17509949
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -68,6 +68,11 @@ def _get_local_dirs(sub):
         return [os.path.join(d, "python", str(os.getpid()), sub) for d in dirs]
     
     
    +# global stats
    +MemoryBytesSpilled = 0L
    +DiskBytesSpilled = 0L
    +
    +
    --- End diff --
    
    I already clear these two before running a task, so this will also work with reused Python workers.
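    
    For context, a minimal sketch of that reset, assuming the two globals from `pyspark/shuffle.py` in this patch; the `run_task()` wrapper is illustrative only (the actual reset happens before the task body runs, per the comment above):
    
    ```python
    from pyspark import shuffle  # assumes a PySpark build that includes this patch
    
    
    def run_task(task_fn):
        # Zero the counters before each task, so a reused Python worker reports
        # spill metrics for the current task only rather than a running total.
        shuffle.MemoryBytesSpilled = 0
        shuffle.DiskBytesSpilled = 0
    
        result = task_fn()
    
        # Afterwards the accumulated totals can be shipped back to the JVM
        # together with the other task metrics.
        return result, shuffle.MemoryBytesSpilled, shuffle.DiskBytesSpilled
    ```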




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17457233
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -486,15 +496,18 @@ def sorted(self, iterator, key=None, reverse=False):
                 if len(chunk) < batch:
                     break
     
    -            if get_used_memory() > self.memory_limit:
    +            used_memory = get_used_memory()
    +            if used_memory > self.memory_limit:
                     # sort them inplace will save memory
                     current_chunk.sort(key=key, reverse=reverse)
                     path = self._get_path(len(chunks))
                     with open(path, 'w') as f:
                         self.serializer.dump_stream(current_chunk, f)
    -                self._spilled_bytes += os.path.getsize(path)
                     chunks.append(self.serializer.load_stream(open(path)))
                     current_chunk = []
    +                gc.collect()
    --- End diff --
    
    Try to reclaim as much memory as possible, so later sorts and spills can process as much data as possible. @mateiz did some benchmarking of ExternalMerger; this can improve performance in some cases (such as a list of ints).
    
    Also, after gc.collect(), get_used_memory() will be more accurate.
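    
    A minimal sketch of that pattern, with the chunk handling reduced to a stand-alone helper rather than the real `sorted()` loop:
    
    ```python
    import gc
    import os
    import pickle
    import tempfile
    
    
    def maybe_spill(current_chunk, memory_limit_mb, get_used_memory):
        # Simplified stand-in for the spill branch in the sorted() loop: dump the
        # chunk, drop the in-memory copy, then force a GC pass so the next
        # get_used_memory() reading reflects the freed objects.
        if get_used_memory() <= memory_limit_mb:
            return current_chunk, None
    
        current_chunk.sort()
        path = os.path.join(tempfile.mkdtemp(), "chunk.bin")
        with open(path, "wb") as f:
            pickle.dump(current_chunk, f)
    
        current_chunk = []  # release the data that was just written out
        gc.collect()        # reclaim it now, so later spill decisions are accurate
        return current_chunk, path
    ```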




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17457126
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -68,6 +68,11 @@ def _get_local_dirs(sub):
         return [os.path.join(d, "python", str(os.getpid()), sub) for d in dirs]
     
     
    +# global stats
    +MemoryBytesSpilled = 0L
    +DiskBytesSpilled = 0L
    +
    +
    --- End diff --
    
    In Python, there is no TaskContext or anything like that; I haven't found a better way to do this yet.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2336




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17516995
  
    --- Diff: python/pyspark/worker.py ---
    @@ -27,12 +27,11 @@
     # copy_reg module.
     from pyspark.accumulators import _accumulatorRegistry
     from pyspark.broadcast import Broadcast, _broadcastRegistry
    -from pyspark.cloudpickle import CloudPickler
    --- End diff --
    
    cloudpickle is imported by serializers, so it's not needed here. The comments have been removed as well.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17452130
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -68,6 +68,11 @@ def _get_local_dirs(sub):
         return [os.path.join(d, "python", str(os.getpid()), sub) for d in dirs]
     
     
    +# global stats
    +MemoryBytesSpilled = 0L
    +DiskBytesSpilled = 0L
    +
    +
    --- End diff --
    
    Do these have to be global properties? I'm not familiar with this part of the Python code, but in the Scala version they're part of `TaskContext`. Is there not an equivalent here?




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55351136
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20185/consoleFull) for   PR 2336 at commit [`1245eb7`](https://github.com/apache/spark/commit/1245eb7df252876b54651393385fb0600130f6da).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class CreateTableAsSelect(`
      * `case class CreateTableAsSelect(`





[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17514735
  
    --- Diff: python/pyspark/worker.py ---
    @@ -27,12 +27,11 @@
     # copy_reg module.
     from pyspark.accumulators import _accumulatorRegistry
     from pyspark.broadcast import Broadcast, _broadcastRegistry
    -from pyspark.cloudpickle import CloudPickler
    --- End diff --
    
    A few lines prior to this, there was a comment
    
    ```
    # CloudPickler needs to be imported so that depicklers are registered using the
    # copy_reg module.
    ```
    
    If this import is no longer necessary (was it ever?), then we should delete that comment, too.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55346987
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20185/consoleFull) for   PR 2336 at commit [`1245eb7`](https://github.com/apache/spark/commit/1245eb7df252876b54651393385fb0600130f6da).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55220824
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/65/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55222663
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20139/consoleFull) for   PR 2336 at commit [`7e4ad04`](https://github.com/apache/spark/commit/7e4ad040c0a7e38bb86128659e85d773137dfbab).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55220766
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/65/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55193605
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/36/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55515004
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20296/consoleFull) for   PR 2336 at commit [`e37df38`](https://github.com/apache/spark/commit/e37df38f32141cb1b303d9ea73ce20ffdf41229e).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17509198
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -68,6 +68,11 @@ def _get_local_dirs(sub):
         return [os.path.join(d, "python", str(os.getpid()), sub) for d in dirs]
     
     
    +# global stats
    +MemoryBytesSpilled = 0L
    +DiskBytesSpilled = 0L
    +
    +
    --- End diff --
    
    I'd like to merge the worker re-use patch soon, so we should fix this.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55192472
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/35/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17508565
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -68,6 +68,11 @@ def _get_local_dirs(sub):
         return [os.path.join(d, "python", str(os.getpid()), sub) for d in dirs]
     
     
    +# global stats
    +MemoryBytesSpilled = 0L
    +DiskBytesSpilled = 0L
    +
    +
    --- End diff --
    
    This will need to be fixed if we reuse Python workers, but it should be okay for now. 




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55219205
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/63/consoleFull) for   PR 2336 at commit [`fbe9029`](https://github.com/apache/spark/commit/fbe902942449d8b112a4f6ca1e0be6968218213c).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55222753
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20139/consoleFull) for   PR 2336 at commit [`7e4ad04`](https://github.com/apache/spark/commit/7e4ad040c0a7e38bb86128659e85d773137dfbab).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2336#discussion_r17457077
  
    --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
    @@ -242,7 +242,8 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {
                 t.taskMetrics)
     
               // Overwrite task metrics
    -          t.taskMetrics = Some(taskMetrics)
    +          // FIXME: deepcopy the metrics, or they will be the same object in local mode
    +          t.taskMetrics = Some(scala.util.Marshal.load[TaskMetrics](scala.util.Marshal.dump(taskMetrics)))
    --- End diff --
    
    I will rebase it after #2338 is merged.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55516000
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20296/consoleFull) for   PR 2336 at commit [`e37df38`](https://github.com/apache/spark/commit/e37df38f32141cb1b303d9ea73ce20ffdf41229e).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2336#issuecomment-55508153
  
    This looks good to me.  

