Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/08/13 02:32:03 UTC

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/1912

    [SPARK-1065] [PySpark] improve supporting for large broadcast

    Passing large objects through Py4J is very slow (and uses a lot of memory), so broadcast objects are passed via files instead (similar to parallelize()).
    
    Add an option to keep the object in the driver (False by default) to save driver memory.
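    A minimal usage sketch under the proposed API (the example values and the `keep` flag shown here follow this PR's description, not the released API):
    
    ```python
    from pyspark import SparkContext
    
    sc = SparkContext('local', 'broadcast-demo')
    lookup = {i: i * i for i in range(1000000)}   # a large driver-side object
    
    # keep=False (the default) drops the driver-side copy to save memory;
    # pass keep=True if the driver still needs to read b.value afterwards.
    b = sc.broadcast(lookup, keep=True)
    
    rdd = sc.parallelize([1, 2, 3])
    print(rdd.map(lambda x: b.value[x]).collect())  # workers read the broadcast
    print(b.value[5])  # readable in the driver only because keep=True
    ```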

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark broadcast

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1912.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1912
    
----
commit 62261891e7cfd89935676f77eddf2e2a66f7f9d2
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-13T00:25:55Z

    improve large broadcast
    
    Passing large objects through Py4J is very slow (and uses a lot
    of memory), so broadcast objects are passed via files instead
    (similar to parallelize()).
    
    Add an option to keep the object in the driver (False by default)
    to save driver memory.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52409343
  
    Actually, I'm just going to merge this now and add the docstring as part of a subsequent documentation-improvement PR (I also want to edit some Scala / Java docs).


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52001814
  
    The failed tests were not related to this PR.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-51999152
  
    QA results for PR 1912:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18398/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-51996483
  
    QA tests have started for PR 1912. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18398/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52409390
  
    I've merged this into `master` and `branch-1.1`.  Thanks!


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52094561
  
    @frol, yes, thanks again!


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52101740
  
    QA results for PR 1912:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class CompressedSerializer(FramedSerializer):
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18456/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52385422
  
    Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52273858
  
    It occurs to me: what if we had .value retrieve and depickle the value from the JVM?  Also, won't we still experience memory leaks in the JVM if we iteratively create broadcast variables, since we will never clean up those pickled values?
    
    One approach is to have .value() depickle the JVM value (so we're not changing the user-facing API) and add a Python equivalent of Broadcast.destroy() for performing permanent cleanup of a broadcast's resources.  What do you think of this approach?
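    A rough sketch of what that could look like (the attribute names and JVM calls here are illustrative assumptions, not this PR's actual code):
    
    ```python
    class Broadcast(object):
        def __init__(self, bid, java_broadcast, serializer):
            self.bid = bid
            self._jbroadcast = java_broadcast
            self._ser = serializer
    
        @property
        def value(self):
            # fetch the pickled bytes back from the JVM and depickle on
            # demand, so the user-facing API stays unchanged
            return self._ser.loads(bytes(self._jbroadcast.value()))
    
        def destroy(self):
            # permanent cleanup of this broadcast's resources, mirroring
            # Scala's Broadcast.destroy()
            self._jbroadcast.destroy()
    ```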


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52076888
  
    QA tests have started for PR 1912. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18447/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1912


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16279926
  
    --- Diff: python/pyspark/broadcast.py ---
    @@ -52,17 +47,31 @@ class Broadcast(object):
         Access its value through C{.value}.
         """
     
    -    def __init__(self, bid, value, java_broadcast=None, pickle_registry=None):
    +    def __init__(self, bid, value, java_broadcast=None, pickle_registry=None, keep=True):
             """
             Should not be called directly by users -- use
             L{SparkContext.broadcast()<pyspark.context.SparkContext.broadcast>}
             instead.
             """
    -        self.value = value
             self.bid = bid
    +        if keep:
    +            self.value = value
             self._jbroadcast = java_broadcast
             self._pickle_registry = pickle_registry
    +        self.keep = keep
     
         def __reduce__(self):
             self._pickle_registry.add(self)
             return (_from_id, (self.bid, ))
    +
    +    def __getattr__(self, item):
    +        if item == 'value' and not self.keep:
    +            raise Exception("please create broadcast with keep=True to make"
    +                            " it accessable in driver")
    --- End diff --
    
    Typo: should be spelled "accessible."  Also, maybe the error message could be a little clearer about how broadcast variables are created and why the call failed.  I'm thinking of something like "please call sc.broadcast() with keep=True to make values accessible in the driver".
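    For example (a sketch of the suggested wording; the trailing AttributeError is an assumed fallthrough for other missing attributes):
    
    ```python
    def __getattr__(self, item):
        if item == 'value' and not self.keep:
            raise Exception("please call sc.broadcast() with keep=True to"
                            " make values accessible in the driver")
        raise AttributeError(item)
    ```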


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52377931
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18647/consoleFull) for   PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52016718
  
    I was talking to Jenkins when I said "test this please", but thanks @davies for adding tests too.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52076907
  
    QA results for PR 1912:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class CompressedSerializer(FramedSerializer):
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18447/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52110417
  
    QA results for PR 1912:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class CompressedSerializer(FramedSerializer):
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18460/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52016367
  
    QA tests have started for PR 1912. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18430/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52019343
  
    QA results for PR 1912:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18430/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52356505
  
    Hmm, looks like this was affected by the Jenkins timeouts last night.
    
    Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52377800
  
    Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by frol <gi...@git.apache.org>.
Github user frol commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52119630
  
    @davies Compression improved things, but my tasks have heavy computations inside, so it saved only 10 seconds on a 4.5-minute task and about 10-20 seconds on an 18-minute task. In both cases I have only 340 partitions.
    
    I'm still investigating where the second copy of my fat object is, because I can easily see it in comparison with my local tests. Also, if I cut my big object in half, the memory consumption decreases as if it had been cut to a quarter in a local run.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52385516
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18666/consoleFull) for   PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16199737
  
    --- Diff: python/pyspark/context.py ---
    @@ -562,17 +562,24 @@ def union(self, rdds):
             rest = ListConverter().convert(rest, self._gateway._gateway_client)
             return RDD(self._jsc.union(first, rest), self, rdds[0]._jrdd_deserializer)
     
    -    def broadcast(self, value):
    +    def broadcast(self, value, keep=False):
             """
             Broadcast a read-only variable to the cluster, returning a
             L{Broadcast<pyspark.broadcast.Broadcast>}
    -        object for reading it in distributed functions. The variable will be
    -        sent to each cluster only once.
    +        object for reading it in distributed functions. The variable will
    +        be sent to each cluster only once.
    +
    +        :keep: Keep the `value` in driver or not.
             """
    -        pickleSer = PickleSerializer()
    -        pickled = pickleSer.dumps(value)
    -        jbroadcast = self._jsc.broadcast(bytearray(pickled))
    -        return Broadcast(jbroadcast.id(), value, jbroadcast, self._pickled_broadcast_vars)
    +        ser = CompressedSerializer(PickleSerializer())
    +        # pass large object by py4j is very slow and need much memory
    +        tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    +        ser.dump_stream([value], tempFile)
    +        tempFile.close()
    +        jbroadcast = self._jvm.PythonRDD.readBroadcastFromFile(self._jsc, tempFile.name)
    +        os.unlink(tempFile.name)
    +        return Broadcast(jbroadcast.id(), value if keep else None,
    --- End diff --
    
    If users' existing code relies on accessing Broadcast.value on the driver, I'm worried that this will confuse them when they receive None.  Maybe we can add a None check to Broadcast.value that throws an exception if they try to access the value in these cases.
    
    I suppose there's still a corner case where a user might want to broadcast None, so if we're being paranoid we could also pass the `keep` flag to Broadcast and check it before throwing the exception.
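    A sketch of that paranoid variant, checking the `keep` flag rather than value-is-None so that broadcasting None stays legal (the parameter names follow the diff above; the property-based layout is an assumption):
    
    ```python
    class Broadcast(object):
        def __init__(self, bid, value, java_broadcast=None,
                     pickle_registry=None, keep=True):
            self.bid = bid
            self.keep = keep
            if keep:
                self._value = value   # None remains a legal broadcast value
            self._jbroadcast = java_broadcast
            self._pickle_registry = pickle_registry
    
        @property
        def value(self):
            if not self.keep:
                raise Exception("broadcast was created with keep=False;"
                                " its value is not available in the driver")
            return self._value
    ```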


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52017380
  
    LoL, I realized this just after pushing the commit :)


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16199114
  
    --- Diff: python/pyspark/broadcast.py ---
    @@ -19,18 +19,13 @@
     >>> from pyspark.context import SparkContext
     >>> sc = SparkContext('local', 'test')
     >>> b = sc.broadcast([1, 2, 3, 4, 5])
    ->>> b.value
    -[1, 2, 3, 4, 5]
    -
    ->>> from pyspark.broadcast import _broadcastRegistry
    --- End diff --
    
    Good call here; it was a bad idea to expose these internals in user-facing module doctests.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by frol <gi...@git.apache.org>.
Github user frol commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52251712
  
    @JoshRosen No, I'm not noticing any broadcast performance issues now. PySpark works like a charm again. Thank you!


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52409331
  
    I guess we don't necessarily want to expose `destroy()` to the end-user, since it's private in the Scala APIs.  I suppose we might still be leaking broadcast variables in the driver's JVM, but I think that's a problem that affects Scala/Java jobs as well, so maybe we can address it more generally in a separate PR.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52015135
  
    test this please


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16199423
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1809,7 +1809,8 @@ def _jrdd(self):
                 self._jrdd_deserializer = NoOpSerializer()
             command = (self.func, self._prev_jrdd_deserializer,
                        self._jrdd_deserializer)
    -        pickled_command = CloudPickleSerializer().dumps(command)
    +        ser = CompressedSerializer(CloudPickleSerializer())
    --- End diff --
    
    This is a good idea.  It wouldn't surprise me if the pickle data was highly compressible due to frequently occurring groups of pickle opcodes, etc.
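    A minimal sketch of such a wrapper (the zlib codec and compression level here are assumptions about this PR's CompressedSerializer, not confirmed implementation details):
    
    ```python
    import zlib
    
    from pyspark.serializers import FramedSerializer
    
    
    class CompressedSerializer(FramedSerializer):
        """Wrap another serializer and zlib-compress each frame it writes."""
    
        def __init__(self, serializer):
            FramedSerializer.__init__(self)
            self.serializer = serializer
    
        def dumps(self, obj):
            # level 1 favors speed over ratio, which suits large pickles
            return zlib.compress(self.serializer.dumps(obj), 1)
    
        def loads(self, obj):
            return self.serializer.loads(zlib.decompress(obj))
    ```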


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by frol <gi...@git.apache.org>.
Github user frol commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52079674
  
    @davies I am about to test it again with CompressedSerializer. Am I right that I don't need to change anything in my project, but just rebuild Spark?


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52276573
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18598/consoleFull) for   PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by frol <gi...@git.apache.org>.
Github user frol commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52122539
  
    @davies I'm talking about memory in the Python workers, and it turned out to be my own issue. (I figured out that my local test had a mistake; after fixing it, the local test and the Spark Python workers consume the same amount of memory.) Sorry for the confusion.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52099498
  
    This looks good to me and I'm really glad to read the [JIRA comments](https://issues.apache.org/jira/browse/SPARK-1065?focusedCommentId=14095063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14095063) saying how it sped things up.
    
    I left one minor usability-related comment, but otherwise this looks good to me.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52103597
  
    QA tests have started for PR 1912. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18460/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52386383
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18666/consoleFull) for   PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class Serializer `
      * `abstract class SerializerInstance `
      * `abstract class SerializationStream `
      * `abstract class DeserializationStream `
      * `class CompressedSerializer(FramedSerializer):`



[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52248668
  
    @frol After fixing your local test, are you still noticing any broadcast performance issues?  If you still see any odd behavior, could you post a small script or set of pyspark shell commands so we can test it out?


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52379131
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18647/consoleFull) for   PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait TaskCompletionListener extends EventListener `
      * `class AvroWrapperToJavaConverter extends Converter[Any, Any] `
      * `class CompressedSerializer(FramedSerializer):`



[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16159823
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -315,6 +315,15 @@ private[spark] object PythonRDD extends Logging {
         JavaRDD.fromRDD(sc.sc.parallelize(objs, parallelism))
       }
     
    +  def readBroadcastFromFile(sc: JavaSparkContext, filename: String):
    +  Broadcast[Array[Byte]] = {
    --- End diff --
    
    does this fit in the line above?


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52095557
  
    QA tests have started for PR 1912. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18456/consoleFull


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52277037
  
    I have added Broadcast.unpersist(blocking=False).
    
    Because we keep a copy on disk, we can read it from there when the user wants to access the value in the driver; that way we can keep SparkContext.broadcast() unchanged.
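    A sketch of that lazy-reload idea (the `path` attribute and serializers mirror the diffs in this thread, but the exact reload code is an assumption):
    
    ```python
    from pyspark.serializers import CompressedSerializer, PickleSerializer
    
    
    class Broadcast(object):
        def __init__(self, bid, value, java_broadcast=None,
                     pickle_registry=None, path=None):
            self.bid = bid
            if path is None:
                self.value = value            # value kept in memory
            self._jbroadcast = java_broadcast
            self._pickle_registry = pickle_registry
            self.path = path
    
        def unpersist(self, blocking=False):
            # drop the cached copies on the executors
            self._jbroadcast.unpersist(blocking)
    
        def __getattr__(self, item):
            # only reached when `value` was never set, i.e. path is not None
            if item == 'value' and self.path is not None:
                ser = CompressedSerializer(PickleSerializer())
                with open(self.path, 'rb') as f:
                    value = next(ser.load_stream(f))
                self.value = value            # cache for later accesses
                return value
            raise AttributeError(item)
    ```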


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1912#issuecomment-52121865
  
    @frol, the big win from compression may be the memory saved in the JVM. It's also a win as long as it does not increase the runtime. In the future we could try LZ4; it may help the runtime a little, but it will not contribute much in your case.
    
    Which memory are you talking about: the Python driver, the JVM, or the Python workers?


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16200173
  
    --- Diff: python/pyspark/context.py ---
    @@ -562,17 +562,24 @@ def union(self, rdds):
             rest = ListConverter().convert(rest, self._gateway._gateway_client)
             return RDD(self._jsc.union(first, rest), self, rdds[0]._jrdd_deserializer)
     
    -    def broadcast(self, value):
    +    def broadcast(self, value, keep=False):
             """
             Broadcast a read-only variable to the cluster, returning a
             L{Broadcast<pyspark.broadcast.Broadcast>}
    -        object for reading it in distributed functions. The variable will be
    -        sent to each cluster only once.
    +        object for reading it in distributed functions. The variable will
    +        be sent to each cluster only once.
    +
    +        :keep: Keep the `value` in driver or not.
             """
    -        pickleSer = PickleSerializer()
    -        pickled = pickleSer.dumps(value)
    -        jbroadcast = self._jsc.broadcast(bytearray(pickled))
    -        return Broadcast(jbroadcast.id(), value, jbroadcast, self._pickled_broadcast_vars)
    +        ser = CompressedSerializer(PickleSerializer())
    +        # pass large object by py4j is very slow and need much memory
    +        tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    +        ser.dump_stream([value], tempFile)
    +        tempFile.close()
    +        jbroadcast = self._jvm.PythonRDD.readBroadcastFromFile(self._jsc, tempFile.name)
    +        os.unlink(tempFile.name)
    +        return Broadcast(jbroadcast.id(), value if keep else None,
    --- End diff --
    
    Yes, we could show a better message when the user tries to access b.value in the driver.


[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1912#discussion_r16327343
  
    --- Diff: python/pyspark/broadcast.py ---
    @@ -52,17 +50,38 @@ class Broadcast(object):
         Access its value through C{.value}.
         """
     
    -    def __init__(self, bid, value, java_broadcast=None, pickle_registry=None):
    +    def __init__(self, bid, value, java_broadcast=None,
    +                 pickle_registry=None, path=None):
             """
             Should not be called directly by users -- use
             L{SparkContext.broadcast()<pyspark.context.SparkContext.broadcast>}
             instead.
             """
    -        self.value = value
             self.bid = bid
    +        if path is None:
    +            self.value = value
             self._jbroadcast = java_broadcast
             self._pickle_registry = pickle_registry
    +        self.path = path
    +
    +    def unpersist(self, blocking=False):
    --- End diff --
    
    Can you add a docstring?  It's fine to just copy it over from the Scala equivalent.  In this case:
    
    ```scala
      /**
       * Delete cached copies of this broadcast on the executors. If the broadcast is used after
       * this is called, it will need to be re-sent to each executor.
       * @param blocking Whether to block until unpersisting has completed
       */
    ```
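    Carried over to Python, that might look like (a sketch; the `_jbroadcast` delegation follows the diff above):
    
    ```python
    def unpersist(self, blocking=False):
        """
        Delete cached copies of this broadcast on the executors. If the
        broadcast is used after this is called, it will need to be re-sent
        to each executor.
    
        :param blocking: Whether to block until unpersisting has completed
        """
        self._jbroadcast.unpersist(blocking)
    ```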

