Posted to reviews@spark.apache.org by pwoody <gi...@git.apache.org> on 2016/05/18 20:39:53 UTC

[GitHub] spark pull request: [SPARK-15038] Add option to lazily broadcast a...

GitHub user pwoody opened a pull request:

    https://github.com/apache/spark/pull/13178

    [SPARK-15038] Add option to lazily broadcast at SQL execution time

    ## What changes were proposed in this pull request?
    
    This allows the user to specify whether to do the broadcasting of a small DataFrame asynchronously at query preparation time or during execution time.
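    A minimal usage sketch of how such an option might be toggled from the user's side (the config key name and the SparkSession-based API below are illustrative assumptions, not necessarily what this patch adds):

        // Hypothetical sketch; "spark.sql.broadcast.lazyExecution" is an assumed key name.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.broadcast

        val spark = SparkSession.builder()
          .master("local[*]")
          .config("spark.sql.broadcast.lazyExecution", "true") // assumed option name
          .getOrCreate()

        val small = spark.range(100).toDF("k")
        val large = spark.range(1000000).toDF("k")

        // With the option enabled, the small side would be broadcast when the join
        // executes rather than eagerly while the query plan is being prepared.
        large.join(broadcast(small), "k").count()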
    
    ## How was this patch tested?
    
    Manual testing 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pwoody/spark feature/lazyBroadcast

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13178.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13178
    
----
commit 237cd561960880ac5d2ddd9bc6fbd6936f2fa1ba
Author: Patrick Woody <pw...@palantir.com>
Date:   2016-05-18T20:34:59Z

    [SPARK-15038] Add option to lazily broadcast at SQL execution time

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by pwoody <gi...@git.apache.org>.
Github user pwoody commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220170469
  
    I could be wrong, but I believe the main difference between this patch and limiting the thread pool is that, with the thread pool approach, the broadcasts are still retained in the memory manager through the reference held in the plan node (which sticks around until execution completes). With the change here, each broadcast can be GC'd and properly evicted from the memory manager as its node finishes executing.
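    To illustrate the difference, here is a conceptual sketch (not the actual join-node code) of why an eagerly started broadcast stays pinned while a lazily created one can be cleaned up:

        import scala.concurrent.{ExecutionContext, Future}
        import org.apache.spark.SparkContext

        // Eager/async: the broadcast is kicked off when the plan is prepared, and the
        // node keeps a reference to it, so the memory manager cannot evict it until
        // the whole plan is released.
        class EagerNodeSketch(sc: SparkContext, rows: Array[Int])(implicit ec: ExecutionContext) {
          val broadcastFuture: Future[org.apache.spark.broadcast.Broadcast[Array[Int]]] =
            Future(sc.broadcast(rows))
        }

        // Lazy: the broadcast only exists once execute() runs, so the reference can be
        // dropped, GC'd, and cleaned by the ContextCleaner as soon as the node is done.
        class LazyNodeSketch(sc: SparkContext, rows: Array[Int]) {
          def execute(): Int = {
            val b = sc.broadcast(rows)
            val n = b.value.length
            b.unpersist()   // eligible for eviction as soon as this node has executed
            n
          }
        }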
    
    Re: bug fix vs. improvement - this issue prevents certain large queries (hundreds of broadcasts) from reasonably being completed with DataFrames. This change may count as an improvement, but I think something should be backported to 1.6.x to mitigate the issue.




[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220152418
  
    @pwoody any reason why you did this against 1.6?




[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220151725
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220178472
  
    Can you explain more about the memory differences?




[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220157698
  
    This is an improvement, not a bug fix; we only backport improvements in rare cases. It could be added to Spark 2.1, since we are already in QA for Spark 2.0.
    
    In Spark 2.0 we did some work in this area and added a `BroadcastExchangeExec` node, which takes care of all broadcasting. I think it would be quite easy to add this feature there.
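    As a rough sketch of the idea (names below are illustrative, not the real `BroadcastExchangeExec` internals), the change would amount to deferring the broadcast future until execution instead of starting it when the exchange node is prepared:

        import scala.concurrent.{ExecutionContext, Future}

        class BroadcastExchangeSketch[T](buildRelation: () => T, lazyBroadcast: Boolean)
                                        (implicit ec: ExecutionContext) {
          // Eager mode: the future starts as soon as the node is constructed (preparation time).
          private val eagerFuture: Option[Future[T]] =
            if (lazyBroadcast) None else Some(Future(buildRelation()))

          // Lazy mode: the relation is only built the first time execution asks for it.
          private lazy val lazyRelation: T = buildRelation()

          def executeBroadcast(): Future[T] =
            eagerFuture.getOrElse(Future.successful(lazyRelation))
        }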
    





[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220166142
  
    Hm, maybe instead of this we just need to make the size of the thread pool for broadcasting configurable? If it is set to 1, that effectively turns broadcasting into sequential execution.
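    For reference, this is roughly what that alternative would look like (a sketch, assuming the broadcast work is submitted as futures to a dedicated pool):

        import java.util.concurrent.Executors
        import scala.concurrent.{ExecutionContext, Future}

        // A single-threaded pool queues the broadcast jobs, so at most one broadcast
        // is being built (and held in memory) at a time, even if hundreds are submitted.
        val broadcastPool: ExecutionContext =
          ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(1))

        val futures = (1 to 5).map(i => Future { s"build broadcast $i" }(broadcastPool))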





[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by pwoody <gi...@git.apache.org>.
Github user pwoody commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220153438
  
    Yeah - I was hoping for this to be part of a 1.6.2 release. I will happily do this against master as well - I'm not sure of the protocol when the diff diverges between branches.




[GitHub] spark pull request: [SPARK-15038][SQL] Add option to lazily broadc...

Posted by pwoody <gi...@git.apache.org>.
Github user pwoody commented on the pull request:

    https://github.com/apache/spark/pull/13178#issuecomment-220473754
  
    Hey apologies for the late reply!
    
    There are effectively two failure modes w.r.t. the memory manager:
    1. If you retain the SparkPlan, then the broadcast references for the eager/async broadcasts are retained as well and aren't cleaned by the ContextCleaner.
    2. With async broadcasting, many broadcasts are pushed onto disk storage before they get used (sketched below). In normal execution they simply hit memory and are issued out to the nodes (almost) immediately.
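    A minimal sketch of the second failure mode, assuming the broadcasts are kicked off as futures during preparation:

        import scala.concurrent.{ExecutionContext, Future}
        import org.apache.spark.SparkContext

        // Async preparation: all broadcasts are materialized in the block manager up
        // front, so with hundreds of tables they spill to disk before any is used.
        def prepareAllUpFront(sc: SparkContext, tables: Seq[Array[Byte]])
                             (implicit ec: ExecutionContext) =
          tables.map(t => Future(sc.broadcast(t)))

        // Execution-time broadcasting instead builds each broadcast only when its join
        // runs, so it hits memory and is shipped to the nodes almost immediately.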




[GitHub] spark pull request #13178: [SPARK-15038][SQL] Add option to lazily broadcast...

Posted by pwoody <gi...@git.apache.org>.
Github user pwoody closed the pull request at:

    https://github.com/apache/spark/pull/13178

