Posted to reviews@spark.apache.org by mmalohlava <gi...@git.apache.org> on 2014/10/07 09:22:00 UTC

[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

GitHub user mmalohlava opened a pull request:

    https://github.com/apache/spark/pull/2691

    [SPARK-3270] Spark API for Application Extensions

    SPARK-3270: Initial proposal of application extensions.
        
    The change set introduces:
      * a Spark extension API for applications to implement
      * a hook in the Executor to handle the extension lifecycle
      * a method to specify extensions via SparkConf
      * a 'spark.extensions' configuration variable to pass the extension
        list to the Spark context
      * a test verifying that an extension is correctly started inside the executor lifecycle
        
    For more details, please follow SPARK-3270 or the design document:
        https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing
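
    As a rough illustration of the shape such an API could take (the names below are hypothetical placeholders, not the identifiers from the patch; the design document has the real proposal):

    ```
    import org.apache.spark.SparkConf

    // Hypothetical sketch only: trait and method names are placeholders,
    // not the identifiers used in the patch.
    trait SparkExtension extends Serializable {
      /** Called by the executor once it has started. */
      def start(conf: SparkConf): Unit
      /** Called by the executor before it shuts down. */
      def stop(): Unit
    }

    // An application would register its extension class via the
    // 'spark.extensions' variable introduced by this change set:
    val conf = new SparkConf()
      .setAppName("app-with-extension")
      .set("spark.extensions", "com.example.MyExtension")
    ```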

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/0xdata/perrier core_ext

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2691.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2691
    
----
commit 255357e7f1451b592bdd7374b5007aa3ce63690b
Author: mmalohlava <mi...@gmail.com>
Date:   2014-10-03T01:53:02Z

    SPARK-3270: Initial proposal of application extensions.
    
    The commit introduces:
     - a Spark extension API for applications to implement
     - a hook in the Executor to handle the extension lifecycle
     - a method to specify extensions via SparkConf
     - a 'spark.extensions' configuration variable to pass the extension
       list to the Spark context
    
    For more details, please follow SPARK-3270 or the design document:
    https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing

commit 532d352936b47c9b38635976aa33e9010fd6e81a
Author: mmalohlava <mi...@gmail.com>
Date:   2014-10-07T00:02:06Z

    SPARK-3270: test suite for application extensions
    
    A basic test suite verifying that a given extension
    is started on all executors.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by mmalohlava <gi...@git.apache.org>.
Github user mmalohlava closed the pull request at:

    https://github.com/apache/spark/pull/2691




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-58146963
  
    This seems quite heavyweight compared to Patrick's suggestion of just using a static object. Why the need for custom class-loading logic, which even opens up security questions?
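
    For reference, the "static object" pattern is the usual Spark idiom of a JVM-wide singleton that initializes lazily the first time any task on an executor touches it; presumably Patrick's suggestion is along these lines (the service class here is a hypothetical illustration):

    ```
    // Hypothetical service, used only for illustration.
    class MyService {
      def start(): Unit = println("service started in this executor JVM")
    }

    object MyServiceHolder {
      // A lazy val initializes at most once per JVM, so the service starts
      // the first time any task on a given executor references it.
      lazy val service: MyService = {
        val s = new MyService
        s.start()
        s
      }
    }

    // Forcing initialization once per executor from a job:
    // rdd.foreachPartition { _ => MyServiceHolder.service }
    ```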




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by mmalohlava <gi...@git.apache.org>.
Github user mmalohlava commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-69623687
  
    Hi @pwendell, thanks for the comment!
    
    `invokeOnEachExecutor` is exactly what we need!
    
    I am now using a private RDD and invoking tasks on individual executors.
    Nevertheless, a publicly available `invokeOnEachExecutor` API method would be great.
    
    Thanks for your suggestion!





[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by mmalohlava <gi...@git.apache.org>.
Github user mmalohlava commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-61567145
  
    Sorry for the delayed answer. I was trying to find a better solution that does not require modifying Spark.
    
    However, regarding Sean's question:
      * In our case we need to collect the actual distributed state of the cluster (the approximate number of executors) to properly initialize services on all available executors. The big picture of our use case: the proposed solution starts a defined service on each executor; the services exchange information with the master (collecting the number of available executors and their ids), and based on that we reconfigure the services in the cluster (they require the number of available Spark executors).
    
      * I do not see a major security problem in the class loading, since Spark already loads classes in the executor from the classpath specified via the `--jars` and `--files` parameters. The proposed solution uses the same mechanism.
    
    Nevertheless, in the meantime I was experimenting with a solution based on Patrick's idea. It works in the following way (see the sketch at the end of this comment):
      * create a dummy RDD with many partitions (i.e., try to force the scheduler to plan execution on all available executors)
      * run a `map` operation on the RDD to collect the unique executor ids and the approximate number of executors
      * run another `map` which starts our service only on the collected executors
    
    *The advantage of this solution:*
      * it does not need any modification of the Spark infrastructure
    
    *The major disadvantages of this solution:*
      * it depends directly on task scheduling; in the worst case, the scheduler will plan the initialization on only 1 of the available executors
      * it is a hidden solution which does not expose the running services; it collects only an approximation of the cluster state
      * the overhead of creating a dummy RDD with many partitions and running two map operations
    
    From my point of view, it would be much cleaner and more beneficial to have a solution which explicitly allows interception of the executor lifecycle.
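
    A minimal sketch of this two-pass approach (illustrative only, not our production code; `startServiceIfNotRunning` is a placeholder for the real service hook):

    ```
    import org.apache.spark.{SparkContext, SparkEnv}

    object ServiceInit extends Serializable {
      // Placeholder for the real per-JVM service hook.
      def startServiceIfNotRunning(): Unit = ()

      def startOnAllExecutors(sc: SparkContext, numPartitions: Int): Set[String] = {
        // Many small partitions, hoping the scheduler spreads them over all executors.
        val dummy = sc.parallelize(1 to numPartitions, numPartitions)

        // Pass 1: collect the (approximate) set of executor ids that ran a task.
        val executorIds = dummy.map(_ => SparkEnv.get.executorId).collect().toSet

        // Pass 2: start the service, at most once per executor observed in pass 1.
        dummy.foreachPartition { _ =>
          if (executorIds.contains(SparkEnv.get.executorId)) {
            startServiceIfNotRunning()
          }
        }
        executorIds
      }
    }
    ```

    Note the sketch is still best-effort: a pass-2 task can land on an executor that pass 1 never saw, which is exactly the scheduling dependence listed above.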





[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-58146366
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by mmalohlava <gi...@git.apache.org>.
Github user mmalohlava commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-70600014
  
    Sure, let me create a PR for this feature.
    
    In the meantime, here is a prototype (tailored to our context): a private `InvokeOnAllNodesRDD` (https://github.com/h2oai/sparkling-water/blob/master/core/src/main/scala/org/apache/spark/h2o/InvokeOnNodesRDD.scala) which holds information about the location of individual partitions. Nevertheless, in this case I have to provide the list of executors in advance; it is collected from the list of block managers plus a few nasty round trips around the actual cluster (not shown in the code).
    
    Does it fit your idea?
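
    For context, the block-manager-based collection can be approximated like this (a hedged sketch using the Spark 1.x `getExecutorStorageStatus` developer API; the actual prototype differs and does more work):

    ```
    import org.apache.spark.SparkContext

    // Approximate the executor list from the registered block managers.
    // The driver registers a block manager too, so it is filtered out here
    // (its executor id was "<driver>" in Spark 1.x).
    def executorHosts(sc: SparkContext): Seq[String] =
      sc.getExecutorStorageStatus
        .filter(_.blockManagerId.executorId != "<driver>")
        .map(_.blockManagerId.host)
        .toSeq
    ```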





[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-96770169
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by mmalohlava <gi...@git.apache.org>.
Github user mmalohlava commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-96894507
  
    @srowen right now we are using the idea proposed by Patrick above. I would be happier to have a way to listen to the Spark executor lifecycle, but it is not a blocking issue right now, so I am closing this PR.




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-68088737
  
    Hey @mmalohlava, what if we exposed a mechanism for running a code block once on each active executor? This is something that has already been requested for other reasons in Spark, and it's a somewhat narrower API to expose. Then, once you are about to run your application, you could just invoke some static initialization on each executor using this mechanism.
    
    For instance:
    
    ```
    /** Invoke function f on each executor and return a map from executor id to f's result. */
    sc.invokeOnEachExecutor[T](f: () => T): Map[String, T]
    ```
    
    You could implement this using a custom (private) RDD type that creates one task per executor with the corresponding location preference (a rough sketch of that shape follows below).
    
    Would that work?
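
    A hedged sketch of the shape such an RDD could take (illustrative only; host-based preferred locations are advisory, so the scheduler is not strictly guaranteed to honor them):

    ```
    import scala.reflect.ClassTag

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One partition per known executor, pinned to that executor's host.
    private class ExecutorPartition(override val index: Int, val host: String)
      extends Partition

    private class OnEachExecutorRDD[T: ClassTag](
        sc: SparkContext,
        hosts: Seq[String],  // one entry per executor, provided by the caller
        f: () => T)
      extends RDD[T](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        hosts.zipWithIndex
          .map { case (h, i) => new ExecutorPartition(i, h): Partition }
          .toArray

      // Advisory preference: ask the scheduler to run this partition on its host.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[ExecutorPartition].host)

      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        Iterator(f())
    }
    ```

    The driver-side `invokeOnEachExecutor` would then run a job over this RDD and zip the results with the executor ids.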




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-58572830
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3270] Spark API for Application Extensi...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-96775943
  
    @mmalohlava Is there any update on this? If you're not going to take it forward, do you mind closing this PR?

