Posted to reviews@spark.apache.org by chesterxgchen <gi...@git.apache.org> on 2014/10/13 20:56:15 UTC

[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

GitHub user chesterxgchen opened a pull request:

    https://github.com/apache/spark/pull/2786

    [SPARK-3913] Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and KillApplication APIs

    Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and KillApplication APIs
    
    When working with Spark in YARN cluster mode, we have the following issues:
    1) We don't know the YARN maximum capacity (memory and cores) before we specify the number of executors and the memory for the Spark driver and executors. If we set a number that is too large, the job can exceed the limit and get killed.
    It would be better to let the application know the YARN resource capacity ahead of time, so the Spark config can be adjusted dynamically.
    2) Once the job has started, we would like some feedback from the YARN application. Currently, the Spark client essentially blocks the call and only returns when the job is finished, failed, or killed.
    If the job runs for a few hours, we have no idea how far it has gone: the progress, resource usage, tracking URL, etc. This pull request does not completely solve issue #2, but it does expose the YARN application status, such as when the job is started, killed, or finished, the tracking URL, and some limited progress reporting (on CDH5 we found that the progress only reports 0%, 10%, and 100%).
    
    I will open another pull request to address the YARN application and Spark job communication issue; that is not covered here.
    
    3) If we decide to stop a Spark job, the Spark YARN Client exposes a stop method, but in many cases the stop method does not stop the YARN application.
    
      So we need to expose the YARN client's killApplication() API through the Spark client.
    
    The proposed change is to modify the Client constructor, changing the type of its first argument from ClientArguments to a function
     YarnResourceCapacity => ClientArguments
    
     where YarnResourceCapacity contains YARN's max memory and virtual cores as well as the overheads.
    
    This allows the application to adjust the memory and core settings accordingly.
    
    Existing applications that ignore the YarnResourceCapacity can simply pass:
    
    def toArgs(capacity: YarnResourceCapacity) = new ClientArguments(...)
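    As a rough Scala sketch of the shape of the change (the field names and the `resolveArgs` helper are illustrative assumptions based on this description, not the actual patch):

    ```scala
    // Illustrative sketch of the proposed constructor change; the field names
    // and helper are assumptions based on the PR description, not the patch.
    case class YarnResourceCapacity(
        maxMemoryMB: Int,       // largest memory a YARN container may grant
        maxVirtualCores: Int,   // largest number of vcores per container
        memoryOverheadMB: Int)  // overhead to subtract from usable memory

    case class ClientArguments(executorMemoryMB: Int, executorCores: Int)

    // The Client would take a function instead of fixed arguments, so the
    // application can inspect the cluster capacity before fixing its settings.
    class Client(toArgs: YarnResourceCapacity => ClientArguments) {
      def resolveArgs(capacity: YarnResourceCapacity): ClientArguments =
        toArgs(capacity)
    }

    // An existing application that ignores the capacity passes a constant function:
    val legacyArgs = (_: YarnResourceCapacity) => ClientArguments(4096, 2)
    ```

    The point of the function-typed argument is that argument construction is deferred until the capacity is known, while a constant function keeps the old behaviour.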
    
     We also define a YarnApplicationListener interface that exposes some of the information from the YarnApplicationReport. Calling
    
      Client.addYarnApplicationListener(listener)
    
      will let applications receive callbacks at the different states of the application, so they can react accordingly.
    
      For example, the onApplicationInit() callback is invoked when the AppId is available but the application has not yet started. One can use this AppId to kill the application if the run is no longer desired.
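    A rough Scala sketch of what such a listener interface might look like (only onApplicationInit is named in the proposal; the other callback names and signatures are guesses):

    ```scala
    // Hypothetical listener shape; only onApplicationInit is named in the PR
    // description, the other callbacks are illustrative guesses.
    trait YarnApplicationListener {
      def onApplicationInit(appId: String): Unit   // AppId known, app not yet started
      def onApplicationStart(appId: String, trackingUrl: String): Unit
      def onProgress(appId: String, progress: Float): Unit
      def onApplicationEnd(appId: String, finalState: String): Unit
    }

    // A trivial implementation that records events, e.g. for driving a UI.
    class RecordingListener extends YarnApplicationListener {
      val events = scala.collection.mutable.Buffer[String]()
      def onApplicationInit(appId: String) = events += s"init:$appId"
      def onApplicationStart(appId: String, url: String) = events += s"start:$url"
      def onProgress(appId: String, p: Float) = events += s"progress:$p"
      def onApplicationEnd(appId: String, state: String) = events += s"end:$state"
    }
    ```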

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark SPARK-3913

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2786.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2786
    
----
commit fd66c16a34af149e16b2af8de742044ea32dd332
Author: chesterxgchen <ch...@alpinenow.com>
Date:   2014-10-12T03:33:05Z

    [SPARK-3913]
    Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and KillApplication APIs
    

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by chesterxgchen <gi...@git.apache.org>.
Github user chesterxgchen commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-73268235
  
    Sorry to take so long on this; I went off to work on the Hadoop Kerberos authentication implementation, so I did not get back to this until now.
    I noticed that Sandy restructured the YARN module and consolidated yarn-common/yarn-alpha and yarn-stable into yarn. My original PR is no longer compatible with the new code base (1.3 or master). I have reworked this PR and might merge the two into one; I will update this soon.




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-58938127
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by chesterxgchen <gi...@git.apache.org>.
Github user chesterxgchen commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-62060857
  
    Tom,
    
     thanks for reviewing. 
    
    I am still working on the second PR, which I haven't submitted yet. The code is currently used in our application, and I am pulling it out of our code base to make a PR for it. The current code only uses Akka for the communication; I would like to add Netty support as well before I submit the pull request, which is why I haven't submitted it yet.
    
    The following are the use cases in our application, which show how the new APIs are used. I assume other applications will have similar use cases.
    
    Our application doesn't use the spark-submit command line to run Spark. We submit both Hadoop and Spark jobs directly from our servlet application (Jetty). We deploy in YARN cluster mode and invoke the Spark Client (in the yarn module) directly. The Client can't call System.exit, which would shut down the Jetty JVM.
    
    Our application submits and stops Spark jobs, monitors job progress, gets state back from the Spark jobs (for example, bad-data counters), and collects logging and exceptions. So far the communication is one-way after the job is submitted; we will move to two-way communication soon (for example, a long-running Spark context with different short Spark actions: distinct counts, sampling, filters, transformations, etc., a pipeline of actions that each need feedback, for visualization and so on).
    
    This particular pull request addresses only very limited requirements; the next PR will address the rest of the communication mentioned above.
    
    1) Get the YARN container capacities before submitting YARN applications
        In our Spark job, we can use this callback in two ways:
         A) We cap the requested memory if it is too large. For example, if the spark.executor.memory supplied by the client is larger than the YARN container max memory, we reset spark.executor.memory to the YARN max container memory minus the overhead and send a message to the application (a UI message) telling the user that we reset the memory. Alternatively, we could simply throw an exception without submitting the job.
          Users might also use the information about virtual cores for other validation.
    
         B) We can dynamically estimate the executor memory based on the data size (if you have that information from previous processing steps) and the max memory available, rather than using a fixed memory size and potentially getting killed if it is too large.
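    The capping logic in (A) can be sketched in plain Scala (the MB units, overhead handling, and function name are assumptions for illustration):

    ```scala
    // Cap a requested executor memory against the YARN container maximum,
    // reserving the overhead; returns the (possibly reduced) value plus a
    // flag so the UI can tell the user the request was adjusted.
    def capExecutorMemory(requestedMB: Int,
                          yarnMaxMB: Int,
                          overheadMB: Int): (Int, Boolean) = {
      val usableMax = yarnMaxMB - overheadMB
      if (requestedMB > usableMax) (usableMax, true)   // capped: notify the user
      else (requestedMB, false)                        // request fits as-is
    }
    ```

    The boolean lets the caller choose between the two behaviours described above: surface a UI message, or reject the submission outright.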
    
    2) Add callbacks via a listener to monitor YARN application progress
       We use the listener callbacks to show progress (limited information):
        1) We get the tracking URL from YARN, which takes us directly to the Hadoop cluster application management page.
          As soon as the YARN container is created and the job is submitted, we have the tracking URL from YARN (we need to watch out for an invalid URL); at this point you can put the URL in the UI, even though the Spark job has not started yet.
        2) We display a progress bar in the UI driven by the callback.
        On CDH5, we only got 0%, 10%, and 100%, which is not very useful, but it is still some early feedback for the customer.
        3) We get the YARN application ID when the job starts, which can be used for tracking progress or killing the app.
     (With the next PR, we are able to use the tracking URL directly to open the Spark UI page, show Spark job iterations and Spark-specific progress, etc.; currently all of the above are implemented in our application.)
    
    3) Expose the YARN kill-application API
    
        Yes, you can kill an application directly from the command line with
          yarn application -kill <applicationId>
    
        But since we need to call it from our application, we need an API to do this. In our application, if a client starts a job and then decides to stop it (running too long, changed parameters, etc.), we have to use the kill API to kill it, as the stop API doesn't stop it.
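    Under the hood, such an API would presumably delegate to Hadoop's YarnClient, roughly like this (the YarnClient calls are the standard Hadoop 2.x API; wiring this through Spark's Client is what this PR proposes, not shipped Spark behaviour):

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.api.records.ApplicationId
    import org.apache.hadoop.yarn.client.api.YarnClient

    // Sketch: kill a YARN application programmatically, as a Spark-exposed
    // killApplication API might do internally.
    def killApp(appId: ApplicationId): Unit = {
      val yarnClient = YarnClient.createYarnClient()
      yarnClient.init(new Configuration())
      yarnClient.start()
      yarnClient.killApplication(appId)  // ask the ResourceManager to kill it
      yarnClient.stop()
    }
    ```

    This is why the AppId surfaced by onApplicationInit() matters: it is the handle needed for the kill request.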
    
    Hope this gives you a better picture of why this PR is important to us.
    
    I will move faster on the next PR mentioned.
    
    Thanks
    Chester




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-61991579
  
    what pull request are you referring to here?
    
    "I will have another Pull Request to address the Yarn Application and Spark Job communication issue, that's not covered here."
    
    The things you mention are all useful, but perhaps I'm not seeing the bigger picture of how you view these being used? You added an addApplicationListener interface, but how do you see a user doing that, and how is it tied to the user?




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-96770160
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by chesterxgchen <gi...@git.apache.org>.
Github user chesterxgchen commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-62450789
  
    
    >>We don't officially support calling directly in to the yarn Client at this point.
    
    So far we haven't done this yet, as the communication is a one-way push from the server to the application, but we would like to do something like this in the next release of our application. My next PR would set up the communication channel to enable this possibility.
    
    
    >>I think I would like to see the second pr or an overall design before putting this piece in to see how things really fit together
    
    Technically, the second PR is not directly related to this PR, even though we use both changes together in our application.
    
    In a nutshell, the 2nd PR does the following:
    
    1) When submitting the Spark job to YARN, send the application host and port (the Akka URL) in the arguments to the Client class.
    
    2) In the Spark job, try to connect to the application with that host and port to establish the handshake.
    In our case, simply resolve the Akka actor via actorSelection on the given Akka URL.
    
    3) Once the connection is established, we can send messages from an actor (created from the SparkContext's actor system) to the remote actor listener that listens for them.
    
    4) The Spark job can be defined in a newly defined submain() method (a new trait), and any thrown exception can be caught and sent as an error message to the remote listener before being re-thrown. The exception is relayed to the application side to stop the calling job and be displayed in the UI.
    
    5) All logs can be redirected both to the listener and to the YARN container console (using a PrintStream redirect to override println).
    
    6) Create a new SparkJob listener that forwards the same job status shown in the Spark UI to the listener, which then re-displays the progress and job iteration tasks in the application UI along with the Spark tracking URL.
    
    7) Other application stats (such as error counters) can be sent as task messages, so the application can update its display directly.
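    Steps 1-3 can be sketched with classic Akka roughly as follows (the system and actor names, the message type, and the way the Spark job obtains an actor system are all illustrative assumptions, not the actual second PR):

    ```scala
    import akka.actor.{Actor, ActorSystem, Props}

    // Message type the two sides agree on; purely illustrative.
    case class Progress(pct: Int)

    // Application (Jetty) side: a listener actor that receives job updates.
    class AppListener extends Actor {
      def receive = {
        case Progress(p) => println(s"job progress: $p%")
      }
    }

    // Application side: start an actor system and hand its URL to the
    // Client as an argument (step 1), e.g.
    //   akka.tcp://app@host:2552/user/listener  (requires Akka remoting config)
    val system = ActorSystem("app")
    system.actorOf(Props[AppListener], "listener")

    // Spark-job side (steps 2 and 3): resolve the remote listener by URL and
    // send messages to it, e.g. from the job's own actor system:
    //   val remote = jobActorSystem.actorSelection(appListenerUrl)
    //   remote ! Progress(10)
    ```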
    
    
    I am currently working on isolating the Akka piece so that Netty can be used as the communication layer; this way, larger data sizes can be transferred. I will make it configurable so people can plug in other network protocols.
    
    Due to our own release schedule, I was not able to work as fast as I had hoped, but I hope this gives you a sense of what the overall PR is about.
    



[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-62394796
  
    Thanks for the explanation.  A couple of things.
    
    - We don't officially support calling directly in to the yarn Client at this point. 
    - I think I would like to see the second pr or an overall design before putting this piece in to see how things really fit together.




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-58954304
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3913] Spark Yarn Client API change to e...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2786#issuecomment-58986316
  
    Cool, this is a very good improvement.

