Posted to dev@spark.apache.org by Mingyu Kim <mk...@palantir.com> on 2016/06/14 00:30:20 UTC

Utilizing YARN AM RPC port field

Hi all,

 

YARN provides a way for an ApplicationMaster to register an RPC port so that a client outside the YARN cluster can reach the application for RPCs, but Spark’s YARN AMs simply register a dummy port number of 0. (See https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L74) This field is useful for long-running Spark applications where jobs are submitted via some form of RPC to an already-started Spark context running in YARN cluster mode. Spark Job Server (https://github.com/spark-jobserver/spark-jobserver) and Livy (https://github.com/cloudera/hue/tree/master/apps/spark/java) are good open-source examples of this use case. The current workaround is to have the Spark AM call back to a configured URL with the port number of the RPC server, so that the client knows where to reach the AM.

 

Utilizing the YARN AM RPC port allows the port number to be reported in a secure way (i.e. with the AM RPC port field and a Kerberized YARN cluster, you don’t need to re-invent a way to verify the authenticity of the reported port number), and it removes the callback from the YARN cluster to the client, which means you can operate YARN in a low-trust environment and run client applications behind a firewall.

 

Here are a couple of proposals for utilizing the YARN AM RPC port. (Note that you cannot simply pre-configure the port number and pass it to the Spark AM via configuration, because of potential port conflicts on the YARN node.)

 

- Start up an empty Jetty server during Spark AM initialization, set its port number when registering the AM with the RM, and pass a reference to the Jetty server into the Spark application (e.g. through SparkContext) so the application can dynamically add servlets/resources to it. (Sketched below.)

- Have an optional static method in the main class (e.g. initializeRpcPort()) which optionally sets up an RPC server and returns the RPC port. The Spark AM can call this method, register the port number with the RM, and then continue on to invoke the main method. I don’t see this making a good API, though. (Also sketched below.)
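To make these concrete: a minimal sketch of the first proposal, assuming the Jetty 9 API that Spark already bundles (amClient and uiAddress stand in for the corresponding fields in YarnRMClient; the wiring is illustrative, not the actual implementation):

    import org.eclipse.jetty.server.{Server, ServerConnector}
    import org.eclipse.jetty.servlet.ServletContextHandler

    // Bind to port 0 so the OS picks a free ephemeral port; this sidesteps the
    // port-conflict problem of pre-configuring a fixed port on the YARN node.
    val server = new Server(0)
    server.setHandler(new ServletContextHandler())
    server.start()

    // Read back the port Jetty actually bound to...
    val rpcPort = server.getConnectors()(0).asInstanceOf[ServerConnector].getLocalPort

    // ...and report it when registering the AM, instead of the dummy 0.
    amClient.registerApplicationMaster(Utils.localHostName(), rpcPort, uiAddress)

And a sketch of the second proposal, where the AM reflectively probes the user’s main class for the (hypothetical) initializeRpcPort() method before invoking main():

    import scala.util.Try

    // If the user’s main class defines `def initializeRpcPort(): Int`, call it
    // and register the returned port; otherwise fall back to the dummy 0.
    val rpcPort: Int =
      Try(mainClass.getMethod("initializeRpcPort").invoke(null).asInstanceOf[Int])
        .getOrElse(0)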

 

I’m curious to hear what other people think. Would this be useful for anyone? What do you think about the proposals? Please feel free to suggest other ideas. Thanks!

 

Mingyu


Re: Utilizing YARN AM RPC port field

Posted by Mingyu Kim <mk...@palantir.com>.
FYI, I just filed https://issues.apache.org/jira/browse/SPARK-15974.

 

Mingyu

 


Re: Utilizing YARN AM RPC port field

Posted by Mingyu Kim <mk...@palantir.com>.
Thanks for the pointers, Steve!

 

The first option sounds like the most lightweight and non-disruptive of them. So, we could add a configuration flag that enables socket initialization; when it is enabled, the Spark AM will create a ServerSocket, register its port, and set it on SparkContext.
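A minimal sketch of that flow (the config key is hypothetical, and the RM registration step is elided):

    import java.net.ServerSocket

    // Hypothetical flag guarding the new behavior.
    if (sparkConf.getBoolean("spark.yarn.am.rpcSocket.enabled", false)) {
      // Port 0 lets the OS pick a free port, so nothing needs to be pre-configured.
      val socket = new ServerSocket(0)
      val rpcPort = socket.getLocalPort
      // Register rpcPort with the RM instead of 0, then expose the socket (or
      // just the port) to the application through SparkContext.
    }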

 

If there are no objections, I can file a bug and find time to tackle it myself. 

 

Mingyu

 



Re: Utilizing YARN AM RPC port field

Posted by Steve Loughran <st...@hortonworks.com>.

It's a recurrent irritation of mine that you can't ever change the HTTP/RPC ports of a YARN AM after launch; it creates a complex startup state where you can't register until your IPC endpoints are up.

Tactics:

- Create a socket on an empty port, register it, and hand the port off to the RPC setup code as the chosen port. Ideally, support a range to scan, so that systems which only open a specific range of ports (e.g. 6500-6800) have only those ports scanned. We've done this in other projects.
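A sketch of such a range scan (6500-6800 is just the example range above; bindInRange is a hypothetical helper):

    import java.net.ServerSocket
    import scala.util.Try

    // Try each port in the range in turn; the first successful bind wins.
    def bindInRange(lo: Int, hi: Int): Option[ServerSocket] =
      (lo to hi).view.flatMap(p => Try(new ServerSocket(p)).toOption).headOption

    val socket = bindInRange(6500, 6800)
      .getOrElse(throw new IllegalStateException("no free port in 6500-6800"))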

- Serve up the port binding info via a REST API off the AM web UI; clients hit the RM proxy (which allows HEAD/GET only), ask for the port, and work from there. Nonstandard, but it could be extended with other binding information (TTL for port caching, ...).
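For instance, a hypothetical servlet on the AM web UI could answer a GET with the port as JSON (GET-only by design, so it passes through the RM proxy's restriction):

    import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

    // Illustrative only: returns the registered RPC port to any client that asks.
    class RpcPortServlet(rpcPort: Int) extends HttpServlet {
      override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
        resp.setContentType("application/json")
        resp.getWriter.write(s"""{"rpcPort": $rpcPort}""")
      }
    }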

- Use the YARN-913 ZK-based registry to register/look up bindings. This is used in various YARN apps to register service endpoints (RPC, REST); there's ongoing work on DNS support, which would let you query a specific DNS server for the endpoints. It works really well with containerized deployments where the apps come up with per-container IP addresses and fixed ports.
Although you couldn't get the latter into the spark-yarn code itself (it needs Hadoop 2.6+), you can plug in support via the extension point implemented in SPARK-11314. I've actually thought of doing that for a while... just been too busy.

- Just fix the bit of the YARN API that forces you to know your endpoints in advance. People will appreciate it, though it will take a while to trickle downstream.