Posted to user@spark.apache.org by Masood Krohy <ma...@intact.net> on 2016/10/24 18:34:23 UTC

Getting the IP address of Spark Driver in yarn-cluster mode

Hi everyone,
Is there a way to set the IP address/hostname that the Spark Driver is 
going to be running on when launching a program through spark-submit in 
yarn-cluster mode (PySpark 1.6.0)?
I do not see an option for this. If not, is there a way to get this IP 
address after the Spark app has started running? (through an API call at 
the beginning of the program to be used in the rest of the program). 
spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console 
(and changes on every run due to yarn cluster mode) and I am wondering if 
this can be accessed within the program. It does not seem to me that a 
YARN node label can be used to tie the Spark Driver/AM to a node, while 
allowing the Executors to run on all the nodes.
I am running a parameter server alongside the Spark Driver that needs to be 
contacted during program execution; I need the Driver’s IP so that the 
executors can call back to this server. I need to stick to yarn-cluster 
mode.
Thanks for any hints in advance.
Masood
PS: the closest code I was able to write is this, which does not output 
what I need:
print sc.statusTracker().getJobInfo(sc.statusTracker().getActiveJobsIds()[0])
# output in YARN stdout log:
# SparkJobInfo(jobId=4, stageIds=JavaObject id=o101, status='SUCCEEDED')
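
A minimal sketch of a related alternative: in yarn-cluster mode the Driver 
runs inside the ApplicationMaster container, so the Driver process can read 
its own address and hand it to the executors (this assumes sc is the active 
SparkContext, and that the standard property spark.driver.host is populated 
at runtime in this setup):

import socket
# The Driver runs in the AM container in yarn-cluster mode, so its own
# hostname is the address the parameter server can be reached on.
driver_host = socket.gethostname()
# Spark also records the driver host under the runtime property
# "spark.driver.host"; an empty default is used in case it is unset.
conf_host = sc.getConf().get('spark.driver.host', '')
print 'driver host: %s  spark.driver.host: %s' % (driver_host, conf_host)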

------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation

Re: Getting the IP address of Spark Driver in yarn-cluster mode

Posted by Masood Krohy <ma...@intact.net>.
Thanks Steve.

Here is the Python pseudo code that got it working for me:

  import time
  import urllib2

  # Map of worker hostnames to their IPs (placeholders).
  nodes = {'worker1_hostname': 'worker1_ip', ... }
  YARN_app_queue = 'default'
  YARN_address = 'http://YARN_IP:8088'

  # We allow 3,600 sec from start of the app up to this point.
  YARN_app_startedTimeBegin = str(int(time.time() - 3600))

  # Ask the ResourceManager REST API for the single most recent running
  # Spark app in our queue; its AM host is the node the Spark Driver runs on.
  requestedURL = (YARN_address +
                  '/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK&limit=1' +
                  '&queue=' + YARN_app_queue +
                  '&startedTimeBegin=' + YARN_app_startedTimeBegin)
  print 'Sent request to YARN: ' + requestedURL
  response = urllib2.urlopen(requestedURL)
  html = response.read()

  # Pull the value of "amHostHttpAddress" out of the JSON response; this
  # relies on all worker hostnames having the same length as 'worker1_hostname'.
  amHost_start = html.find('amHostHttpAddress') + len('amHostHttpAddress":"')
  amHost_length = len('worker1_hostname')
  amHost = html[amHost_start : amHost_start + amHost_length]
  print 'amHostHttpAddress is: ' + amHost

  try:
      # Connect to the parameter server running on the AM/Driver node (elided).
      self.websock = ...
      print 'Connected to server running on %s' % nodes[amHost]
  except:
      print 'Could not connect to server on %s' % nodes[amHost]
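
(Note: the string matching on the JSON response only works because all 
worker hostnames here happen to have the same length; parsing the response 
with the json module and reading the amHostHttpAddress field directly, as in 
the sketch at the end of this thread, avoids that assumption.)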



------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation




From:    Steve Loughran <st...@hortonworks.com>
To:      Masood Krohy <ma...@intact.net>
Cc:      "user@spark.apache.org" <us...@spark.apache.org>
Date:    2016-10-24 17:09
Subject: Re: Getting the IP address of Spark Driver in yarn-cluster mode




On 24 Oct 2016, at 19:34, Masood Krohy <ma...@intact.net> wrote:

Hi everyone, 
Is there a way to set the IP address/hostname that the Spark Driver is 
going to be running on when launching a program through spark-submit in 
yarn-cluster mode (PySpark 1.6.0)? 
I do not see an option for this. If not, is there a way to get this IP 
address after the Spark app has started running? (through an API call at 
the beginning of the program to be used in the rest of the program). 
spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console 
(and changes on every run due to yarn cluster mode) and I am wondering if 
this can be accessed within the program. It does not seem to me that a 
YARN node label can be used to tie the Spark Driver/AM to a node, while 
allowing the Executors to run on all the nodes. 



you can grab it off the YARN API itself; there's a REST view as well as a 
fussier RPC level. That is, assuming you want the web view, which is what 
is registered. 

If you know the application ID, you can also construct a URL through the 
YARN proxy; any attempt to talk direct to the AM is going to get 302'd 
back there anyway so any kerberos credentials can be verified.




Re: Getting the IP address of Spark Driver in yarn-cluster mode

Posted by Steve Loughran <st...@hortonworks.com>.
On 24 Oct 2016, at 19:34, Masood Krohy <ma...@intact.net> wrote:

Hi everyone,

Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark 1.6.0)?

I do not see an option for this. If not, is there a way to get this IP address after the Spark app has started running? (through an API call at the beginning of the program to be used in the rest of the program). spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console (and changes on every run due to yarn cluster mode) and I am wondering if this can be accessed within the program. It does not seem to me that a YARN node label can be used to tie the Spark Driver/AM to a node, while allowing the Executors to run on all the nodes.



you can grab it off the YARN API itself; there's a REST view as well as a fussier RPC level. That is, assuming you want the web view, which is what is registered.

If you know the application ID, you can also construct a URL through the YARN proxy; any attempt to talk direct to the AM is going to get 302'd back there anyway so any kerberos credentials can be verified.
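
A minimal sketch of the REST view described above, assuming the 
ResourceManager web address is known (YARN_IP below is a placeholder) and 
taking the application id from the running SparkContext; the field name 
amHostHttpAddress follows the standard RM /ws/v1/cluster/apps endpoint:

import json
import urllib2

YARN_address = 'http://YARN_IP:8088'  # placeholder: ResourceManager web address
app_id = sc.applicationId             # id of the running Spark application

# Ask the RM REST API for this application's report; amHostHttpAddress is the
# host:port of the ApplicationMaster, i.e. the node the Spark Driver runs on.
response = urllib2.urlopen('%s/ws/v1/cluster/apps/%s' % (YARN_address, app_id))
app_info = json.loads(response.read())['app']
print 'AM host: ' + app_info['amHostHttpAddress']

# A URL through the YARN proxy can be built the same way; direct requests to
# the AM get redirected back through this proxy.
proxy_url = '%s/proxy/%s/' % (YARN_address, app_id)
print 'Proxy URL: ' + proxy_url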