Posted to commits@mesos.apache.org by be...@apache.org on 2011/06/05 05:24:01 UTC

svn commit: r1131568 - in /incubator/mesos/trunk/frameworks/torque: ./ README.txt start_pbs_mom.py start_pbs_mom.sh torquesched.py torquesched.sh

Author: benh
Date: Sun Jun  5 03:24:01 2011
New Revision: 1131568

URL: http://svn.apache.org/viewvc?rev=1131568&view=rev
Log:
torque

Added:
    incubator/mesos/trunk/frameworks/torque/
    incubator/mesos/trunk/frameworks/torque/README.txt
    incubator/mesos/trunk/frameworks/torque/start_pbs_mom.py   (with props)
    incubator/mesos/trunk/frameworks/torque/start_pbs_mom.sh   (with props)
    incubator/mesos/trunk/frameworks/torque/torquesched.py
    incubator/mesos/trunk/frameworks/torque/torquesched.sh   (with props)

Added: incubator/mesos/trunk/frameworks/torque/README.txt
URL: http://svn.apache.org/viewvc/incubator/mesos/trunk/frameworks/torque/README.txt?rev=1131568&view=auto
==============================================================================
--- incubator/mesos/trunk/frameworks/torque/README.txt (added)
+++ incubator/mesos/trunk/frameworks/torque/README.txt Sun Jun  5 03:24:01 2011
@@ -0,0 +1,50 @@
+Nexus TORQUE framework readme
+--------------------------------------------
+This framework is a wrapper around the TORQUE cluster resource manager, which integrates with a cluster scheduler such as pbs_sched, maui, or moab.
+
+Installing TORQUE:
+------------------
+
+option #1) install from source
+
+option #2) sudo apt-get install torque-server torque-mom (also potentially relevant: torque-dev, torque-client)
+
+
+==Structure Overview of the Framework==
+---------------------------------------
+
+==FRAMEWORK EXECUTOR==
+The nexus executor for this framework will run pbs_mom and tell it to treat the framework scheduler's node as its pbs_server host.
+
+You can interact with pbs_mom using `momctl`.
+
+==FRAMEWORK SCHEDULER==
+The torque FW scheduler is responsible for managing the pbs_server daemon. This includes starting it if necessary and adding or removing nodes with the `qmgr` command. As it accepts resource offers it grows dynamically (but never beyond its 'safe allocation'), and it shrinks by releasing resources.
+
+For example, the FW scheduler can shrink nexus tasks (i.e. torque compute nodes) or kill them (i.e. remove compute nodes) when the scheduler queue becomes emptier (or entirely empty), freeing resources for another active framework in the cluster while maintaining some minimum number of resources. It can also relaunch and grow tasks to account for increased queue lengths.
+
+The minimum number of resources torque should hang on to is configurable:
+MIN_TORQUE_COMPUTE_NODE = minimum resource vector per compute node
+MIN_TORQUE_CLUSTER_SIZE = minimum number of compute nodes to keep around
+
+===Torque Scheduler===
+The framework can use whichever torque-compatible scheduler is desired. By default it uses the default torque FIFO scheduler (pbs_sched).
+
+Permissions:
+------------
+
+****Currently****
+As of right now, to run this framework, the node that the framework scheduler runs on needs to have pbs_server installed.
+
+The framework scheduler will launch pbs_server for you. I *think* you need to be root to run the pbs_server daemon (I haven't figured out how to do it otherwise).
+
+****Future Alternative****
+An alternate way to structure this framework is to require that pbs_server is already running on some server, with the intention that it be fully managed by the torque nexus framework. I think the management commands can be set up to work for non-root users, which the framework could then run as. The nexus torque framework scheduler would then take the address of the pbs_server as a parameter and assume it has permission to add and remove nodes from the server.
+
+
+TODO:
+-----
+- explore and install a torque UI
+ -- maybe PBSWeb  (http://www.clusterresources.com/pipermail/torqueusers/2004-March/000411.html) or apt-get install torque-gui (which has gui clients)
+- figure out permissions better (this page http://www.clusterresources.com/torquedocs21/commands/qrun.shtml mentions "PBS Operation or Manager privilege.")
+- might want to add mpi to this framework too (so that people can submit mpi jobs to torque framework)
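
Note on the README's grow/shrink description: the policy is only described in prose in this commit, so the following is a minimal sketch of one way it could be expressed, not code from the framework. The function name, its parameters, and the `jobs_per_node` ratio are all hypothetical, and it is written for Python 3 rather than the Python 2 used by the committed scripts.

```python
def target_node_count(queued_jobs, min_cluster_size, max_safe_nodes,
                      jobs_per_node=1):
    """Return how many compute nodes the framework would like to hold.

    All names here are illustrative assumptions; the commit does not
    define this function.
    """
    # Ceiling division: nodes needed to work through the current queue.
    wanted = -(-queued_jobs // jobs_per_node)
    # Never exceed the safe allocation, and never drop below the minimum.
    return max(min_cluster_size, min(wanted, max_safe_nodes))


if __name__ == "__main__":
    for queued in (0, 3, 40):
        print(queued, target_node_count(queued, 2, 5))
```

The minimum is applied last, so a configuration whose minimum exceeds the safe allocation would still keep the minimum number of nodes; a real implementation would have to decide which limit wins.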

Added: incubator/mesos/trunk/frameworks/torque/start_pbs_mom.py
URL: http://svn.apache.org/viewvc/incubator/mesos/trunk/frameworks/torque/start_pbs_mom.py?rev=1131568&view=auto
==============================================================================
--- incubator/mesos/trunk/frameworks/torque/start_pbs_mom.py (added)
+++ incubator/mesos/trunk/frameworks/torque/start_pbs_mom.py Sun Jun  5 03:24:01 2011
@@ -0,0 +1,92 @@
+#!/usr/bin/env python
+import nexus
+import sys
+import time
+import os
+import atexit
+
+from subprocess import *
+
+PBS_MOM_EXE = "/usr/local/sbin/pbs_mom"
+MOMCTL_EXE = "/usr/local/sbin/momctl"
+TORQUE_CFG = "/var/spool/torque"
+PBS_MOM_CONF_FILE = TORQUE_CFG + "/mom_priv/config"
+
+def cleanup():
+  # Shut down the pbs_mom daemon on this node (momctl -s).
+  try:
+    print "cleanup"
+    Popen("momctl -s", shell=True).wait()
+  except Exception, e:
+    # Best effort only: report the failure and carry on.
+    print e
+
+class MyExecutor(nexus.Executor):
+  def __init__(self):
+    nexus.Executor.__init__(self)
+
+  def init(self, arg):
+    print "in torque executor init"
+    self.pbs_server_ip = arg.data
+
+  def startTask(self, task):
+    print "Running task %d" % task.taskId
+    
+    #TODO: check to see if pbs_mom is installed, if not install it
+    #   simply requires getting and running the installer that gets
+    #   built when you make torque, like torque-package-mom-linux-ia64.sh
+    #   or maybe apt-get install pbs-mom
+
+    #if Popen("which pbs_mom", shell=True).wait() != 0:
+    #print "pbs_mom command not found, trying to apt-get install torque-mom"
+    #try:
+    #  print Popen("apt-get -y install torque-mom",shell=True).wait()
+    #except:
+    #  print "error installing pbs_mom, please install it on compute nodes"
+    #  None
+
+    print "checking pbs_mom conf file " + PBS_MOM_CONF_FILE + " is it a file? "\
+           + str(os.path.isfile(PBS_MOM_CONF_FILE))
+    #TODO: check to see if file exists, if it does, check to see that it is correct
+    #      else create it (right now we overwrite it no matter what)
+    if not os.path.isfile(PBS_MOM_CONF_FILE):
+      print PBS_MOM_CONF_FILE + " file not found, about to create it"
+    else:
+      print "about to overwrite file " + PBS_MOM_CONF_FILE + " to update "\
+            "pbs_server on this node"
+
+    print "adding line to conf file: $pbsserver " + self.pbs_server_ip + "\n"
+    FILE = open(PBS_MOM_CONF_FILE,'w')
+    FILE.write("$pbsserver " + self.pbs_server_ip + "\n")
+    FILE.write("$logevent 255 #bitmap of which events to log\n")
+
+    FILE.close()
+   
+    print "overwrote pbs_mom config file, its contents now are:"
+    FILE = open(PBS_MOM_CONF_FILE,'r')
+    for line in FILE: print line.rstrip("\n")  # each line already ends in a newline
+    FILE.close()
+
+    #try killing pbs_mom in case we changed the config
+    if Popen("momctl -s",shell=True).wait() != 0:
+      print "tried to kill pbs_mom, but it was not running"
+
+    #run pbs_mom
+    print "running pbs_mom on compute node"
+    Popen("pbs_mom", shell=True)
+
+  def killTask(self, tid):
+    sys.exit(1)
+
+  def shutdown(self):
+    print "shutdown"
+    cleanup()
+
+  def error(self, code, message):
+    print "Error: %s" % message
+
+if __name__ == "__main__":
+  print "Starting pbs_mom executor"
+  atexit.register(cleanup)
+  executor = MyExecutor()
+  executor.run()

Propchange: incubator/mesos/trunk/frameworks/torque/start_pbs_mom.py
------------------------------------------------------------------------------
    svn:executable = *
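
Note on the TODO in startTask about checking the config file before overwriting it: one possible approach, sketched here under assumptions, is to build the expected file contents in a pure function and compare them with what is on disk. The helper names below are hypothetical and do not appear in the commit; the sketch is Python 3 and omits the trailing comment the executor writes after `$logevent`.

```python
def render_mom_config(pbs_server, log_event_mask=255):
    """Build the text of mom_priv/config for a given pbs_server host."""
    if not pbs_server or any(c.isspace() for c in pbs_server):
        raise ValueError("pbs_server must be a single host name or address")
    return "$pbsserver %s\n$logevent %d\n" % (pbs_server, log_event_mask)


def config_is_current(existing_text, pbs_server, log_event_mask=255):
    """True when the existing file already matches what would be written."""
    return existing_text == render_mom_config(pbs_server, log_event_mask)


if __name__ == "__main__":
    print(render_mom_config("10.0.0.1"), end="")
```

With helpers like these the executor could skip both the rewrite and the pbs_mom restart when `config_is_current` returns True.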

Added: incubator/mesos/trunk/frameworks/torque/start_pbs_mom.sh
URL: http://svn.apache.org/viewvc/incubator/mesos/trunk/frameworks/torque/start_pbs_mom.sh?rev=1131568&view=auto
==============================================================================
--- incubator/mesos/trunk/frameworks/torque/start_pbs_mom.sh (added)
+++ incubator/mesos/trunk/frameworks/torque/start_pbs_mom.sh Sun Jun  5 03:24:01 2011
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+PYTHON=python
+
+if [ "`uname`" == "SunOS" ]; then
+  PYTHON=python2.6
+fi
+
+export PYTHONPATH="$(dirname "$0")/../../src/swig/python:$PYTHONPATH"
+
+$PYTHON "$(dirname "$0")/start_pbs_mom.py" "$@"

Propchange: incubator/mesos/trunk/frameworks/torque/start_pbs_mom.sh
------------------------------------------------------------------------------
    svn:executable = *

Added: incubator/mesos/trunk/frameworks/torque/torquesched.py
URL: http://svn.apache.org/viewvc/incubator/mesos/trunk/frameworks/torque/torquesched.py?rev=1131568&view=auto
==============================================================================
--- incubator/mesos/trunk/frameworks/torque/torquesched.py (added)
+++ incubator/mesos/trunk/frameworks/torque/torquesched.py Sun Jun  5 03:24:01 2011
@@ -0,0 +1,166 @@
+#!/usr/bin/env python
+
+import nexus
+import os
+import sys
+import time
+import httplib
+import Queue
+
+from optparse import OptionParser
+from subprocess import *
+from socket import gethostname
+
+SAFE_ALLOCATION = {"cpus":5,"mem":134217728} #just set statically for now, 128MB
+MIN_SLOT_SIZE = {"cpus":1,"mem":1073741824} #1GB; cpus is an int, as in SAFE_ALLOCATION
+
+TORQUE_DL_URL = "http://www.clusterresources.com/downloads/torque/torque-2.4.6.tar.gz"
+TORQUE_UNCOMPRESSED_DIR = "torque-2.4.6"
+
+TORQUE_CFG = "/var/spool/torque"
+PBS_SERVER_CONF_FILE = TORQUE_CFG + "/server_priv/nodes" #hopefully can use qmgr and not edit this
+
+TORQUE_INSTALL_DIR = "/usr/local/sbin"
+PBS_SERVER_EXE = TORQUE_INSTALL_DIR + "/pbs_server"
+SCHEDULER_EXE = TORQUE_INSTALL_DIR + "/pbs_sched"
+QMGR_EXE = "/usr/local/bin/qmgr"
+
+
+class MyScheduler(nexus.Scheduler):
+  def __init__(self, ip):
+    nexus.Scheduler.__init__(self)
+    self.id = 0
+    self.ip = ip 
+    self.servers = {}
+    self.overloaded = False
+  
+
+  def getExecutorInfo(self, driver):
+    execPath = os.path.join(os.getcwd(), "start_pbs_mom.sh")
+    initArg = self.ip # tell executor which node the pbs_server is running on
+    print "in getExecutorInfo, setting execPath = " + execPath + " and initArg = " + initArg
+    return nexus.ExecutorInfo(execPath, initArg)
+
+  def registered(self, driver, fid):
+    print "Nexus torque+pbs scheduler registered as framework #%s" % fid
+
+   
+  #DESIGN PLAN:
+  #
+  #for each slot in the offer
+  #  if we are at safe allocation, don't accept any more of these slot offers
+  #  else if the slot is not >= min-slot size, reject
+  #  else if we have already set up some resources on this machine and started
+  #          pbs_mom on it, don't accept more resources.
+  #            TODO: Eventually, set up max and min resources per compute node and
+  #            accept resources on an existing compute node and just update
+  #            the config settings for the torque daemon on that node to use
+  #            more resources on it.
+  #  else accept the offer and launch pbs_mom on the node
+  def resourceOffer(self, driver, oid, slave_offers):
+    print "Got slot offer %d" % oid
+    tasks = []
+    for offer in slave_offers:
+      # for a first step, just keep track of whether we have started a pbs_mom
+      # on this node before or not, if not then accept the slot and launch one
+      if offer.host not in self.servers.values():
+        #TODO: check to see if slot is big enough 
+        print "setting up params"
+        #print "params = cpus: "# + MIN_SLOT_SIZE["cpus"]# +  ", mem: " + MIN_SLOT_SIZE["mem"]
+        params = {"cpus": "%d" % 1, "mem": "%d" % 1073741824}
+        td = nexus.TaskDescription(
+            self.id, offer.slaveId, "task %d" % self.id, params, "")
+        tasks.append(td)
+        self.servers[self.id] = offer.host
+        regComputeNode(offer.host)
+        self.id += 1
+        print "self.id now set to " + str(self.id)
+      else:
+        print "Rejecting slot because we've already launched pbs_mom on that node"
+    driver.replyToOffer(oid, tasks, {"timeout": "-1"})
+
+#  def statusUpdate(self, status):
+#    if status.taskId in self.servers.keys():
+#      if status.state == nexus.TASK_FINISHED:
+#        del self.servers[status.taskId]
+#        reconfigured = True
+#    if reconfigured:
+#      self.reviveOffers()
+
+def regComputeNode(new_node):
+  print "in reg"
+  print "chdir"
+  os.chdir(TORQUE_CFG)
+  # first grep the server conf file to be sure this node isn't already
+  # registered with the server
+  f = open(PBS_SERVER_CONF_FILE, 'r')
+  registered = [line for line in f if line.find(new_node) != -1]
+  f.close()  # close the file before any early return
+  if registered:
+    print "ERROR! Trying to register compute node which "\
+          "is already registered, aborting node register"
+    return
+
+  #add node to server
+  print("adding a node to pbs_server using: qmgr -c create node " + new_node)
+  Popen([QMGR_EXE, "-c", "create node " + new_node]).wait()
+
+#    #add line to server_priv/nodes file for new_node
+#    f = open(PBS_SERVER_CONF_FILE,'a')
+#    f.write(new_node+"\n")
+#    f.close
+
+def unregComputeNode(node_name):
+  #remove node from server
+  print("removing a node from pbs_server using: qmgr -c delete node " + node_name)
+  Popen([QMGR_EXE, "-c", "delete node " + node_name]).wait()
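+
+def getFrameworkName(self, driver):
+  return "Nexus torque Framework"
+
+# getFrameworkName takes (self, driver), so it looks like it was meant to be a
+# MyScheduler method; bind it to the class so the override is actually used.
+MyScheduler.getFrameworkName = getFrameworkName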
+
+if __name__ == "__main__":
+  parser = OptionParser(usage = "Usage: %prog nexus_master")
+
+  (options,args) = parser.parse_args()
+  if len(args) < 1:
+    print >> sys.stderr, "At least one parameter required."
+    print >> sys.stderr, "Use --help to show usage."
+    sys.exit(2)
+
+  print "setting up pbs_server"
+ 
+  #if torque isn't installed, install our own
+  #print "Torque doesn't seem to be installed. Downloading and installing it"
+  #Popen("curl " + TORQUE_DL_URL + " > torque.tgz)
+  #Popen("tar xzf torque.tgz")
+  #os.chdir(TORQUE_UNCOMPRESSED_DIR)
+  #Popen(os.getcwd()+"/configure --prefix=/usr/local")
+  #Popen("make")
+  #Popen("make install")
+  #Popen(os.getcwd()+"torque.setup root")
+  
+  try:
+    pbs_server = Popen(PBS_SERVER_EXE)
+  except OSError,e:
+    print >>sys.stderr, "Error starting pbs server"
+    print >>sys.stderr, e
+    sys.exit(2)
+
+  try:
+    pbs_scheduler = Popen(SCHEDULER_EXE)
+  except OSError,e:
+    print >>sys.stderr, "Error starting scheduler"
+    print >>sys.stderr, e
+    sys.exit(2)
+
+  ip = Popen("hostname -i", shell=True, stdout=PIPE).stdout.readline().rstrip()
+  print "Remembering IP address of scheduler (" + ip + ")"
+
+  print "Connecting to nexus master %s" % args[0]
+
+  sched = MyScheduler(ip)
+  nexus.NexusSchedulerDriver(sched, args[0]).run()
+
+  #WARNING: this is new in python 2.6
+  pbs_server.terminate() #don't leave pbs_server running, not sure if this works
+
+  print "Finished!"
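
Note on regComputeNode: the duplicate check uses a substring search, so a host named node1 would be reported as registered whenever node10 is in the nodes file. A minimal sketch of an exact-match check follows, together with a qmgr invocation built as an argument list so no shell quoting is needed. The helper names are hypothetical, and the sketch assumes the first whitespace-separated field of each line in server_priv/nodes is the node name and that lines starting with `#` can be skipped. It is Python 3, unlike the committed Python 2 script.

```python
QMGR_EXE = "/usr/local/bin/qmgr"  # same path the commit uses


def registered_nodes(nodes_file_text):
    """Return the set of node names found in the nodes file contents."""
    names = set()
    for line in nodes_file_text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that look like comments (assumption).
        if fields and not fields[0].startswith("#"):
            names.add(fields[0])
    return names


def qmgr_command(action, node):
    """Build the argv list for 'qmgr -c "<action> node <node>"'."""
    if action not in ("create", "delete"):
        raise ValueError("unsupported action: %r" % action)
    return [QMGR_EXE, "-c", "%s node %s" % (action, node)]


if __name__ == "__main__":
    print(qmgr_command("create", "node1"))
```

The list returned by `qmgr_command` can be handed straight to `subprocess.Popen` without `shell=True`.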

Added: incubator/mesos/trunk/frameworks/torque/torquesched.sh
URL: http://svn.apache.org/viewvc/incubator/mesos/trunk/frameworks/torque/torquesched.sh?rev=1131568&view=auto
==============================================================================
--- incubator/mesos/trunk/frameworks/torque/torquesched.sh (added)
+++ incubator/mesos/trunk/frameworks/torque/torquesched.sh Sun Jun  5 03:24:01 2011
@@ -0,0 +1,6 @@
+#!/bin/bash
+PYTHON=python
+if [ "`uname`" == "SunOS" ]; then
+  PYTHON=python2.6
+fi
+PYTHONPATH="$(dirname "$0")/../../src/swig/python" $PYTHON ./torquesched.py "$@"

Propchange: incubator/mesos/trunk/frameworks/torque/torquesched.sh
------------------------------------------------------------------------------
    svn:executable = *