Posted to user@spark.apache.org by Vikram Kone <vi...@gmail.com> on 2016/10/13 22:13:29 UTC

How to spark-submit using python subprocess module?

I have a Python script that submits Spark jobs using the spark-submit
tool. I want to execute the command and write its output both to STDOUT
and to a logfile in real time. I'm using Python 2.7 on an Ubuntu server.

This is what I have so far in my SubmitJob.py script

#!/usr/bin/python
import subprocess

# Submit the command and stream its output to stdout and the log file
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            # readline() returns '' only at EOF; stop once the process is done
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc

if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master",
               "spark://127.0.0.1:7077", "--class", "com.spark.myapp",
               "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status

The strange thing is that when I execute the same command directly in the
shell, it works fine and produces output on screen as the program proceeds.

So it looks like something is wrong with the way I'm using subprocess.PIPE
for stdout and writing to the file.

What's the currently recommended way to use the subprocess module to write
to stdout and a log file in real time, line by line? I see a lot of
different options on the internet but I'm not sure which one is correct or
up to date.
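One of the variants I've seen suggested is basically the same loop, but
using iter() over readline and flushing both the screen and the file after
every line. A minimal sketch of what I mean, assuming the same Python 2.7
setup and a placeholder echo command in place of my real spark-submit call:

#!/usr/bin/python
import subprocess
import sys

def tee_command(cmd, log_path):
    """Run cmd, echoing every output line to stdout and to log_path."""
    with open(log_path, 'w') as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, bufsize=1)
        # iter(..., '') keeps yielding lines until readline() returns '' at EOF
        for line in iter(proc.stdout.readline, ''):
            sys.stdout.write(line)
            sys.stdout.flush()   # show the line immediately
            log.write(line)
            log.flush()          # keep the log file current as well
        proc.stdout.close()
        return proc.wait()

if __name__ == "__main__":
    status = tee_command(["echo", "hello"], "/tmp/out.log")
    print "exit status:", status

Is that materially different from what I already have, or am I missing
something?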

Is there anything specific about the way spark-submit buffers its stdout
that I need to take care of?
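The only workaround I've come across so far is forcing the child process to
line-buffer its output with stdbuf from coreutils, on the theory that
spark-submit switches to block buffering when its stdout is a pipe instead
of a terminal. A rough sketch, reusing my command from above (I'm not sure
this helps if the buffering happens inside the JVM itself):

import subprocess

cmdList = ["dse", "spark-submit", "--spark-master",
           "spark://127.0.0.1:7077", "--class", "com.spark.myapp",
           "./myapp.jar"]

# stdbuf -oL / -eL asks the child to line-buffer stdout and stderr
# even when they are connected to a pipe.
process = subprocess.Popen(["stdbuf", "-oL", "-eL"] + cmdList,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT)

Is something like that necessary here, or is there a cleaner way?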

thanks