Posted to mapreduce-user@hadoop.apache.org by "Connell, Chuck" <Ch...@nuance.com> on 2012/07/11 22:48:26 UTC

Extra output files from mapper ?

I am using MapReduce streaming with Python code. It works fine for basic stdin and stdout.

But I have a mapper-only application that also emits some other output files. So in addition to stdout, the program also creates files named output1.txt and output2.txt. My code seems to be running correctly, and I suspect the proper output files are being created somewhere, but I cannot find them after the job finishes.

I tried using the -files option to create a link to the location where I want the files, but no luck. I also tried some of the -jobconf options to change the various working directories, but no luck.

Thank you.

Chuck Connell
Nuance R&D Data Team
Burlington, MA


Re: Extra output files from mapper ?

Posted by Harsh J <ha...@cloudera.com>.
You can also ship the module itself (via -files, with a symlink) and have
Python auto-import it, since "." (the task's working directory) is always
on the import path. I can imagine that helping you get Pydoop onto a
cluster without installing Pydoop (or other libs) on all nodes.
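
For example (a rough sketch only; mymodule.py, the HDFS library path and
the output directory below are made up, not something from this thread),
the streaming invocation could ship a pure-Python library file next to
the mapper:

$STREAM \
-files "hdfs://localhost/libs/mymodule.py#mymodule.py" \
-file mapper.py \
-input "hdfs://localhost/tmp/input/empty.txt" \
-mapper "python mapper.py" \
-reducer NONE \
-output /tmp/output_dir

and mapper.py can then import it as usual, because the symlink lands in
the task's working directory:

import mymodule   # hypothetical library file shipped via -files into "."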

-- 
Harsh J

RE: Extra output files from mapper ?

Posted by "Connell, Chuck" <Ch...@nuance.com>.
Thanks yet again. Since my goal is to run an existing Python program, as is, under MR, it looks like I need the os.system(copy-local-to-hdfs) technique.
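
Something like this for the wrapper, I expect (an untested sketch; the
HDFS target directory is just a placeholder and must already exist):

#!/usr/bin/env python
import os
import sys
from subprocess import call

# Eat the (empty) input stream on stdin.
for _line in sys.stdin:
    pass

# Call the real program against the symlinked local input file.
status = call(["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"])

if status == 0:
    # The subprocess has exited, so its output files are flushed and closed;
    # copy the local files up to HDFS (placeholder target directory).
    os.system("hadoop fs -put out1.txt out2.txt /tmp/output_files")
    sys.stdout.write("Success.")
else:
    sys.stdout.write("Subprocess call failed.")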

Chuck 




Re: Extra output files from mapper ?

Posted by Harsh J <ha...@cloudera.com>.
Unfortunately, Python's built-in open() does not recognize hdfs:// URIs.
It isn't a standard protocol the way HTTP is, so to say, at least not yet :)

You can instead use Pydoop's HDFS API, though:
http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#hdfs-api.
The Pydoop authors are pretty active and do releases from time to time.
See the open() method in the API and use it with the write flag
(Pythonic style).
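
Roughly like this, going from memory of the Pydoop docs (so double-check
the exact signature there):

import pydoop.hdfs as hdfs

# Open an HDFS path directly in write mode and use it like a local file.
outfile1 = hdfs.open("hdfs://localhost/tmp/out1.txt", "w")
outfile1.write("1. some line\n")
outfile1.close()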

-- 
Harsh J

RE: Extra output files from mapper ?

Posted by "Connell, Chuck" <Ch...@nuance.com>.
Thank you. I will try that. 

A related question... Shouldn't I just be able to create HDFS files directly from a Python open statement, when running within MR, like this? It does not seem to work as intended.

outfile1 = open("hdfs://localhost/tmp/out1.txt", 'w')

Chuck




RE: Extra output files from mapper ?

Posted by "Connell, Chuck" <Ch...@nuance.com>.
This works. Thank you.





Re: Extra output files from mapper ?

Posted by Harsh J <ha...@cloudera.com>.
Chuck,

Note that regular file opens from within an MR program (be it
streaming or Java) will create files on the local file system
of the node the task executed on.

Hence, at the end of your script, move them to HDFS after closing them.

Something like:

os.system("hadoop fs -put outfile1.txt /path/on/hdfs/file.txt")

(Or via a python lib API for HDFS)
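
Or, with the return code checked (same paths as above, purely as an
illustration):

import subprocess
import sys

# After closing the local output file, copy it into HDFS and check the exit code.
rc = subprocess.call(["hadoop", "fs", "-put", "outfile1.txt", "/path/on/hdfs/file.txt"])
if rc != 0:
    sys.stderr.write("hadoop fs -put failed with exit code %d\n" % rc)
    sys.exit(1)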

-- 
Harsh J

RE: Extra output files from mapper ?

Posted by "Connell, Chuck" <Ch...@nuance.com>.
Here is a test case...


The Python code (file_io.py) that I want to run as a map-only job is below. It takes one input file (not stdin) and creates two output files (not stdout).

#!/usr/bin/env python

import sys

infile = open(sys.argv[1], 'r')
outfile1 = open(sys.argv[2], 'w')
outfile2 = open(sys.argv[3], 'w')

for line in infile:
     sys.stdout.write(line)  # just to verify that infile is being read correctly
     outfile1.write("1. " + line)
     outfile2.write("2. " + line)


But since MapReduce streaming likes to use stdio, I put my job in a Python wrapper (file_io_wrap.py):

#!/usr/bin/env python

import sys
from subprocess import call

# Eat input stream on stdin
line = sys.stdin.readline()
while line:
    line = sys.stdin.readline()

# Call real program.
status = call (["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"])

# Write to stdout.
if status==0:
     sys.stdout.write("Success.")
else:
     sys.stdout.write("Subprocess call failed.")


Finally, I call the streaming job from this shell script...

#!/bin/bash

#Find latest streaming jar.
STREAM="hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar"

# Input file should explicitly use hdfs: to avoid confusion with local file
# Output dir should not exist.
# The mapper and reducer should explicitly state "python XXX.py" rather than just "XXX.py"

$STREAM  \
-files "hdfs://localhost/tmp/input/in1.txt#in1.txt" \
-files "hdfs://localhost/tmp/out1.txt#out1.txt" \
-files "hdfs://localhost/tmp/out2.txt#out2.txt" \
-file file_io_wrap.py \
-file file_io.py \
-input "hdfs://localhost/tmp/input/empty.txt" \
-mapper "python file_io_wrap.py" \
-reducer NONE \
-output /tmp/output20


The result is that the whole job runs correctly and the input file is read correctly. I can see a copy of the input file in part-0000. But the output files (out1.txt and out2.txt) are nowhere to be found. I suspect they were created somewhere, but where? And how can I control where they are created?

Thank you,
Chuck Connell
Nuance R&D Data Team
Burlington, MA


