Posted to mapreduce-user@hadoop.apache.org by Julian Bui <ju...@gmail.com> on 2013/02/07 02:13:55 UTC

Creating files through the hadoop streaming interface

Hi hadoop users,

I am trying to use the streaming interface with a Python script mapper to
create some files, but I am running into difficulties actually creating the
files on HDFS.

I have a Python script mapper with no reducers.  Currently, it doesn't even
read the input; instead it reads the environment variable for the output dir
(outdir = os.environ['mapred_output_dir']) and attempts to create an empty
file at that location.  However, that fails with the [vague] error message
appended to this email.

I am using the streaming interface because the Python examples seem much
cleaner and abstract a lot of the details away for me, but if I instead
need to use the Java bindings (and create a mapper and reducer class),
please let me know.  I'm still learning Hadoop.  As I understand it, I
should be able to create files in Hadoop, but perhaps that ability is
limited while using the streaming I/O interface.

Further question: if my mapper absolutely must send its output to stdout,
is there a way to rename the file after it has been created?

Please help.

Thanks,
-Julian

Python mapper code:

import os

# attempt to create an empty file in the job's output directory
outdir = os.environ['mapred_output_dir']
f = open(outdir + "/testfile.txt", "wb")
f.close()
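For reference, this is the stdout-based mapper pattern I understand
streaming normally expects (a hypothetical sketch, not my actual code):

#!/usr/bin/env python
import sys

# conventional streaming mapper: read records from stdin,
# emit tab-separated key/value pairs on stdout
for line in sys.stdin:
    key = line.strip()
    sys.stdout.write("%s\t%d\n" % (key, 1))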


13/02/06 17:07:55 INFO streaming.StreamJob:  map 100%  reduce 100%
13/02/06 17:07:55 INFO streaming.StreamJob: To kill this job, run:
13/02/06 17:07:55 INFO streaming.StreamJob:
/opt/hadoop/libexec/../bin/hadoop job
 -Dmapred.job.tracker=gcn-13-88.ibnet0:54311 -kill job_201302061706_0001
13/02/06 17:07:55 INFO streaming.StreamJob: Tracking URL:
http://gcn-13-88.ibnet0:50030/jobdetails.jsp?jobid=job_201302061706_0001
13/02/06 17:07:55 ERROR streaming.StreamJob: Job not successful. Error: #
of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
task_201302061706_0001_m_000000
13/02/06 17:07:55 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

Re: Creating files through the hadoop streaming interface

Posted by Simone Leo <si...@crs4.it>.
Hello,

the lack of an HDFS API is just one of the drawbacks that motivated us 
to abandon Streaming and develop Pydoop.  Unfortunately, in the blog 
post cited by Harsh J, Pydoop is just briefly mentioned because the 
author failed to build and install it.

Here is how you solve your problem in Pydoop (for details on how to run 
programs, see the docs at http://pydoop.sourceforge.net/docs):

import pydoop.pipes as pp
import pydoop.hdfs as hdfs

class Mapper(pp.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        jc = context.getJobConf()
        # embed the task attempt ID in the file name to avoid clashes
        fname = "%s/%s" % (jc.get("mapred.output.dir"), jc.get("mapred.task.id"))
        # create the file on HDFS, then reopen it in append mode for writing
        self.fo = hdfs.open(fname, "w", user="simleo")
        self.fo.close()
        self.fo = hdfs.open(fname, "a", user="simleo")

    def map(self, context):
        # write the length of each input value to our HDFS file
        l = len(context.getInputValue())
        self.fo.write("%d\n" % l)

    def close(self):
        self.fo.close()

class Reducer(pp.Reducer):
    pass

if __name__ == "__main__":
    pp.runTask(pp.Factory(Mapper, Reducer))

Note that I'm embedding the task attempt info into the file name, to 
avoid clashes due to different mappers trying to access the same file at 
the same time.
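To run it, launch the script as a Hadoop Pipes program.  From memory, the
command looks roughly like the following (a sketch only; the paths are
placeholders, and the exact options are covered in the docs linked above):

hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -program hdfs://namenode/user/simleo/myapp.py \
    -input input_dir \
    -output output_dir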

Simone

On 02/07/2013 06:18 AM, Harsh J wrote:
> The raw streaming interface has many issues of this kind. Moreover, the
> Python open(…, 'w') calls won't open files on HDFS. Since you wish to use
> Python for its various advantages, perhaps check out this detailed
> comparison guide of the various Python-based Hadoop frameworks (including
> the raw streaming we offer as part of Apache Hadoop) at
> http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
> by Uri. Many of these provide Python extensions to HDFS and the like,
> letting you do much more than plain streaming.
>
> On Thu, Feb 7, 2013 at 6:43 AM, Julian Bui <ju...@gmail.com> wrote:
>> Hi hadoop users,
>>
>> I am trying to use the streaming interface with a Python script mapper to
>> create some files, but I am running into difficulties actually creating the
>> files on HDFS.
>>
>> I have a Python script mapper with no reducers.  Currently, it doesn't even
>> read the input; instead it reads the environment variable for the output dir
>> (outdir = os.environ['mapred_output_dir']) and attempts to create an empty
>> file at that location.  However, that fails with the [vague] error message
>> appended to this email.
>>
>> I am using the streaming interface because the Python examples seem much
>> cleaner and abstract a lot of the details away for me, but if I instead
>> need to use the Java bindings (and create a mapper and reducer class),
>> please let me know.  I'm still learning Hadoop.  As I understand it, I
>> should be able to create files in Hadoop, but perhaps that ability is
>> limited while using the streaming I/O interface.
>>
>> Further question: if my mapper absolutely must send its output to stdout, is
>> there a way to rename the file after it has been created?
>>
>> Please help.
>>
>> Thanks,
>> -Julian
>>
>> Python mapper code:
>>
>> import os
>>
>> # attempt to create an empty file in the job's output directory
>> outdir = os.environ['mapred_output_dir']
>> f = open(outdir + "/testfile.txt", "wb")
>> f.close()
>>
>>
>> 13/02/06 17:07:55 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/02/06 17:07:55 INFO streaming.StreamJob: To kill this job, run:
>> 13/02/06 17:07:55 INFO streaming.StreamJob:
>> /opt/hadoop/libexec/../bin/hadoop job
>> -Dmapred.job.tracker=gcn-13-88.ibnet0:54311 -kill job_201302061706_0001
>> 13/02/06 17:07:55 INFO streaming.StreamJob: Tracking URL:
>> http://gcn-13-88.ibnet0:50030/jobdetails.jsp?jobid=job_201302061706_0001
>> 13/02/06 17:07:55 ERROR streaming.StreamJob: Job not successful. Error: # of
>> failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
>> task_201302061706_0001_m_000000
>> 13/02/06 17:07:55 INFO streaming.StreamJob: killJob...
>> Streaming Command Failed!
>>
>
>
>
> --
> Harsh J
>

-- 
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo@crs4.it
http://www.crs4.it

Re: Creating files through the hadoop streaming interface

Posted by Harsh J <ha...@cloudera.com>.
The raw streaming interface has many issues of this kind. Moreover, the
Python open(…, 'w') calls won't open files on HDFS. Since you wish to use
Python for its various advantages, perhaps check out this detailed
comparison guide of the various Python-based Hadoop frameworks (including
the raw streaming we offer as part of Apache Hadoop) at
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
by Uri. Many of these provide Python extensions to HDFS and the like,
letting you do much more than plain streaming.
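For instance, with a Python library that exposes HDFS, your original
snippet becomes something like this (a sketch using Pydoop's hdfs module;
the path and user are placeholders):

import pydoop.hdfs as hdfs

# creates the file on HDFS rather than on the task node's local disk
f = hdfs.open("/user/julian/testfile.txt", "w", user="julian")
f.close()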

On Thu, Feb 7, 2013 at 6:43 AM, Julian Bui <ju...@gmail.com> wrote:
> Hi hadoop users,
>
> I am trying to use the streaming interface with a Python script mapper to
> create some files, but I am running into difficulties actually creating the
> files on HDFS.
>
> I have a Python script mapper with no reducers.  Currently, it doesn't even
> read the input; instead it reads the environment variable for the output dir
> (outdir = os.environ['mapred_output_dir']) and attempts to create an empty
> file at that location.  However, that fails with the [vague] error message
> appended to this email.
>
> I am using the streaming interface because the Python examples seem much
> cleaner and abstract a lot of the details away for me, but if I instead
> need to use the Java bindings (and create a mapper and reducer class),
> please let me know.  I'm still learning Hadoop.  As I understand it, I
> should be able to create files in Hadoop, but perhaps that ability is
> limited while using the streaming I/O interface.
>
> Further question: if my mapper absolutely must send its output to stdout, is
> there a way to rename the file after it has been created?
>
> Please help.
>
> Thanks,
> -Julian
>
> Python mapper code:
>
> import os
>
> # attempt to create an empty file in the job's output directory
> outdir = os.environ['mapred_output_dir']
> f = open(outdir + "/testfile.txt", "wb")
> f.close()
>
>
> 13/02/06 17:07:55 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/02/06 17:07:55 INFO streaming.StreamJob: To kill this job, run:
> 13/02/06 17:07:55 INFO streaming.StreamJob:
> /opt/hadoop/libexec/../bin/hadoop job
> -Dmapred.job.tracker=gcn-13-88.ibnet0:54311 -kill job_201302061706_0001
> 13/02/06 17:07:55 INFO streaming.StreamJob: Tracking URL:
> http://gcn-13-88.ibnet0:50030/jobdetails.jsp?jobid=job_201302061706_0001
> 13/02/06 17:07:55 ERROR streaming.StreamJob: Job not successful. Error: # of
> failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
> task_201302061706_0001_m_000000
> 13/02/06 17:07:55 INFO streaming.StreamJob: killJob...
> Streaming Command Failed!
>



--
Harsh J
