Posted to common-dev@hadoop.apache.org by Steve Gao <st...@yahoo.com> on 2009/05/16 01:52:52 UTC

How to get jobconf variables in streaming's mapper/reducer?

I am using streaming with Perl, and I want to read jobconf variable values. Many tutorials say they are available in the environment, but I cannot get them.

For example, in reducer:
while (<STDIN>){
  my $part = $ENV{"mapred.task.partition"};
  print ("$part\n");
}

It turns out that $ENV{"mapred.task.partition"} is not defined.

However, I can read a variable that I define myself. For example:

 $HADOOP_HOME/bin/hadoop  \
 jar $HADOOP_HOME/hadoop-streaming.jar \
     -input file1 \
     -output myOutputDir \
     -mapper mapper \
     -reducer reducer \
     -jobconf arg=test

In reducer:

while (<STDIN>){
  my $part2 = $ENV{"arg"};
  print ("$part2\n");
}


It works.

Does anybody know why that is? How do I get jobconf variables in streaming? Thanks a lot!


Re: How to get jobconf variables in streaming's mapper/reducer?

Posted by Peter Skomoroch <pe...@gmail.com>.
It took me a while to track this down. Todd is half right (at least for
18.3): mapred.task.partition actually turns into $mapred_task_partition
(note that it is lowercase).

for example, to get the filename in the mapper of a python streaming job:

----------

import os

# Streaming exports jobconf values with the dots replaced by underscores
filename = os.environ["map_input_file"]
taskpartition = os.environ["mapred_task_partition"]

filename will have the form:

hdfs://domU-12-31-38-01-6C-F1.compute-1.internal:9000/user/root/myinputs/gzpagecounts/pagecounts-20090501-030001.gz
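The renaming Peter describes can be sketched as a small helper. This is only an illustration of the transformation (PipeMapRed replaces any character that is not a letter or digit with an underscore); the function name here is hypothetical, not part of any Hadoop API:

```python
import os

def jobconf_env_name(key):
    """Map a jobconf key to the environment variable name that
    Hadoop Streaming exports: non-alphanumeric characters become '_'."""
    return "".join(c if c.isalnum() else "_" for c in key)

# Use a fallback so the script also runs outside of a Hadoop task.
partition = os.environ.get(jobconf_env_name("mapred.task.partition"), "0")
```

With this, jobconf_env_name("mapred.task.partition") yields "mapred_task_partition", which matches the lowercase name observed above.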

See:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200904.mbox/%3C49E13557.7090504@domaintools.com%3E

and

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java

-Pete

On Fri, May 15, 2009 at 8:01 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Steve,
>
> The variables are transformed before going to the mappers.
> mapred.task.partition turns into $MAPRED_TASK_PARTITION to be more unix-y
>
> -Todd
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: How to get jobconf variables in streaming's mapper/reducer?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Steve,

The variables are transformed before going to the mappers.
mapred.task.partition turns into $MAPRED_TASK_PARTITION to be more unix-y

-Todd
