Posted to common-user@hadoop.apache.org by Robert Fynes <fy...@gmail.com> on 2013/03/27 17:35:58 UTC

Streaming mapper byte offsets not being generated

I'm running a streaming Hadoop (1.0.4) job in local mode. However, the byte
offsets of the lines in my input file are not being generated as keys for
the mapper output, as I would expect. The command:

$HADOOP_INSTALL/bin/hadoop \
  jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-$HADOOP_VERSION.jar \
  -D stream.map.input.ignoreKey=false \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -file ./mapper.py \
  -file ./reducer.py \
  -mapper ./mapper.py \
  -reducer ./reducer.py \
  -input $INPUT_DIR \
  -output $OUTPUT_DIR \
  -cmdenv REGEX=$REGEX

My understanding is that TextInputFormat is the default, so I also tried
the above command without the -inputformat option. I've also tried removing
the -D option shown above, but I'm told it is required to get the byte
offset as the key when using the streaming API.

For what it's worth, I'm just experimenting with Hadoop for a student
project. At the moment, the mapper is a very simple Python grep over a file
in HDFS, matching each line against the supplied regex:

import os
import re
import sys

# REGEX is supplied via -cmdenv; emit every input line that matches it
pattern = re.compile(os.environ['REGEX'])
for line in sys.stdin:
    if pattern.search(line):
        sys.stdout.write(line)
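
For local testing outside Hadoop, I run it along these lines (sample.txt
and the pattern 'foo' are just illustrative):

REGEX='foo' ./mapper.py < sample.txt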

Right now, though, the only input the reducer receives is the matching
lines themselves. I'm expecting tab-delimited key/value pairs, where the
key is the byte offset and the value is the line that matched the regex.
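
That is, for a match at the very start of the file and another partway
through it, I'd expect the reducer's input to look something like this
(these offsets are made up):

0	first matching line
42	another matching line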

*Can anyone explain, or suggest a reason, why this is happening?*

I should also point out that if I use:

-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

...then the byte offsets are *successfully* produced as keys of the mapper
output. But the job takes an extremely long time to complete (and my input
file only has about 50 lines of text in it!).

*Also, I'm just as interested in the answers to these two (related) questions:*

   1. Is it possible for a mapper to manually determine the byte offset of
   each line of data it processes, relative to the file that the data
   belongs to?
   2. Is it possible for a mapper to determine the total number of bytes in
   the file to which the data it is processing belongs?

If yes to either of these questions, how? (Python, or streaming in general.
A rough sketch of the kind of thing I mean for (1) follows.)
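
Here's that sketch. It assumes, and I'm not certain this is correct, that
streaming exports the job configuration to the script as environment
variables with dots replaced by underscores (so map.input.start becomes
map_input_start), and that offsets can be tracked by counting what's read
from stdin:

import os
import sys

# Assumption: map.input.start (the byte offset of this mapper's split
# within the input file) is exposed as the env var map_input_start.
offset = int(os.environ.get('map_input_start', 0))

for line in sys.stdin:
    # offset is this line's position relative to the start of the file,
    # assuming len(line) equals the number of bytes read (true for
    # Python 2 byte strings / single-byte encodings).
    sys.stdout.write('%d\t%s' % (offset, line))
    offset += len(line)

For (2), the only idea I have is shelling out to the hadoop CLI from the
script (e.g. hadoop fs -ls on the path in map_input_file, parsing the size
column), which feels heavy-handed.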

Cheers.

-Rob