Posted to common-dev@hadoop.apache.org by Shahfik Bin AMASHA <sh...@sis.smu.edu.sg> on 2009/10/18 13:58:56 UTC

State Transition problem

Hi,

I am using Hadoop 0.20.1 and I have a problem similar to the one
described below:

Setup
-----
I have a web server log that looks like:

> serial number, domain, datetime, httpstatus

The web server outputs to a single log file for 1,000 domains. I would
like an output in the following format:

> previous serial number, domain, previous datetime, previous httpstatus,
> current serial number, current datetime, current httpstatus

For example, an input of:

> 1728, ...
> 1729, hadoop.apache.org, 2009/10/18 08:23:15, 200
> 1730, ...
> ...
> 1735, ...
> 1736, hadoop.apache.org, 2009/10/18 08:23:19, 404
> 1737, ...
> ...
> 1741, ...
> 1742, hadoop.apache.org, 2009/10/18 08:23:24, 500
> 1743, ...
> ...
> 1750, ...
> 1751, hadoop.apache.org, 2009/10/18 08:23:30, 200
> 1752, ...

would output:

> 1729, hadoop.apache.org, 2009/10/18 08:23:15, 200, 1736, 2009/10/18 08:23:19, 404
> 1736, hadoop.apache.org, 2009/10/18 08:23:19, 404, 1742, 2009/10/18 08:23:24, 500
> 1742, hadoop.apache.org, 2009/10/18 08:23:24, 500, 1751, 2009/10/18 08:23:30, 200


My thoughts
-----------
I have separated the problem into 2 jobs:

1) Do a secondary sort on the log file to produce a file sorted
primarily by <domain> and secondarily by <serial number> (sketch below)

2a) Implement a custom TwoLineRecordReader<LongWritable, Text> that
takes the output of job 1 as its input. The custom RecordReader:
    2a) i) During initialize(InputSplit, TaskAttemptContext), reads the
first line.
    2a) ii) During nextKeyValue(), reads the next line and sets <value>
to firstLine + "|" + secondLine, then sets firstLine to secondLine
(a skeleton is included further below).

2b) The mapper and reducer generate the final output.

I have been successful at job 1.
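
For reference, job 1 is the usual composite-key secondary sort. A
trimmed-down sketch of the pieces I mean is below; the class and field
names are just my own placeholders, and the mapper, reducer and driver
are left out:

// Composite key plus partitioner and grouping comparator for job 1.
// Simplified: no hashCode()/equals(), no error handling.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class DomainSerialKey implements WritableComparable<DomainSerialKey> {
    Text domain = new Text();
    LongWritable serial = new LongWritable();

    public void write(DataOutput out) throws IOException {
        domain.write(out);
        serial.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        domain.readFields(in);
        serial.readFields(in);
    }

    // Sort order: domain first, then serial number.
    public int compareTo(DomainSerialKey other) {
        int cmp = domain.compareTo(other.domain);
        return cmp != 0 ? cmp : serial.compareTo(other.serial);
    }
}

// Partition on the domain only, so every record of a domain reaches the
// same reducer.
class DomainPartitioner extends Partitioner<DomainSerialKey, Text> {
    public int getPartition(DomainSerialKey key, Text value, int numPartitions) {
        return (key.domain.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the domain only, so one reduce() call sees all of a domain's
// records, with values arriving in serial-number order.
class DomainGroupingComparator extends WritableComparator {
    protected DomainGroupingComparator() {
        super(DomainSerialKey.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
        DomainSerialKey left = (DomainSerialKey) a;
        DomainSerialKey right = (DomainSerialKey) b;
        return left.domain.compareTo(right.domain);
    }
}

These get wired in with job.setPartitionerClass(DomainPartitioner.class)
and job.setGroupingComparatorClass(DomainGroupingComparator.class); the
composite key's compareTo() already sorts by domain and then serial
number, so no separate sort comparator is needed.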

Problems
--------
It seems as though the job is not using TwoLineRecordReader, even though
I've specified it through a custom InputFormat. When I println <value> in
the Mapper, I just see the original input lines one at a time, not the
paired lines.

> TwoLineInputFormat.addInputPath(job, new Path("output/sorted/part-r-00000"));
> TextOutputFormat.setOutputPath(job, new Path("output/transitions"));
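
In case the problem is in the InputFormat/RecordReader themselves, here
is a trimmed-down skeleton of what I mean by 2a. It simply wraps the
built-in LineRecordReader; error handling and the handling of lines at
split boundaries are left out:

// TwoLineInputFormat hands out TwoLineRecordReader, which pairs each
// line with the line before it.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class TwoLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new TwoLineRecordReader();
    }
}

class TwoLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lines = new LineRecordReader();
    private final Text firstLine = new Text();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lines.initialize(split, context);
        // Read the first line up front so nextKeyValue() can pair it
        // with the line that follows.
        if (lines.nextKeyValue()) {
            firstLine.set(lines.getCurrentValue());
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!lines.nextKeyValue()) {
            return false;
        }
        Text secondLine = lines.getCurrentValue();
        value.set(firstLine + "|" + secondLine);  // previous|current
        firstLine.set(secondLine);                // slide the window forward
        return true;
    }

    @Override
    public LongWritable getCurrentKey() {
        return lines.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() throws IOException {
        return lines.getProgress();
    }

    @Override
    public void close() throws IOException {
        lines.close();
    }
}

The intent is that each value the mapper sees is of the form
firstLine + "|" + secondLine for two consecutive lines of the sorted
file.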

Call to action
--------------
1) Perhaps I'm not thinking of the problem the right way. Would you
suggest another way to solve it?
2) Am I implementing the custom RecordReader in the right way?


Thank you!

Regards,
Shahfik Amasha
Undergraduate
School of Information Systems
Singapore Management University