Posted to common-dev@hadoop.apache.org by Shahfik Bin AMASHA <sh...@sis.smu.edu.sg> on 2009/10/18 13:58:56 UTC
State Transition problem
Hi,
I am using Hadoop 0.20.1 and I have a problem which is similar to the
one below:
Setup
-----
I have a web server log that looks like:
> serial number, domain, datetime, httpstatus
The web server writes to a single log file for 1,000 domains. I would
like output in the following format, one line per pair of consecutive
entries for the same domain:
> previous serial number, domain, previous datetime, previous
httpstatus, current serial number, current datetime, current httpstatus
For example, an input of:
> 1728 ...
> 1729, hadoop.apache.org, 2009/10/18 08:23:15, 200
> 1730, ...
> ...
> 1735, ...
> 1736, hadoop.apache.org, 2009/10/18 08:23:19, 404
> 1737, ...
> ...
> 1741, ...
> 1742, hadoop.apache.org, 2009/10/18 08:23:24, 500
> 1743, ...
> ...
> 1750, ...
> 1751, hadoop.apache.org, 2009/10/18 08:23:30, 200
> 1752 ...
would output:
> 1729, hadoop.apache.org, 2009/10/18 08:23:15, 200, 1736, 2009/10/18
08:23:19, 404
> 1736, hadoop.apache.org, 2009/10/18 08:23:19, 404, 1742, 2009/10/18
08:23:24, 500
> 1742, hadoop.apache.org, 2009/10/18 08:23:24, 500, 1751, 2009/10/18
08:23:30, 200
My thoughts
-----------
I have separated the problem into 2 jobs:
1) Do a Secondary Sort on the log file to output a file sorted primarily
by <domain> followed by <serial number>
2a) Implement a custom TwoLineRecordReader<LongWritable, Text> that
takes the output of job 1 as its input. The custom RecordReader:
    i) During initialize(InputSplit, TaskAttemptContext), reads the
    first line into firstLine.
    ii) During nextKeyValue(), reads the next line into secondLine and
    sets <value> to firstLine + "|" + secondLine. It then sets
    firstLine to secondLine.
2b) The mapper and reducer generate the output.
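
The two steps above (sort by domain then serial number, then pair each
line with the one before it) can be sketched in plain Java without
Hadoop. This is only a minimal illustration of the intended logic, not
the actual TwoLineRecordReader: the class TransitionSketch, its methods,
and the example.org line are hypothetical. It also assumes that a pair
spanning two different domains should be dropped, which the plan above
does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: names here are illustrative, not from the original job.
class TransitionSketch {

    /** One parsed log line: serial number, domain, datetime, httpstatus. */
    static final class Entry {
        final long serial;
        final String domain;
        final String datetime;
        final String status;

        Entry(long serial, String domain, String datetime, String status) {
            this.serial = serial;
            this.domain = domain;
            this.datetime = datetime;
            this.status = status;
        }

        /** Splits on commas; the datetime field itself contains no comma. */
        static Entry parse(String line) {
            String[] f = line.split(",\\s*");
            return new Entry(Long.parseLong(f[0].trim()), f[1], f[2], f[3].trim());
        }
    }

    /** Job 1 equivalent: order by domain first, then by serial number. */
    static List<Entry> sortByDomainThenSerial(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            entries.add(Entry.parse(line));
        }
        entries.sort(Comparator.comparing((Entry e) -> e.domain)
                               .thenComparingLong(e -> e.serial));
        return entries;
    }

    /** Job 2 equivalent: emit one line per consecutive pair of entries. */
    static List<String> transitions(List<Entry> sorted) {
        List<String> out = new ArrayList<>();
        Entry prev = null;
        for (Entry cur : sorted) {
            // Assumption: skip pairs that cross a domain boundary.
            if (prev != null && prev.domain.equals(cur.domain)) {
                out.add(prev.serial + ", " + prev.domain + ", " + prev.datetime
                        + ", " + prev.status + ", " + cur.serial + ", "
                        + cur.datetime + ", " + cur.status);
            }
            prev = cur; // the current line becomes the "first line" of the next pair
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "1729, hadoop.apache.org, 2009/10/18 08:23:15, 200",
            "1730, example.org, 2009/10/18 08:23:16, 200", // made-up line
            "1736, hadoop.apache.org, 2009/10/18 08:23:19, 404",
            "1742, hadoop.apache.org, 2009/10/18 08:23:24, 500",
            "1751, hadoop.apache.org, 2009/10/18 08:23:30, 200");
        for (String t : transitions(sortByDomainThenSerial(log))) {
            System.out.println(t);
        }
    }
}
```

With the sample lines in main, this prints the three hadoop.apache.org
transitions from the example above. A RecordReader that pairs lines
blindly would also emit a pair at each domain boundary (and could miss
the pair that straddles two input splits), so the domain check or an
equivalent filter in the mapper may be needed.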
I have been successful at job 1.
Problems
--------
It seems as though the job is not using TwoLineRecordReader, even though
I've specified it through a custom InputFormat. Instead, when I println
<value> in the Mapper, I see the lines of the input file unchanged.
The input and output paths are set as follows:
> TwoLineInputFormat.addInputPath(job, new
Path("output/sorted/part-r-00000"));
> TextOutputFormat.setOutputPath(job, new Path("output/transitions"));
Call to action
--------------
1) Perhaps I'm not thinking of the problem the right way. Would you
suggest another way to solve it?
2) Am I implementing the custom RecordReader in the right way?
Thank you!
Regards,
Shahfik Amasha
Undergraduate
School of Information Systems
Singapore Management University