Posted to common-user@hadoop.apache.org by Nguyen Kien Trung <tr...@gmail.com> on 2007/07/06 05:05:54 UTC

Optimizing MapReduce

Hi all,

I have been given a task to extract data from a big file into HDFS.
The input is a 1 GB text file containing millions of lines. A line that starts 
with # marks the beginning of a record. Subsequent lines which don't start 
with # are the data of that record.
E.g.:
# 1 2 A3 LOCS 43
4 FS 23 ....
5 SDF ....
# 3 4 D8
9 FS 45 ...
# 8 DFD 9
1 FS LL
2 LI O

The above file contains 3 records.

The actual file contains around 1.5 million records.
The task is to extract those records, each into its own text file stored in 
HDFS.

I've written a MapReduce program to do this job, but it doesn't seem to 
run as fast as I expected. Furthermore, it eats up all system 
resources on the NameNode machine after a few hours.
I tried to restart the program a few times but it still can't finish the job.

The following describes what I did

First, I wrote a standalone program (sketched below) to split up the file 
into 100 smaller files which contain an equal number of records (not 
lines). Then I use the 100 split files as the input for my MapReduce program.
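
For reference, the splitter is roughly the following plain Java (a sketch only; 
bigfile.txt, the part-N output names, and the records-per-file count are placeholders):

import java.io.*;

// Splits bigfile.txt into numbered part files ("part-0", "part-1", ...), each
// holding roughly the same number of records. A record starts at a line
// beginning with '#'. File names here are placeholders.
public class RecordSplitter {
    public static void main(String[] args) throws IOException {
        int recordsPerFile = 15000;    // ~1.5 million records / 100 files
        BufferedReader in = new BufferedReader(new FileReader("bigfile.txt"));
        int fileNo = 0, recCount = 0;
        PrintWriter out = new PrintWriter(new FileWriter("part-" + fileNo));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("#")) {                      // a new record begins here
                if (recCount > 0 && recCount % recordsPerFile == 0) {
                    out.close();                             // roll over to the next part file
                    out = new PrintWriter(new FileWriter("part-" + (++fileNo)));
                }
                recCount++;
            }
            out.println(line);
        }
        out.close();
        in.close();
    }
}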
This is my configuration and pseudocode for the MapReduce program:

_Configuration:_
Running Hadoop 0.13.0 on 3 machines:
*machine 1*: Pentium D 3.2 GHz, 2 GB RAM. Running as a namenode and a 
jobtracker (1 GB each)
*machine 2*: Dual Core AMD Opteron Processor 170, 2 GB RAM. Running 
as a datanode and a tasktracker (1 GB each) with configuration:
    mapred.tasktracker.tasks.maximum = 4
    mapred.child.java.opts = -Xmx150m
*machine 3*: Pentium 4 HT 3.0 GHz, 2 GB RAM. Running as a datanode and a 
tasktracker (1 GB each) with configuration:
    mapred.tasktracker.tasks.maximum = 4
    mapred.child.java.opts = -Xmx150m
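
For completeness, those two tasktracker settings sit in hadoop-site.xml on each 
tasktracker machine, roughly like this (values as above):

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx150m</value>
</property>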

The MapReduce program is launched from machine 1.

_Pseudocode:_
CustomRecordReader:

constructor(file)
begin
    this.file = file;
end

function next(key, value)
begin
    ((Text) key).set(file.getUri().toString()) // pass only the file name to the mapper
    return true on the first call, false afterwards
end
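
In actual Java the record reader is roughly the following (a sketch against the 
old org.apache.hadoop.mapred API; the NullWritable value type and the exact 
interface methods are my best guess and may differ slightly in 0.13):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

// Emits exactly one (key, value) pair per split: the key is the path of the
// split file, the value is ignored. The mapper reads the file contents itself.
public class CustomRecordReader implements RecordReader<Text, NullWritable> {
    private final Path file;
    private boolean done = false;

    public CustomRecordReader(Path file) {
        this.file = file;
    }

    public boolean next(Text key, NullWritable value) throws IOException {
        if (done) {
            return false;                 // only one record per split
        }
        key.set(file.toString());         // hand the file name to the mapper
        done = true;
        return true;
    }

    public Text createKey() { return new Text(); }
    public NullWritable createValue() { return NullWritable.get(); }
    public long getPos() { return done ? 1 : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
}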

CustomInputFormat:
function getSplits(jobConf, numSplits)
begin
    splits = empty list;
    inputPaths = jobConf.getInputPaths(); // returns only one path: the
                                          // directory holding the 100 split files
    fs = FileSystem.get(jobConf);
    for each path in inputPaths
    begin
       for each file in fs.listPaths(path)
       begin
          splits.add(new FileSplit(file, 0, 1, jobConf)); // length 1: only the
                                                          // name of the file matters
       end
    end
    return splits;
end

function getRecordReader(split, jobConf, reporter)
begin
    return new CustomRecordReader(((FileSplit) split).getPath());
end
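
In Java, the input format is roughly (again a sketch against the old mapred API; 
validateInput may or may not be required depending on the exact version):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Creates one split per input file; a split carries only the file name,
// since the mapper opens and reads the file on its own.
public class CustomInputFormat implements InputFormat<Text, NullWritable> {

    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
        FileSystem fs = FileSystem.get(job);
        for (Path dir : job.getInputPaths()) {       // the one directory of 100 split files
            for (Path file : fs.listPaths(dir)) {
                // length 1 because only the name matters, not the contents
                splits.add(new FileSplit(file, 0, 1, job));
            }
        }
        return splits.toArray(new InputSplit[splits.size()]);
    }

    public RecordReader<Text, NullWritable> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        return new CustomRecordReader(((FileSplit) split).getPath());
    }

    public void validateInput(JobConf job) throws IOException {
        // nothing to check; declared here because some versions of the old
        // InputFormat interface require it
    }
}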

Mapper:

function map(key, value, out, reporter)
begin
    file = new Path(((Text) key).toString());
    recCount = 0;
    start reading the file
    for each set of lines which forms a record
    begin
       recCount++
       out.collect(recCount, recordLines as a string)
    end
    close the file
end
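
In Java, the mapper is roughly (a sketch; the record-grouping loop is spelled out, 
the rest follows the old mapred API and class names are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Opens the split file named by the key, groups its lines into records
// (a record starts at a line beginning with '#'), and emits one
// (record number, record text) pair per record.
public class RecordMapper extends MapReduceBase
        implements Mapper<Text, NullWritable, IntWritable, Text> {

    private JobConf job;

    public void configure(JobConf job) {
        this.job = job;
    }

    public void map(Text key, NullWritable value,
            OutputCollector<IntWritable, Text> out, Reporter reporter)
            throws IOException {
        Path file = new Path(key.toString());
        FileSystem fs = FileSystem.get(job);
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        int recCount = 0;
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("#") && record.length() > 0) {
                // a new '#' line closes the previous record
                out.collect(new IntWritable(recCount++), new Text(record.toString()));
                record.setLength(0);
            }
            record.append(line).append('\n');
        }
        if (record.length() > 0) {        // don't forget the last record in the file
            out.collect(new IntWritable(recCount++), new Text(record.toString()));
        }
        in.close();
    }
}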
   
Reducer:

function reduce(key, values, out, reporter)
begin
    // there may be more than one value, since the same recCount can be
    // produced by multiple mappers
    for each value in values
    begin
       recordLines = value
       xmlRecord = convertToXml(recordLines)
       fileTemp = save xmlRecord to a temp file
       copy fileTemp to HDFS using fileSystem.copyFromLocalFile
       // The reason I save xmlRecord to a temp file is that if I use
       // SequenceFile.Writer, the text that ends up in the HDFS file is not
       // plain text. Is there another solution?
    end
end
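
In Java, the reducer is roughly (a sketch; convertToXml and the /output path 
scheme are placeholders for what I actually do):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Converts each record to XML, writes it to a local temp file, then copies
// that temp file into HDFS as its own output file.
public class RecordReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, NullWritable, NullWritable> {

    private JobConf job;

    public void configure(JobConf job) {
        this.job = job;
    }

    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
            throws IOException {
        FileSystem fs = FileSystem.get(job);
        int seq = 0;    // the same recCount can arrive from several mappers
        while (values.hasNext()) {
            String xmlRecord = convertToXml(values.next().toString());

            File temp = File.createTempFile("record", ".xml");
            FileWriter writer = new FileWriter(temp);
            writer.write(xmlRecord);
            writer.close();

            // placeholder output naming: one HDFS file per (record number, sequence)
            Path dst = new Path("/output/record-" + key.get() + "-" + (seq++) + ".xml");
            fs.copyFromLocalFile(new Path(temp.getAbsolutePath()), dst);
            temp.delete();
        }
    }

    private String convertToXml(String recordLines) {
        // placeholder for the real record-to-XML conversion
        return "<record>\n" + recordLines + "</record>\n";
    }
}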


Driver:
Input format: CustomInputFormat
Output format: NullOutputFormat
Number of map tasks: 7
Number of reduce tasks: 17
Speculative execution: true
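
In Java, the driver is roughly (class names match the sketches above; 
setSpeculativeExecution and a few other JobConf calls are from memory and may 
be spelled slightly differently in 0.13):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class ExtractRecordsDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ExtractRecordsDriver.class);
        job.setJobName("extract-records");

        job.setInputPath(new Path(args[0]));      // directory holding the 100 split files
        job.setInputFormat(CustomInputFormat.class);
        job.setOutputFormat(NullOutputFormat.class);

        job.setMapperClass(RecordMapper.class);
        job.setReducerClass(RecordReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setNumMapTasks(7);
        job.setNumReduceTasks(17);
        job.setSpeculativeExecution(true);

        JobClient.runJob(job);
    }
}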

Sorry for the lengthy post.
Any suggestions and comments are highly appreciated, and I hope our 
discussion will bring more understanding of Hadoop and MapReduce.

Cheers,

Trung