Posted to common-user@hadoop.apache.org by Xuan Dzung Doan <do...@yahoo.com> on 2008/06/27 18:30:55 UTC

Re: Memory mapped file in DFS? (Amazon EC2-based HDFS)

I think I understand the basic functioning of the map reduce setting.

The issue here is that I don't think the input file can be automatically partitioned by the framework. It's a text file containing variable-length string sequences delimited by the star (*) character; the sequences have unknown lengths that vary from a few hundred to hundreds of thousands of characters, and the total file size could reach hundreds of MB. The sequences could be extracted and then processed in parallel.

The only way I know of to extract the sequences is to write code that scans through the file character by character to detect the delimiter (*) and then pulls out each sequence accordingly (I don't think this extraction can be done automatically by the framework). As far as I know, this code should live in my own InputFormat.

And here comes my question: from the perspective of a regular program running on a single machine, because the file is so large, it needs to be mapped into memory before any sort of character-by-character scanning can be performed. Now, from the perspective of a map reduce program running on a cluster handling a file hosted by HDFS, what is the equivalent way to scan through the file character by character? Would the regular memory-mapped file method (as on a single machine) still work (which I suppose it wouldn't), or even be needed?
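
To make concrete what I'm trying to do, here is my rough guess at the kind of record reader I imagine writing (the class name StarDelimitedRecordReader is just something I made up, written against the 0.17 mapred API; I've left out the surrounding InputFormat, ignored sequences that straddle split boundaries, and I don't know whether FSDataInputStream is even the right way to read from HDFS):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;

    // Emits one '*'-delimited sequence per call to next().
    public class StarDelimitedRecordReader implements RecordReader<LongWritable, Text> {
      private final FSDataInputStream in;
      private final long start;
      private final long end;
      private long pos;

      public StarDelimitedRecordReader(FileSplit split, JobConf conf) throws IOException {
        FileSystem fs = split.getPath().getFileSystem(conf);
        in = fs.open(split.getPath());
        start = split.getStart();
        end = start + split.getLength();
        in.seek(start);
        pos = start;
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        if (pos >= end) {
          return false;
        }
        key.set(pos);
        StringBuilder sequence = new StringBuilder();
        int b;
        // Stream the file byte by byte through the HDFS client -- no memory mapping.
        while ((b = in.read()) != -1) {
          pos++;
          if (b == '*') {        // delimiter ends the current sequence
            break;
          }
          sequence.append((char) b);
        }
        value.set(sequence.toString());
        return sequence.length() > 0;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return pos; }
      public float getProgress() {
        return Math.min(1.0f, (pos - start) / (float) (end - start));
      }
      public void close() throws IOException { in.close(); }
    }

Is something along those lines how this is supposed to be done?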

I hope my question is clear, and I'd appreciate any suggestions.

Thanks,
David.


----- Original Message ----
From: "Goel, Ankur" <An...@corp.aol.com>
To: core-user@hadoop.apache.org
Sent: Thursday, June 26, 2008 11:37:00 PM
Subject: RE: Memory mapped file in DFS?

In a map reduce setting, files are read as a sequence of records. In the
mappers you process each record to generate an intermediate set of (key,
value) pairs. All the values for a particular key are collected, grouped
together, and provided as (key, value1, value2, ...) to the reducers.
The input data set is automatically partitioned to generate the right
number of mappers, so you don't need to explicitly memory-map the input
files in a mapper or reducer.

The sequencing logic needs to be broken down into a map-reduce setting
where you identify the sequence the current record belongs to and
generate a key that represents this sequence; the value is whatever
other information needs to be collected from this record for that
sequence. The reducer will automatically get all the values for a
particular sequence key, and then you can do whatever you want with
them.
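
For example, assuming your InputFormat delivers one record at a time as a
line of text, the job could be shaped roughly like this (the class names
and the key-derivation logic are just placeholders, written against the
0.17 mapred API):

    // SequenceMapper.java
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: decide which sequence this record belongs to and emit (sequenceKey, record).
    public class SequenceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String seqKey = deriveSequenceKey(record.toString()); // your own logic goes here
        out.collect(new Text(seqKey), record);
      }

      private String deriveSequenceKey(String record) {
        // Placeholder: whatever identifies the sequence this record is part of.
        return record.length() == 0 ? "" : record.substring(0, 1);
      }
    }

    // SequenceReducer.java (separate file)
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Reducer: receives all records for one sequence key, already grouped by the framework.
    public class SequenceReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text seqKey, Iterator<Text> records,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        while (records.hasNext()) {
          // Process the sequence however you like, then emit the result.
          out.collect(seqKey, records.next());
        }
      }
    }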

Hope this helps.

-Ankur

-----Original Message-----
From: Xuan Dzung Doan [mailto:doanxuandung@yahoo.com] 
Sent: Friday, June 27, 2008 5:38 AM
To: core-user@hadoop.apache.org
Subject: Memory mapped file in DFS?

Hi,

I'm converting a regular sequential Java program running on one machine
into a map/reduce program running on an Amazon EC2 cluster. The program
scans through a text file whose size may reach hundreds of MB and
extracts sequences from it for further processing. Because the file
can't fit into main memory, the current program maps it into memory
using the memory-mapped file facility provided by Java NIO, like this:

MappedByteBuffer db = (new RandomAccessFile(fileName, "r"))
    .getChannel()
    .map(FileChannel.MapMode.READ_ONLY, startPos, length);

Now, for the new map/reduce program, the input text file will be copied
to HDFS on the cluster (I suppose) and become an HDFS file. I think I
need to define my own InputFormat that scans through the file and
extracts sequences from it just like the sequential program does. But
how can this be done? Will the memory-mapped file method still work? I
suspect not, because it should only work with a local file system (like
Windows or Linux), not with a more abstract file system like HDFS. Is
that correct? If so, how can I have my InputFormat scan through the
entire file?
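
For what it's worth, here is my rough guess at the HDFS-side equivalent of
the snippet above, reusing fileName, startPos and length, with
Configuration from org.apache.hadoop.conf and FileSystem,
FSDataInputStream and Path from org.apache.hadoop.fs (I don't know if
this is the intended approach):

    // Rough sketch: read from startPos through the HDFS client instead of mmap.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(fileName));
    in.seek(startPos);                    // FSDataInputStream supports seek()
    int b;
    long bytesRead = 0;
    while (bytesRead < length && (b = in.read()) != -1) {
        bytesRead++;
        // scan for the '*' delimiter here, character by character
    }
    in.close();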

Highly appreciate any input.

Thanks,
David.