Posted to common-user@hadoop.apache.org by Juho Mäkinen <ju...@gmail.com> on 2008/09/15 15:13:38 UTC

Implementing own InputFormat and RecordReader

I'm trying to implement my own InputFormat and RecordReader for my
data and I'm stuck as I can't find enough documentation about the
details.

My input format is tightly packed binary data consisting of individual
event packets. Each event packet contains its length, and the packets
are simply appended to the end of a file. Thus the file must be read
as a stream and cannot be split.
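(So on disk a file looks roughly like [length][packet][length][packet]...,
with nothing between packets to resynchronize on.)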

FileInputFormat seems like a reasonable place to start but I
immediately ran into problems and unanswered questions:

1) The FileInputFormat.getSplits() returns InputSplit[] array. If my
input file is 128MB and my HDFS block size is 64MB, will it return one
InputSplit or two InputSplits?

2) If my file is split into two or more filesystem blocks, how will
Hadoop handle reading those blocks? As the file must be read in
sequence, will Hadoop first copy every block to one machine (if the
blocks aren't already there) and then start the mapper on that
machine? Do I need to handle opening and reading multiple blocks
myself, or will Hadoop provide a simple stream interface which I can
use to read the entire file without worrying whether the file is
larger than the HDFS block size?

 - Juho Mäkinen

Re: Implementing own InputFormat and RecordReader

Posted by Owen O'Malley <om...@apache.org>.
On Sep 15, 2008, at 6:13 AM, Juho Mäkinen wrote:

> 1) The FileInputFormat.getSplits() returns InputSplit[] array. If my
> input file is 128MB and my HDFS block size is 64MB, will it return one
> InputSplit or two InputSplits?

Your InputFormat needs to define:

protected boolean isSplitable(FileSystem fs, Path filename) {
   // never split: hand each file to a single mapper in one piece
   return false;
}

which tells FileInputFormat.getSplits() not to split files. You will
end up with a single split for each file.
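
For reference, a rough sketch of such an InputFormat against the old
org.apache.hadoop.mapred API might look like this. EventInputFormat,
EventRecordReader and the LongWritable/BytesWritable key/value types
are just placeholder names for whatever fits your packet format:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class EventInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  // Tell getSplits() never to cut a file: one split (and one mapper) per file.
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

  // Hand each split to a reader for the packet stream (sketched further down).
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new EventRecordReader((FileSplit) split, job);
  }
}

You would then point your job at it with JobConf.setInputFormat(EventInputFormat.class).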

> 2) If my file is split into two or more filesystem blocks, how will
> Hadoop handle reading those blocks? As the file must be read in
> sequence, will Hadoop first copy every block to one machine (if the
> blocks aren't already there) and then start the mapper on that
> machine? Do I need to handle opening and reading multiple blocks
> myself, or will Hadoop provide a simple stream interface which I can
> use to read the entire file without worrying whether the file is
> larger than the HDFS block size?

HDFS transparently handles the data motion for you. You can just use
FileSystem.open(path) and HDFS will pull the file from the closest
location. It doesn't actually copy the blocks to your local disk; it
just streams the data to the application. Basically, you don't need
to worry about it.
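
For example, a RecordReader built on top of that stream might look
roughly like the sketch below. Again, this is only an illustration:
the 4-byte length prefix and the offset/bytes key/value pair are
assumptions, not something your format dictates.

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class EventRecordReader
    implements RecordReader<LongWritable, BytesWritable> {

  private final FSDataInputStream in;
  private final long length;

  public EventRecordReader(FileSplit split, JobConf job) throws IOException {
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(job);
    this.in = fs.open(path);        // HDFS streams the blocks to us transparently
    this.length = split.getLength();
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    key.set(in.getPos());           // use the packet's byte offset as the key
    int packetLength;
    try {
      packetLength = in.readInt();  // assumed 4-byte length prefix
    } catch (EOFException eof) {
      return false;                 // clean end of the packet stream
    }
    byte[] payload = new byte[packetLength];
    in.readFully(payload);          // read the packet body itself
    value.set(payload, 0, packetLength);
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() throws IOException { return in.getPos(); }
  public void close() throws IOException { in.close(); }
  public float getProgress() throws IOException {
    return length == 0 ? 1.0f : Math.min(1.0f, in.getPos() / (float) length);
  }
}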

There are two downsides to unsplittable files. The first is that if
they are large, the map times can be very long. The second is that the
map/reduce scheduler tries to place each task close to its data, which
it can't do very well if the data spans blocks. Of course, if the data
isn't splittable, you don't have a choice.

-- Owen