Posted to common-user@hadoop.apache.org by "Patterson, Josh" <jp...@tva.gov> on 2009/03/17 21:39:18 UTC

RecordReader design heuristic

I am currently working on a RecordReader for a custom binary time series
file format and was wondering how to design the InputFormat/RecordReader
pair as efficiently as possible. Reading through:

http://wiki.apache.org/hadoop/HadoopMapReduce

gave me a lot of hints about how the various classes work together to
read any type of file. I was looking at how TextInputFormat uses
LineRecordReader to send individual lines to each mapper. My question
is: what is a good heuristic for choosing how much data to send to each
mapper? With the stock LineRecordReader each mapper only gets to work
with a single line, which leads me to believe that we want to give each
mapper very little work. Currently I'm looking at either sending each
mapper a single point of data (10 bytes), which seems small, or sending
each mapper a block of data (around 819 points at 10 bytes each, about
8,190 bytes). I'm leaning towards sending the block to the mapper.
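
For reference, here is a bare-bones sketch of what the per-point option
might look like against the old (0.20-era) mapred API. The class name,
the fixed 10-byte record width, and the BytesWritable value type are
illustrative choices on my part, not part of the actual format, and any
file-level metadata handling is omitted:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

/** Sketch: emits one fixed-width 10-byte point per record. */
public class TimeSeriesRecordReader
    implements RecordReader<LongWritable, BytesWritable> {

  private static final int POINT_SIZE = 10;  // bytes per point

  private final FSDataInputStream in;
  private final long start;
  private final long end;
  private long pos;

  public TimeSeriesRecordReader(FileSplit split, JobConf job) throws IOException {
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    in = fs.open(file);
    // Round the split start up to the next point boundary so no two readers
    // emit the same point (same idea as LineRecordReader skipping a partial line).
    start = ((split.getStart() + POINT_SIZE - 1) / POINT_SIZE) * POINT_SIZE;
    end = split.getStart() + split.getLength();
    pos = start;
    in.seek(start);
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos >= end) {
      return false;  // points starting past the split end belong to the next reader
    }
    byte[] buf = new byte[POINT_SIZE];
    in.readFully(buf);
    key.set(pos);                   // key = byte offset of the point in the file
    value.set(buf, 0, POINT_SIZE);  // value = the raw 10-byte point
    pos += POINT_SIZE;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public void close() throws IOException { in.close(); }
  public float getProgress() {
    return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
}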
 
These factors are based around dealing with a legacy file format (for
now), so I'm just trying to make the best tradeoff possible in the short
term until I get some basic things rolling, at which point I can suggest
a better storage format, or just start converting the groups of stored
points into a format that fits the platform better. I understand that
the InputFormat is not really trying to make much meaning out of the
data, other than helping to get the correct data out of the file based
on the file split variables. Another question I have is: with a pretty
much stock install, generally how big is each FileSplit?
 
Josh Patterson
TVA

Re: RecordReader design heuristic

Posted by Tom White <to...@cloudera.com>.
Josh,

Sounds like your file format is similar to Hadoop's SequenceFile
(although SequenceFile keeps its metadata at the head of the file), so
it might be worth considering going forward. SequenceFile also has
support for block compression, which can be useful (and the files can
still be split).
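
If you do end up on SequenceFile, block compression is just a flag on
the writer. A minimal sketch with the 0.20-era API (the output path and
the key/value types here are placeholders, not a recommendation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/points.seq");  // placeholder path

    // BLOCK compression packs many records into each compressed block,
    // and the resulting file can still be split by MapReduce.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, DoubleWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      // e.g. key = timestamp in millis, value = one measured point
      writer.append(new LongWritable(1237316358000L), new DoubleWritable(59.98));
    } finally {
      writer.close();
    }
  }
}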

The approach you describe for splitting your existing files sounds good.

Tom

RE: RecordReader design heuristic

Posted by "Patterson, Josh" <jp...@tva.gov>.
Hi Tom,
Yeah, I'm assuming the splits are going to be about a single DFS block
size (64MB here). Each file I'm working with is around 1.5GB in size and
has a sort of File Allocation Table at the very end which tells you the
block sizes inside the file, plus some other info. Once I pull that info
out of the tail end of the file, I can calculate which "internal blocks"
lie inside the split's byte range, pull those out, and push the
individual data points up to the mapper, as well as deal with any block
that falls over the split boundary (I'm assuming right now that I'll use
the same idea as the line-oriented reader and just read any block that
falls over the end point of the split, unless it's the first split
section). I guess the only hit I'm going to take here is having to ask
the DFS for a quick read into the last 16 bytes of the whole file, where
my file info is stored. Splitting this file format doesn't seem to be so
bad; it's just a matter of finding which multiples of the "internal file
block" size fit inside the split range, and getting that multiple factor
beforehand.
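
A rough sketch of that arithmetic, assuming (purely for illustration) a
fixed internal block size stored in the first 8 bytes of that 16-byte
trailer; the real trailer layout is different and the names here are
made up:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitMath {

  /** Reads a hypothetical 16-byte trailer: [long blockSize][long other]. */
  static long readInternalBlockSize(FileSystem fs, Path file) throws IOException {
    long fileLen = fs.getFileStatus(file).getLen();
    FSDataInputStream in = fs.open(file);
    try {
      byte[] trailer = new byte[16];
      in.readFully(fileLen - 16, trailer);       // one positioned read at the tail
      return ByteBuffer.wrap(trailer).getLong(); // first 8 bytes = internal block size
    } finally {
      in.close();
    }
  }

  /** Index of the first internal block whose start offset lies inside the split. */
  static long firstBlockIndex(long splitStart, long internalBlockSize) {
    return (splitStart + internalBlockSize - 1) / internalBlockSize;  // round up
  }

  /**
   * Index of the last internal block this split should read: the block containing
   * (splitEnd - 1). A block straddling the end of the split is read by this split,
   * the same convention the line-oriented reader uses.
   */
  static long lastBlockIndex(long splitEnd, long internalBlockSize) {
    return (splitEnd - 1) / internalBlockSize;
  }
}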

After I get some of the mechanics of the process down and can show the
team some valid results, I may be able to talk them into moving to
another format that works better with MR. If anyone has ideas on which
file formats work best for storing and processing large amounts of time
series points with MR, I'm all ears. We're moving towards a new
philosophy with respect to big data, so it's a good time for us to
examine best practices going forward.

Josh Patterson
TVA

Re: RecordReader design heuristic

Posted by Tom White <to...@cloudera.com>.
Hi Josh,

The other aspect to think about when writing your own record reader is
input splits. As Jeff mentioned, you really want mappers to be
processing about one HDFS block's worth of data. If your inputs are
significantly smaller, the overhead of creating mappers will be high and
your jobs will be inefficient. On the other hand, if your inputs are
significantly larger, then you need to split them, otherwise each mapper
will take a very long time processing each split. Some file formats are
inherently splittable, meaning you can re-align with record boundaries
from an arbitrary point in the file. Examples include line-oriented text
(split at newlines) and bzip2 (which has a unique block marker). If your
format is splittable then you will be able to take advantage of this to
make MR processing more efficient.
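
A splittable custom format then needs little more than an InputFormat
hook. Here is a minimal sketch against the old mapred API, where
TimeSeriesInputFormat and TimeSeriesRecordReader are the hypothetical
classes sketched earlier in this thread, not anything that ships with
Hadoop:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class TimeSeriesInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  // Fixed-width records make the format splittable: from any byte offset the
  // reader can round up to the next record boundary, so arbitrary splits are fine.
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return true;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // The default getSplits() hands out roughly block-sized FileSplits;
    // the reader aligns each one to record boundaries.
    return new TimeSeriesRecordReader((FileSplit) split, job);
  }
}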

Cheers,
Tom

RE: RecordReader design heuristic

Posted by "Patterson, Josh" <jp...@tva.gov>.
Jeff,
Yeah, the mapper sitting on a dfs block is pretty cool.

Also, yes, we are about to start crunching a lot of energy smart grid
data. TVA is sort of like "Switzerland" for smart grid power generation
and transmission data across the nation. Right now we have about 12TB,
and this is slated to be around 30TB by the end of 2010 (possibly more,
depending on how many more PMUs come online). I am very interested in
Mahout and have read up on it; it has many algorithms that I am familiar
with from grad school. I will be doing some very simple MR jobs early
on, like finding the average frequency for a range of data (a rough
sketch of such a job is below), and I've been selling various groups
internally on what CAN be done with good data mining and tools like
Hadoop/Mahout. Our production cluster won't be online for a few more
weeks, but that part is already rolling, so I've moved on to focus on
designing the first jobs to find quality "results/benefits" that I can
"sell" in order to campaign for the more ambitious projects I have drawn
up. I know time series data lends itself to many machine learning
applications, so, yes, I would be very interested in talking to anyone
who wants to talk or share notes on Hadoop and machine learning. I
believe Mahout can be a tremendous resource for us and definitely plan
on running and contributing to it.
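
That first average-frequency job really is as small as it sounds. A
rough sketch with the old mapred API; the input layout assumed here
(one "id<TAB>frequency" pair per line) is my own placeholder, not the
real PMU format:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageFrequency {

  /** Emits (signal id, frequency sample) for each input line "id<TAB>frequency". */
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, DoubleWritable> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\t");
      out.collect(new Text(fields[0]),
                  new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  /** Averages all frequency samples seen for one signal id. */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text id, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> out, Reporter reporter)
        throws IOException {
      double sum = 0.0;
      long count = 0;
      while (values.hasNext()) {
        sum += values.next().get();
        count++;
      }
      out.collect(id, new DoubleWritable(sum / count));
    }
  }
}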

Josh Patterson
TVA

Re: RecordReader design heuristic

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Josh,
It seemed like you had a conceptual wire crossed and I'm glad to help
out. The neat thing about Hadoop mappers is - since they are given a
replicated HDFS block to munch on - the job scheduler has <replication
factor> number of node choices where it can run each mapper. This means
mappers are almost always reading from local storage.
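
Those node choices come from the host list attached to each split. A
sketch of where that list comes from, roughly what the stock
FileInputFormat's getSplits() does, simplified here to one split per
block:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

public class LocalitySketch {

  /** One split per HDFS block, tagged with the datanodes holding a replica of it. */
  static List<FileSplit> splitsForFile(JobConf job, Path file) throws IOException {
    FileSystem fs = file.getFileSystem(job);
    FileStatus status = fs.getFileStatus(file);
    long blockSize = status.getBlockSize();
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (long offset = 0; offset < status.getLen(); offset += blockSize) {
      long length = Math.min(blockSize, status.getLen() - offset);
      // Ask the namenode which hosts store this byte range (one block here).
      BlockLocation[] locations = fs.getFileBlockLocations(status, offset, length);
      // The scheduler tries to run the mapper on one of these hosts,
      // which is how map tasks usually end up reading from a local disk.
      splits.add(new FileSplit(file, offset, length, locations[0].getHosts()));
    }
    return splits;
  }
}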

On another note, I notice you are processing what looks to be large 
quantities of vector data. If you have any interest in clustering this 
data you might want to look at the Mahout project 
(http://lucene.apache.org/mahout/). We have a number of Hadoop-ready 
clustering algorithms, including a new non-parametric Dirichlet Process 
Clustering implementation that I committed recently. We are pulling it 
all together for a 0.1 release and I would be very interested in helping 
you to apply these algorithms if you have an interest.

Jeff


RE: RecordReader design heuristic

Posted by "Patterson, Josh" <jp...@tva.gov>.
Jeff,
OK, that makes more sense; I was under the misimpression that it was creating and destroying mappers for each input record. I don't know why I had that in my head. My design suddenly became a lot clearer, and this provides a much cleaner abstraction. Thanks for your help!

Josh Patterson
TVA


Re: RecordReader design heuristic

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Josh,

Well, I don't really see how you will get more mappers, just simpler
logic in the mapper. The number of mappers is driven by how many input
files you have and their sizes, not by any chunking you do in the
record reader. Each record reader gets an entire split and feeds it to
its mapper as a stream, one record at a time. You can duplicate some of
that logic in the mapper if you want, but you will already have it in
the reader, so why bother?
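
Concretely, the per-split driver loop the framework runs amounts to
something like the following (a paraphrase of what the old API's
MapRunner does, not a copy of it):

import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class DriverLoopSketch {

  /** The gist of what happens to one split: one reader, one mapper, many records. */
  static <K1, V1, K2, V2> void runSplit(RecordReader<K1, V1> reader,
                                        Mapper<K1, V1, K2, V2> mapper,
                                        OutputCollector<K2, V2> output,
                                        Reporter reporter) throws IOException {
    K1 key = reader.createKey();
    V1 value = reader.createValue();
    // The reader walks its entire split and hands the mapper one record
    // at a time; mappers are not created or destroyed per record.
    while (reader.next(key, value)) {
      mapper.map(key, value, output, reporter);
    }
  }
}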

Jeff


RE: RecordReader design heuristic

Posted by "Patterson, Josh" <jp...@tva.gov>.
Jeff,
So if I'm hearing you right, it's "good" to send one point of data (10
bytes here) to a single mapper? This mindset increases the number of
mappers, but keeps their logic scaled down to simply "look at this
record and emit/don't emit" -- which is considered more favorable? I'm
still getting the hang of the MR design tradeoffs; thanks for your
feedback.

Josh Patterson
TVA

Re: RecordReader design heuristic

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
If you send a single point to the mapper, your mapper logic will be
clean and simple. Otherwise you will need to loop over your block of
points in the mapper. In Mahout clustering, I send the mapper individual
points because the input file is point-per-line. In either case, the
record reader will be iterating over a block of data to provide mapper
inputs. IIRC, splits will generally be an HDFS block or less, so if you
have files smaller than that you will get one mapper per file. For
larger files you can get up to one mapper per split block.
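
To make the tradeoff concrete, the two mapper shapes look roughly like
this. The "signal" key, the value types, and the point decoding are
invented for illustration; the real 10-byte point layout isn't specified
in this thread:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapperGranularity {

  /** Reader emits one 10-byte point per record: the mapper body stays trivial. */
  public static class PointMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesWritable, Text, DoubleWritable> {
    public void map(LongWritable offset, BytesWritable point,
                    OutputCollector<Text, DoubleWritable> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("signal"), new DoubleWritable(decodeAt(point.getBytes(), 0)));
    }
  }

  /** Reader emits one ~8,190-byte internal block per record: the mapper loops. */
  public static class BlockMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesWritable, Text, DoubleWritable> {
    private static final int POINT_SIZE = 10;
    public void map(LongWritable offset, BytesWritable block,
                    OutputCollector<Text, DoubleWritable> out, Reporter reporter)
        throws IOException {
      byte[] bytes = block.getBytes();
      for (int i = 0; i + POINT_SIZE <= block.getLength(); i += POINT_SIZE) {
        out.collect(new Text("signal"), new DoubleWritable(decodeAt(bytes, i)));
      }
    }
  }

  // Hypothetical decoder: pretend the first 8 bytes of a point are a double.
  static double decodeAt(byte[] b, int off) {
    return ByteBuffer.wrap(b, off, 8).getDouble();
  }
}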

Jeff
