Posted to common-user@hadoop.apache.org by "Patterson, Josh" <jp...@tva.gov> on 2009/03/24 19:04:04 UTC

Small Test Data Sets

I want to confirm with the list something that I'm seeing:
 
I needed to confirm that my Reader was reading our file format
correctly, so I created a MR job that simply output each K/V pair to the
reducer, which then just wrote out each one to the output file. This
allows me to check by hand that all K/V points of data from our file
format are getting pulled out of the file correctly. I have set up our
InputFormat, RecordReader, and Reader subclasses for our specific file
format.
 
While running some basic tests on a small (1 MB) single file I noticed
something odd --- I was getting 2 copies of each data point in the
output file. Initially I thought my Reader was just somehow reading the
data point and not moving the read head, but I verified that was not the
case through a series of tests.
 
I then went on to reason that since I had 2 mappers by default on my
job, and only 1 input file, that each mapper must be reading the file
independently. I then set the -m flag to 1, and I got the proper output.
Is it safe to assume in testing on a file that is smaller than the block
size that I should always use -m 1 in order to get proper block->mapper
mapping? Also, should I assume that if you have more mappers than disk
blocks involved, you will get duplicate values? I may have set
something wrong; I just wanted to check. Thanks.
 
Josh Patterson
TVA

RE: Small Test Data Sets

Posted by "Patterson, Josh" <jp...@tva.gov>.

You are exactly right: there was a secondary constructor in my Reader
class that was not setting its split start and length correctly, so each
one was just reading the whole file. I missed a silly one; thanks for
the heads up!

Josh Patterson
TVA 
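
To illustrate the kind of bug described above, here is a minimal sketch. It
is not the actual Reader class from this thread: ReaderSketch, BoundedReader,
RECORD_SIZE, and the fixed-width record layout are all hypothetical names and
assumptions chosen for the example, and no Hadoop classes are used. The idea
is that the secondary constructor delegates to the primary one, so the split
start and length are always set in a single place.

```java
import java.util.ArrayList;
import java.util.List;

final class ReaderSketch {

    /** Hypothetical reader over fixed-width records (an assumption for this sketch). */
    static final class BoundedReader {
        static final int RECORD_SIZE = 4; // assumed record width in bytes

        private final byte[] data;
        private final long end; // exclusive end offset of this reader's split
        private long pos;       // offset of the next record to return

        /** Primary constructor: only records that begin in [start, start + length). */
        BoundedReader(byte[] data, long start, long length) {
            this.data = data;
            this.end = Math.min(start + length, data.length);
            // Round start up to the first record boundary at or after it.
            this.pos = ((start + RECORD_SIZE - 1) / RECORD_SIZE) * RECORD_SIZE;
        }

        /** Secondary constructor: whole file, delegating so the bounds are never skipped. */
        BoundedReader(byte[] data) {
            this(data, 0, data.length);
        }

        /** Returns the offset of the next record, or -1 once the split is exhausted. */
        long next() {
            if (pos >= end || pos + RECORD_SIZE > data.length) {
                return -1;
            }
            long offset = pos;
            pos += RECORD_SIZE;
            return offset;
        }
    }

    /** Drains a reader and collects the record offsets it produced. */
    static List<Long> readAll(BoundedReader reader) {
        List<Long> offsets = new ArrayList<>();
        for (long off = reader.next(); off != -1; off = reader.next()) {
            offsets.add(off);
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] file = new byte[16]; // four 4-byte records
        System.out.println(readAll(new BoundedReader(file, 0, 8))); // first half
        System.out.println(readAll(new BoundedReader(file, 8, 8))); // second half
    }
}
```

With the bounds honored, the two readers return offsets [0, 4] and [8, 12],
so every record is seen exactly once. If both mappers were handed a reader
built over the whole file instead, each would return all four records and the
output would contain every data point twice.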

-----Original Message-----
From: Enis Soztutar [mailto:enis.soz@gmail.com] 
Sent: Wednesday, March 25, 2009 5:27 AM
To: core-user@hadoop.apache.org
Subject: Re: Small Test Data Sets

Patterson, Josh wrote:
> I want to confirm something with the list that I'm seeing;
>  
> I needed to confirm that my Reader was reading our file format
> correctly, so I created a MR job that simply output each K/V pair to the
> reducer, which then just wrote out each one to the output file. This
> allows me to check by hand that all K/V points of data from our file
> format are getting pulled out of the file correctly. I have setup our
> InputFormat, RecordReader, and Reader subclasses for our specific file
> format.
>  
> While running some basic tests on a small (1meg) single file I noticed
> something odd --- I was getting 2 copies of each data point in the
> output file. Initially I thought my Reader was just somehow reading the
> data point and not moving the read head, but I verified that was not the
> case through a series of tests.
>  
> I then went on to reason that since I had 2 mappers by default on my
> job, and only 1 input file, that each mapper must be reading the file
> independently. I then set the -m flag to 1, and I got the proper output;
> Is it safe to assume in testing on a file that is smaller than the block
> size that I should always use -m 1 in order to get proper block->mapper
> mapping? Also, should I assume that if you have more mappers than disk
> blocks involved that you will get duplicate values? I may have set
> something wrong, I just wanted to check. Thanks
>  
> Josh Patterson
> TVA
>  
>
>   
If you have developed your own InputFormat, then the problem might be there.
The job of the InputFormat is to create input splits and readers. For one
file and two mappers, the input format should return two splits, each
representing half of the file. In your case, I assume you return two splits,
each containing the whole file. Is this the case?

Enis

Re: Small Test Data Sets

Posted by Enis Soztutar <en...@gmail.com>.

Patterson, Josh wrote:
> I want to confirm something with the list that I'm seeing;
>  
> I needed to confirm that my Reader was reading our file format
> correctly, so I created a MR job that simply output each K/V pair to the
> reducer, which then just wrote out each one to the output file. This
> allows me to check by hand that all K/V points of data from our file
> format are getting pulled out of the file correctly. I have setup our
> InputFormat, RecordReader, and Reader subclasses for our specific file
> format.
>  
> While running some basic tests on a small (1meg) single file I noticed
> something odd --- I was getting 2 copies of each data point in the
> output file. Initially I thought my Reader was just somehow reading the
> data point and not moving the read head, but I verified that was not the
> case through a series of tests.
>  
> I then went on to reason that since I had 2 mappers by default on my
> job, and only 1 input file, that each mapper must be reading the file
> independently. I then set the -m flag to 1, and I got the proper output;
> Is it safe to assume in testing on a file that is smaller than the block
> size that I should always use -m 1 in order to get proper block->mapper
> mapping? Also, should I assume that if you have more mappers than disk
> blocks involved that you will get duplicate values? I may have set
> something wrong, I just wanted to check. Thanks
>  
> Josh Patterson
> TVA
>  
>
>   
If you have developed your own InputFormat, then the problem might be there.
The job of the InputFormat is to create input splits and readers. For one
file and two mappers, the input format should return two splits, each
representing half of the file. In your case, I assume you return two splits,
each containing the whole file. Is this the case?

Enis
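
The split arithmetic described above can be sketched roughly as follows. This
is a simplified illustration only: SplitSketch, ByteRange, and computeSplits
are hypothetical names, no Hadoop classes are involved, and a real
InputFormat would typically also take block size and record boundaries into
account when choosing split points.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {

    /** Hypothetical stand-in for a file split: the byte range [start, start + length). */
    static final class ByteRange {
        final long start;
        final long length;

        ByteRange(long start, long length) {
            this.start = start;
            this.length = length;
        }

        long end() {
            return start + length;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end() + ")";
        }
    }

    /** Divides fileLength bytes into numSplits contiguous, non-overlapping ranges. */
    static List<ByteRange> computeSplits(long fileLength, int numSplits) {
        if (fileLength < 0 || numSplits < 1) {
            throw new IllegalArgumentException("need fileLength >= 0 and numSplits >= 1");
        }
        List<ByteRange> splits = new ArrayList<>();
        long base = fileLength / numSplits;
        long remainder = fileLength % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // The first `remainder` splits take one extra byte so the ranges cover the file.
            long length = base + (i < remainder ? 1 : 0);
            splits.add(new ByteRange(start, length));
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // One 1 MB file, two mappers: each split should describe half of the file.
        for (ByteRange range : computeSplits(1_048_576L, 2)) {
            System.out.println(range);
        }
    }
}
```

For a 1,048,576-byte file and two splits this prints [0, 524288) and
[524288, 1048576). Returning the range [0, 1048576) twice instead would
reproduce the duplicated output from the original question.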