Posted to common-user@hadoop.apache.org by Jeff LI <un...@gmail.com> on 2013/03/06 19:21:09 UTC

Difference between HDFS_BYTES_READ and the actual size of input files

Dear Hadoop Users,

I recently noticed there is a difference between the File System Counter
"HDFS_BYTES_READ" and the actual size of the input files in map-reduce jobs.
The difference seems to increase as the size of each key,value pair
increases.

For example, I'm running the same job on two datasets.  Both datasets are
the same size, about 128 GB, and the keys are integers.  The difference
between the two datasets is the number of key,value pairs and thus the size
of each value: dataset1 has 2^17 key,value pairs with 1MB per value;
dataset2 has 2^12 key,value pairs with 32MB per value.
The HDFS_BYTES_READ counter is 128.77GB for dataset1 and 152.00GB for
dataset2.
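
In case it helps to reproduce the comparison, below is a minimal sketch of
how the counter and the raw input size can be pulled from the driver after
the job finishes.  The group/counter names ("FileSystemCounters",
"HDFS_BYTES_READ") are what my version reports in the job output, and the
exact counter API differs between releases, so please treat the names and
calls as assumptions to check against your own version.

// Sketch: compare HDFS_BYTES_READ with the on-disk size of the input.
// The counter group/name and the findCounter(String, String) call are
// assumptions based on my version's job output; verify before relying on them.
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CounterVsInputSize {
  public static void report(Job completedJob, Path inputDir) throws Exception {
    long bytesRead = completedJob.getCounters()
        .findCounter("FileSystemCounters", "HDFS_BYTES_READ").getValue();

    FileSystem fs = inputDir.getFileSystem(completedJob.getConfiguration());
    ContentSummary summary = fs.getContentSummary(inputDir);
    long inputBytes = summary.getLength();  // logical size, not counting replication

    System.out.printf("HDFS_BYTES_READ = %d, input = %d, extra = %d bytes%n",
        bytesRead, inputBytes, bytesRead - inputBytes);
  }
}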

I have also tested other value sizes.  There doesn't seem to be
any difference when each value is small (<=128KB),
but there is a noticeable difference as the size increases.

Could you give me some idea of why this is happening?

By the way, I'm running Hadoop 0.20.2-cdh3u5.

Cheers

Jeff

Re: Difference between HDFS_BYTES_READ and the actual size of input files

Posted by Jeff LI <un...@gmail.com>.
Thanks, Jeff, for your suggestions.
I ran another test to look into the extra copies you suggested.

First, I forgot to mention that in the previous test there were 256 files in
each dataset.
I ran another test with one change: the number of files went from 256 to
1024, which means that in the latter case there is one input split per file
(given that the block size is 128MB).  It turned out that there is no
difference, or only a very slight difference, in HDFS_BYTES_READ.

So I think this confirms your suggestion to some extent: there is probably a
lot of data being "copied twice".  A rough count of where that can happen is
sketched below.

Thanks.

Cheers

Jeff


On Wed, Mar 6, 2013 at 2:12 PM, Jeffrey Buell <jb...@vmware.com> wrote:

> Jeff,
>
> Probably because records are split across blocks, so some of the data has
> to be read twice. Assuming you have a 64 MB block size and 128 GB of data,
> I'd estimate the overhead at 1 GB for a 1 MB record size, and 32 GB for a
> 32 MB record size.  Your overhead is about 75% of that; maybe my arithmetic
> is off, or there is some intelligence in HDFS that reduces how often
> records are split across blocks (say, if a big record needs to be written
> but there is little space left in the current block; anybody know?).  Test
> this theory by increasing the HDFS block size to 128 MB or 256 MB, and then
> re-create or copy dataset2.  The overhead should be 1/2 or 1/4 of what it
> is now.
>
> Jeff
>
> ------------------------------
> *From: *"Jeff LI" <un...@gmail.com>
> *To: *user@hadoop.apache.org
> *Sent: *Wednesday, March 6, 2013 10:21:09 AM
> *Subject: *Difference between HDFS_BYTES_READ and the actual size of
> input files
>
>
> Dear Hadoop Users,
>
> I recently noticed there is a difference between the File System Counter
> "HDFS_BYTES_READ" and the actual size of the input files in map-reduce jobs.
> The difference seems to increase as the size of each key,value pair
> increases.
>
> For example, I'm running the same job on two datasets.  Both datasets are
> the same size, about 128 GB, and the keys are integers.  The difference
> between the two datasets is the number of key,value pairs and thus the size
> of each value: dataset1 has 2^17 key,value pairs with 1MB per value;
> dataset2 has 2^12 key,value pairs with 32MB per value.
> The HDFS_BYTES_READ counter is 128.77GB for dataset1 and 152.00GB for
> dataset2.
>
> I have also tested other value sizes.  There doesn't seem to be any
> difference when each value is small (<=128KB), but there is a noticeable
> difference as the size increases.
>
> Could you give me some idea of why this is happening?
>
> By the way, I'm running Hadoop 0.20.2-cdh3u5.
>
> Cheers
>
> Jeff
>
>
>

Re: Difference between HDFS_BYTES_READ and the actual size of input files

Posted by Jeffrey Buell <jb...@vmware.com>.
Jeff, 

Probably because records are split across blocks, so some of the data has to be read twice. Assuming you have a 64 MB block size and 128 GB of data, I'd estimate the overhead at 1 GB for a 1 MB record size, and 32 GB for a 32 MB record size. Your overhead is about 75% of that; maybe my arithmetic is off, or there is some intelligence in HDFS that reduces how often records are split across blocks (say, if a big record needs to be written but there is little space left in the current block; anybody know?). Test this theory by increasing the HDFS block size to 128 MB or 256 MB, and then re-create or copy dataset2. The overhead should be 1/2 or 1/4 of what it is now.
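
For what it's worth, the arithmetic behind that estimate is sketched below. It assumes roughly half a record is re-read at every block boundary, which is only a rough model of what the record reader actually does, so take it as an estimate rather than an exact accounting.

// Back-of-the-envelope sketch of the re-read overhead, assuming roughly
// half a record is read twice at each HDFS block boundary.
public class OverheadEstimate {
  public static void main(String[] args) {
    long dataBytes  = 128L << 30;              // ~128 GB of input
    long blockSize  = 64L << 20;               // assumed 64 MB block size
    long boundaries = dataBytes / blockSize;   // ~2048 block boundaries

    for (long recordMB : new long[] {1, 32}) {
      long recordBytes = recordMB << 20;
      long overhead = boundaries * recordBytes / 2;  // half a record per boundary
      System.out.printf("%2d MB records -> ~%d GB read twice%n",
          recordMB, overhead >> 30);
    }
    // 1 MB records  -> ~1 GB read twice
    // 32 MB records -> ~32 GB read twice
  }
}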

Jeff 

----- Original Message -----

From: "Jeff LI" <un...@gmail.com> 
To: user@hadoop.apache.org 
Sent: Wednesday, March 6, 2013 10:21:09 AM 
Subject: Difference between HDFS_BYTES_READ and the actual size of input files 

Dear Hadoop Users, 



I recently noticed there is a difference between the File System Counter "HDFS_BYTES_READ" and the actual size of the input files in map-reduce jobs. The difference seems to increase as the size of each key,value pair increases. 


For example, I'm running the same job on two datasets. Both datasets are the same size, about 128 GB, and the keys are integers. The difference between the two datasets is the number of key,value pairs and thus the size of each value: dataset1 has 2^17 key,value pairs with 1MB per value; dataset2 has 2^12 key,value pairs with 32MB per value. 
The HDFS_BYTES_READ counter is 128.77GB for dataset1 and 152.00GB for dataset2. 


I have also tested other value sizes. There doesn't seem to be any difference when each value is small (<=128KB), but there is a noticeable difference as the size increases. 


Could you give me some idea of why this is happening? 


By the way, I'm running Hadoop 0.20.2-cdh3u5. 


Cheers 

Jeff 


