Posted to mapreduce-user@hadoop.apache.org by Kiyoshi Mizumaru <ki...@gmail.com> on 2010/05/14 08:38:10 UTC

TestDFSIO writes files on HDFS with wrong block size?

Hi all, this is my first post to this list; if this isn't the
appropriate place, please let me know.


I have just created a Hadoop instance and its HDFS is configured as:
  dfs.replication = 1
  dfs.block.size = 536870912 (512MB)

Then I typed the following command to run TestDFSIO against this instance:
  % hadoop jar hadoop-*-test.jar TestDFSIO -write -nrFiles 1 -fileSize 1024

A 1024MB file should consist of 2 blocks of 512MB each, but the
filesystem browser shows that /benchmarks/TestDFSIO/io_data/test_io_0
consists of 16 blocks of 64MB each with replication 3, so 48 block
replicas are displayed in total.

This is not what I expected; does anyone know what's wrong?
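
For reference, the same block layout can be confirmed from the command
line; a rough sketch (fsck output formatting may vary slightly by
version), which here should report something like:

$ hadoop fsck /benchmarks/TestDFSIO/io_data/test_io_0 -files -blocks
$ hadoop fsck /benchmarks/TestDFSIO/io_data/test_io_0 | grep -i 'total blocks'
 Total blocks (validated):      16 (avg. block size 67108864 B)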

I'm using Cloudera's Distribution for Hadoop (hadoop-0.20-0.20.2+228-1)
with Sun Java 6 (jdk-6u19-linux-amd64).  Thanks in advance, and sorry
for my poor English; I'm still learning it.
--
Kiyoshi

Re: TestDFSIO writes files on HDFS with wrong block size?

Posted by Kiyoshi Mizumaru <ki...@gmail.com>.
dfs -put works as you explained in our environment.
So my question is: what is the difference between TestDFSIO and the other clients?

$ ls -lF large.txt
-rw-r--r-- 1 maru sssdev 250573905 Apr 26 18:48 large.txt
$ hadoop dfs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2010-05-27 20:01 /localdisk
$ hadoop dfs -put large.txt /large.txt
$ hadoop dfs -Ddfs.block.size=10240 -put large.txt /large.txt.2
$ hadoop fsck /large.txt | grep -i 'total blocks'
 Total blocks (validated):      1 (avg. block size 250573905 B)
$ hadoop fsck /large.txt.2 | grep -i 'total blocks'
 Total blocks (validated):      24471 (avg. block size 10239 B)
$
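
In case it is useful, the block size and replication of a single file
can also be printed directly; a sketch, assuming this version's
dfs -stat supports the %o (block size) and %r (replication) format
fields:

$ hadoop dfs -stat 'blocksize=%o replication=%r' /large.txt.2

For the file created above, this should print blocksize=10240.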


Re: TestDFSIO writes files on HDFS with wrong block size?

Posted by Kiyoshi Mizumaru <ki...@gmail.com>.
Thank you for your reply.

In that case, I wonder why it does not work as I expected on our Hadoop.
As far as I have tested, the options did not work with TestDFSIO as you explained.
I'll check whether they work with the dfs -put command on our Hadoop instance next week.


Re: TestDFSIO writes files on HDFS with wrong block size?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Kiyoshi,

Block size is set by the client, so there is no need to restart,
reformat, or change the server-side configs.

$ ls -l testfile.txt
-rw-r--r-- 1 knoguchi users 202145 May  1  2009 testfile.txt
$ hadoop dfs -put testfile.txt /user/knoguchi/testfile.txt
$ hadoop dfs -Ddfs.block.size=10240 -put testfile.txt /user/knoguchi/testfile2.txt
$ hadoop fsck /user/knoguchi/testfile.txt | grep "Total blocks"
 Total blocks (validated):      1 (avg. block size 202145 B)
$ hadoop fsck /user/knoguchi/testfile2.txt | grep "Total blocks"
 Total blocks (validated):      20 (avg. block size 10107 B)
$ 

Koji


Re: TestDFSIO writes files on HDFS with wrong block size?

Posted by Kiyoshi Mizumaru <ki...@gmail.com>.
Unfortunately it does not work as I expected.

I cleaned up the previous Hadoop instance's data by removing all of the
files and directories under dfs.name.dir and dfs.data.dir, and formatting
a new HDFS with hadoop namenode -format gave me a fresh Hadoop instance,
as expected.

It seems that changing the configuration files and formatting HDFS (and
restarting all daemons, of course) is not enough to change the
replication and block size. Is that correct?
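
One sanity check that might help isolate this (a sketch reusing only
commands from this thread; some-local-file.txt and /blocksize-check.txt
are just hypothetical names): write a file with a plain dfs -put, with
no -D options, and fsck it. If it comes out with 512MB blocks, the
client is picking up the configured defaults and the problem is
specific to TestDFSIO.

$ hadoop dfs -put some-local-file.txt /blocksize-check.txt
$ hadoop fsck /blocksize-check.txt | grep -i 'total blocks'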


Splitting support for bz2 format

Posted by Deepika Khera <De...@avg.com>.
Hi,

Is there a patch to support splitting of the bzip2 format for the current stable version, 0.20.2? Please refer to the JIRA below:

https://issues.apache.org/jira/browse/HADOOP-4012

Thanks,
Deepika

Re: TestDFSIO writes files on HDFS with wrong block size?

Posted by Kiyoshi Mizumaru <ki...@gmail.com>.
Hi Koji,

Thank you for your reply.
I'll try what you wrote and see if it works as expected.

By the way, what does `client-side config' mean?
dfs.replication and dfs.block.size are currently written in conf/hdfs-site.xml.
Where should I put them instead?
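
My guess is that it means the conf/ directory read by the hadoop
command on the machine where I run it, i.e. an excerpt like the
following in that machine's conf/hdfs-site.xml (a sketch using the
property names from above; whether TestDFSIO actually picks these up
is exactly what I am unsure about):

<property>
  <name>dfs.block.size</name>
  <value>536870912</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>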


Re: TestDFSIO writes files on HDFS with wrong block size?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Hi Kiyoshi,

In case you haven't received a reply, try

hadoop jar hadoop-*-test.jar TestDFSIO -Ddfs.block.size=536870912 -Ddfs.replication=1 ...

If that works, add them as part of your client-side config.
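
To check whether the options took effect, a quick sketch (the path
below is TestDFSIO's default output file):

$ hadoop jar hadoop-*-test.jar TestDFSIO -Ddfs.block.size=536870912 -Ddfs.replication=1 -write -nrFiles 1 -fileSize 1024
$ hadoop fsck /benchmarks/TestDFSIO/io_data/test_io_0 | grep -i 'total blocks'

If the options are honored, fsck should report 2 blocks with an average
block size of 536870912 B.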

Koji

