Posted to user@hadoop.apache.org by "Ellis H. Wilson III" <el...@cse.psu.edu> on 2012/08/09 15:10:03 UTC

fs.local.block.size vs file.blocksize

Hi all!

Can someone please briefly explain the difference?  I do not see 
deprecated warnings for fs.local.block.size when I run with them set and 
I see two copies of RawLocalFileSystem.java (the other is 
local/RawLocalFs.java).

The things I really need to get answers to are:
1. Is the default boosted to 64MB from Hadoop 1.0 to Hadoop 2.0?  I 
believe it is, but want validation on that.
2. Which one controls shuffle block-size?
3. If I have a single machine non-distributed instance, and point it at 
file://, do both of these control the persistent data's block size or 
just one of them or what?
4. Is there any way to run with say a 512MB blocksize for the persistent 
data and the default 64MB blocksize for the shuffled data?

Thanks!

ellis

Re: fs.local.block.size vs file.blocksize

Posted by Harsh J <ha...@cloudera.com>.

Thanks for clarifying, Ellis. I'm sorry I assumed certain things when
replying here.

I looked at file.blocksize as well: it does absolutely nothing, is not
referred to by anything, and there is nothing we can do with it. We may
as well remove the tunable, or document it. Please do file a HADOOP JIRA
(once Apache JIRA is back up).

On Sun, Aug 12, 2012 at 11:10 PM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
> Many thanks to Eli and Harsh for their responses!  Comments in-line:
>
> On 08/12/2012 09:48 AM, Harsh J wrote:
>>
>> Hi Ellis,
>>
>> Note that in Hadoop-land, the term "block size" generally means the
>> chunking size of HDFS writers and readers, and that is not the same as
>> the FS term "block size" in any way.
>
> Yes, I do know that, but I was confused about something else.  More on that
> later in #2.
>
>> On Thu, Aug 9, 2012 at 6:40 PM, Ellis H. Wilson III <el...@cse.psu.edu>
>> wrote:
>>>
>>> Can someone please briefly explain the difference?  I do not see
>>> deprecated warnings for fs.local.block.size when I run with them set
>>> and I see two copies of RawLocalFileSystem.java (the other is
>>> local/RawLocalFs.java).
>>
>> The right param still seems to be "fs.local.block.size", when it comes
>> to using "getDefaultBlockSize" calls via the file:/// filesystems or
>> other filesystems that have not overridden the default behavior.
>
> This question was more out of curiosity than anything.  My experiments agree
> that "fs.local.block.size" is the right parameter for controlling the
> blocksize of file:///, but I'm still quite perplexed as to where
> file.blocksize is actually used.  I chased it around for a while in Eclipse
> last night, but have yet to see where it is directly referenced (keyconfigs
> sets it and suggests FileSystem, RawLocalFileSystem and ChecksumFileSystem
> all use it, but I don't see it being used in any practical way).
>
>>> The things I really need to get answers to are:
>>> 1. Is the default boosted to 64MB from Hadoop 1.0 to Hadoop 2.0?  I
>>> believe it is, but want validation on that.
>>
>> The dfs.blocksize, which applies to HDFS, has not changed from its 64
>> MB default.
>
> I was referring to RawLocalFileSystem, not DistributedFileSystem.  I am
> fairly certain from my tests and from the code I've dug through that the
> default blocksize is still 32MB at the moment.  Please note that my
> questions here are fairly unconcerned with HDFS, as I'm not using it at all
> in >75% of my tests.
>
>>> 2. Which one controls shuffle block-size?
>>
>> There is no "shuffle block-size", as shuffle goes to local filesystems
>> and those have no block size concept. Can you elaborate on this?
>
> This was a plain ol' misconception on my part, left over from when I
> started working in the Hadoop source just over a year back.  When I
> increased file:///'s blocksize, I saw performance increase in TeraGen but
> decrease in TeraSort (marked by an elongated shuffle phase), and I mistook
> that to suggest that the shuffling used the file:/// filesystem as well.
> I now understand why this can happen, and I appreciate you clarifying; my
> digging through the shuffle code has confirmed that, indeed, no chunking
> occurs on shuffle.  My apologies for the confusing question, based on
> errant inferences.
>
> Thanks again to both of you!  However, if anyone has better intuition on
> what the file.blocksize parameter does, I'd be happy to hear it.
>
> Best,
>
> ellis



-- 
Harsh J

Re: fs.local.block.size vs file.blocksize

Posted by "Ellis H. Wilson III" <el...@cse.psu.edu>.

Many thanks to Eli and Harsh for their responses!  Comments in-line:

On 08/12/2012 09:48 AM, Harsh J wrote:
> Hi Ellis,
>
> Note that in Hadoop-land, the term "block size" generally means the
> chunking size of HDFS writers and readers, and that is not the same as
> the FS term "block size" in any way.

Yes, I do know that, but I was confused about something else.  More on
that later in #2.

> On Thu, Aug 9, 2012 at 6:40 PM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
>> Can someone please briefly explain the difference?  I do not see deprecated
>> warnings for fs.local.block.size when I run with them set and I see two
>> copies of RawLocalFileSystem.java (the other is local/RawLocalFs.java).
>
> The right param still seems to be "fs.local.block.size", when it comes
> to using "getDefaultBlockSize" calls via the file:/// filesystems or
> other filesystems that have not overridden the default behavior.

This question was more out of curiosity than anything.  My experiments
agree that "fs.local.block.size" is the right parameter for controlling
the blocksize of file:///, but I'm still quite perplexed as to where
file.blocksize is actually used.  I chased it around for a while in
Eclipse last night, but have yet to see where it is directly referenced
(keyconfigs sets it and suggests FileSystem, RawLocalFileSystem and
ChecksumFileSystem all use it, but I don't see it being used in any
practical way).

>> The things I really need to get answers to are:
>> 1. Is the default boosted to 64MB from Hadoop 1.0 to Hadoop 2.0?  I believe
>> it is, but want validation on that.
>
> The dfs.blocksize, which applies to HDFS, has not changed from its 64
> MB default.

I was referring to RawLocalFileSystem, not DistributedFileSystem.  I am
fairly certain from my tests and from the code I've dug through that the
default blocksize is still 32MB at the moment.  Please note that my
questions here are fairly unconcerned with HDFS, as I'm not using it at
all in >75% of my tests.

>> 2. Which one controls shuffle block-size?
>
> There is no "shuffle block-size", as shuffle goes to local filesystems
> and those have no block size concept. Can you elaborate on this?

This was a plain ol' misconception on my part, left over from when I
started working in the Hadoop source just over a year back.  When I
increased file:///'s blocksize, I saw performance increase in TeraGen
but decrease in TeraSort (marked by an elongated shuffle phase), and I
mistook that to suggest that the shuffling used the file:/// filesystem
as well.  I now understand why this can happen, and I appreciate you
clarifying; my digging through the shuffle code has confirmed that,
indeed, no chunking occurs on shuffle.  My apologies for the confusing
question, based on errant inferences.
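
The thread does not spell out why a larger file:/// blocksize can change
job behaviour even though shuffle does no chunking. One plausible
mechanism, stated here purely as an assumption, is that the input format
derives its split size from the block size the filesystem reports, so a
larger value yields fewer, larger map tasks. The sketch below models a
rule of the form max(minSize, min(maxSize, blockSize)); the class and
method names are made up for illustration and this is not the Hadoop
implementation.

```java
// Hypothetical helper, not Hadoop code: models a split-size rule of the
// form max(minSize, min(maxSize, blockSize)) and the resulting task count.
final class SplitEstimate {
    /** Split size under the assumed rule; all sizes are in bytes. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits needed to cover fileSize, i.e. ceil(fileSize / splitSize). */
    static long splitCount(long fileSize, long splitSize) {
        if (fileSize < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and splitSize > 0");
        }
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long input = 10L * 1024L * mb; // a 10 GB input, chosen only for illustration

        // Compare the 32 MB local default mentioned in the thread with 512 MB.
        for (long blockSize : new long[] {32 * mb, 512 * mb}) {
            long split = splitSize(blockSize, 1L, Long.MAX_VALUE);
            System.out.println((blockSize / mb) + " MB block size -> "
                    + splitCount(input, split) + " splits");
        }
    }
}
```

Under these assumptions a 10 GB input goes from 320 splits at 32 MB to 20
splits at 512 MB, which changes how many map outputs the shuffle has to
fetch and merge without any chunking happening in the shuffle itself.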

Thanks again to both of you!  However, if anyone has better intuition on 
what the file.blocksize parameter does, I'd be happy to hear it.

Best,

ellis

Re: fs.local.block.size vs file.blocksize

Posted by Harsh J <ha...@cloudera.com>.

Hi Ellis,

Note that in Hadoop-land, the term "block size" generally means the
chunking size of HDFS writers and readers, and that is not the same as
the FS term "block size" in any way.
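
To make "chunking size" concrete, here is a minimal sketch of how a file
of a given length would be cut into fixed-size blocks. The BlockLayout
class and its chunk method are hypothetical names used only for
illustration; they are not part of Hadoop, and real HDFS block placement
involves far more than this arithmetic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: models how a writer that chunks
// a file at a fixed block size would lay out its blocks.
final class BlockLayout {
    /** Returns {offset, length} pairs covering fileSize bytes in blockSize chunks. */
    static List<long[]> chunk(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // The final block may be shorter than blockSize.
            long length = Math.min(blockSize, fileSize - offset);
            blocks.add(new long[] {offset, length});
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long[] block : chunk(200 * mb, 64 * mb)) {
            System.out.println("offset=" + block[0] + " length=" + block[1]);
        }
    }
}
```

With a 200 MB file and a 64 MB block size this yields four blocks, the
last one 8 MB long.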

On Thu, Aug 9, 2012 at 6:40 PM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
> Hi all!
>
> Can someone please briefly explain the difference?  I do not see deprecated
> warnings for fs.local.block.size when I run with them set and I see two
> copies of RawLocalFileSystem.java (the other is local/RawLocalFs.java).

The right param still seems to be "fs.local.block.size", when it comes
to using "getDefaultBlockSize" calls via the file:/// filesystems or
other filesystems that have not overridden the default behavior.
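
The "configured value, else built-in default" behaviour described above
can be modelled without any Hadoop classes. The sketch below uses
java.util.Properties as a stand-in for Hadoop's Configuration; the
LocalBlockSizeLookup class is hypothetical, and the 32 MB fallback is
taken from the figure reported elsewhere in this thread rather than from
the Hadoop source.

```java
import java.util.Properties;

// Illustrative model only: java.util.Properties stands in for Hadoop's
// Configuration to show a "configured value, else default" lookup.
final class LocalBlockSizeLookup {
    // Key name as discussed in the thread.
    static final String KEY = "fs.local.block.size";
    // Assumption: 32 MB fallback, the local default reported in the thread.
    static final long FALLBACK_BYTES = 32L * 1024 * 1024;

    /** Returns the configured block size in bytes, or the fallback if unset. */
    static long defaultBlockSize(Properties conf) {
        String raw = conf.getProperty(KEY);
        return raw == null ? FALLBACK_BYTES : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(defaultBlockSize(conf)); // nothing set: fallback is used

        conf.setProperty(KEY, String.valueOf(512L * 1024 * 1024));
        System.out.println(defaultBlockSize(conf)); // configured value wins

        // In this model, setting the other key has no effect on the lookup.
        conf.setProperty("file.blocksize", "1");
        System.out.println(defaultBlockSize(conf));
    }
}
```

The point of the model is only that one key is consulted and the other
is ignored by this particular lookup.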

> The things I really need to get answers to are:
> 1. Is the default boosted to 64MB from Hadoop 1.0 to Hadoop 2.0?  I believe
> it is, but want validation on that.

The dfs.blocksize, which applies to HDFS, has not changed from its 64
MB default.

> 2. Which one controls shuffle block-size?

There is no "shuffle block-size", as shuffle goes to local filesystems
and those have no block size concept. Can you elaborate on this?

> 3. If I have a single machine non-distributed instance, and point it at
> file://, do both of these control the persistent data's block size or just
> one of them or what?

LocalFileSystem does not chunk files into blocks. It writes/reads
regular files as you would in any language.
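
As a point of comparison, this is what "regular files as you would in
any language" looks like with plain JDK calls. It is a minimal sketch
that uses no Hadoop classes; the PlainLocalFile name is made up for
illustration, and it says nothing about how LocalFileSystem is
implemented internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain JDK file I/O, no Hadoop classes: the data lands in one ordinary
// file, and no block size is involved at the application level.
final class PlainLocalFile {
    /** Writes text to a temporary file, reads it back, and deletes the file. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("local-fs-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello, local filesystem"));
    }
}
```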

> 4. Is there any way to run with say a 512MB blocksize for the persistent
> data and the default 64MB blocksize for the shuffled data?

See (2).

> Thanks!

Do let us know if you have further questions.

> ellis

-- 
Harsh J

Re: fs.local.block.size vs file.blocksize

Posted by Bejoy Ks <be...@gmail.com>.

Hi Rahul,

It is better to start a new thread than to hijack someone else's. :) It helps
keep the mailing list archives clean.

To learn Java, you need to get some Java books and start from there.

If you just want to run the wordcount example, follow the steps at the URL below:
http://wiki.apache.org/hadoop/WordCount

To understand the workings in more detail, I scribbled something down a
long time back; maybe it can help you get started:
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html


Regards
Bejoy KS

Re: fs.local.block.size vs file.blocksize

Posted by rahul p <ra...@gmail.com>.

Hi Tariq,
I am trying to start with the wordcount MapReduce example, but I do not know
how or where to start.
I am very new to Java.
Can you help me with how to work with this? Any help will be appreciated.


Hi All,
Please help me start with Hadoop on CDH; I have installed it on my local PC.
Any help will be appreciated.

On Thu, Aug 9, 2012 at 9:10 PM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:

> Hi all!
>
> Can someone please briefly explain the difference?  I do not see
> deprecated warnings for fs.local.block.size when I run with them set and I
> see two copies of RawLocalFileSystem.java (the other is
> local/RawLocalFs.java).
>
> The things I really need to get answers to are:
> 1. Is the default boosted to 64MB from Hadoop 1.0 to Hadoop 2.0?  I
> believe it is, but want validation on that.
> 2. Which one controls shuffle block-size?
> 3. If I have a single machine non-distributed instance, and point it at
> file://, do both of these control the persistent data's block size or just
> one of them or what?
> 4. Is there any way to run with say a 512MB blocksize for the persistent
> data and the default 64MB blocksize for the shuffled data?
>
> Thanks!
>
> ellis
>

Re: fs.local.block.size vs file.blocksize

Posted by Harsh J <ha...@cloudera.com>.

Hi Ellis,

Note that in Hadoop-land, the term "block size" generally means the
chunking size of HDFS writers and readers, which is not the same as
the filesystem term "block size" in any way.

On Thu, Aug 9, 2012 at 6:40 PM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
> Hi all!
>
> Can someone please briefly explain the difference?  I do not see deprecated
> warnings for fs.local.block.size when I run with them set and I see two
> copies of RawLocalFileSystem.java (the other is local/RawLocalFs.java).

The right param still seems to be "fs.local.block.size" when it comes
to "getDefaultBlockSize" calls via the file:/// filesystem or
other filesystems that have not overridden the default behavior.

> The things I really need to get answers to are:
> 1. Is the default boosted to 64MB from Hadoop 1.0 to Hadoop 2.0?  I believe
> it is, but want validation on that.

The dfs.blocksize, which applies to HDFS, has not changed from its 64
MB default.

> 2. Which one controls shuffle block-size?

There is no "shuffle block-size", as shuffle data goes to the local
filesystem, which has no concept of a block size. Can you elaborate on this?

> 3. If I have a single machine non-distributed instance, and point it at
> file://, do both of these control the persistent data's block size or just
> one of them or what?

LocalFileSystem does not chunk files into blocks. It writes/reads
regular files as you would in any language.

> 4. Is there any way to run with say a 512MB blocksize for the persistent
> data and the default 64MB blocksize for the shuffled data?

See (2).

> Thanks!

Do let us know if you have further questions.

> ellis

-- 
Harsh J
