Posted to user@hbase.apache.org by Bryan Keller <br...@gmail.com> on 2012/02/17 22:21:44 UTC

Re: Is hadoop 1.0.0 + HBase 0.90.5 the best combination for production cluster?

I have been experimenting with local reads. For me, enabling them did not improve read performance at all; I get the same performance either way. I can see in the data node logs that it is passing back the local path, so it is enabled properly.

Perhaps the benefits of local reads are dependent on the type of data and the workload? In my test I'm scanning through the entire table via a map reduce job. It's a wide table with maybe 20k columns per row on average. I have scanner caching set to 10.

My read performance is about 10% of the disks' maximum read throughput, i.e. my disks get 100 MB/sec when tested with hdparm, and scan performance is about 10 MB/sec. Not too bad, I suppose.
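
As a rough check on the figures above, the following sketch works out the disk utilization and the approximate number of cells returned per scanner call. It assumes that scanner caching is the number of rows fetched per call; the class and method names are made up for illustration, and nothing here calls HBase.

```java
// Back-of-the-envelope arithmetic on the numbers quoted in the message.
// Hypothetical helper class; it does not use any HBase API.
final class ScanThroughputEstimate {

    /** Fraction of the raw disk throughput that the scan achieves. */
    static double utilization(double scanMbPerSec, double diskMbPerSec) {
        return scanMbPerSec / diskMbPerSec;
    }

    /** Approximate cells per scanner call: cached rows times average columns per row. */
    static long cellsPerScannerCall(int scannerCaching, long avgColumnsPerRow) {
        return scannerCaching * avgColumnsPerRow;
    }

    public static void main(String[] args) {
        // 10 MB/sec scan rate against 100 MB/sec measured with hdparm.
        System.out.printf("utilization: %.0f%%%n", 100 * utilization(10, 100));
        // Scanner caching of 10 and roughly 20k columns per row.
        System.out.println("cells per scanner call: " + cellsPerScannerCall(10, 20_000));
    }
}
```

With rows this wide, even a small caching value already moves a large number of cells per call, so the caching setting alone may not explain the gap to raw disk speed.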

On Jan 8, 2012, at 6:35 PM, Zizon Qiu wrote:

> It should be the same as the hbase daemon user.
> 
> The check performed by the datanode is implemented as follows, inside an RPC call.
> The "current user" refers to the remote user, which in this case should be the same
> as your hbase user.
> 
>  private void checkBlockLocalPathAccess() throws IOException {
>    checkKerberosAuthMethod("getBlockLocalPathInfo()");
>    String currentUser =
>        UserGroupInformation.getCurrentUser().getShortUserName();
>    if (!currentUser.equals(this.userWithLocalPathAccess)) {
>      throw new AccessControlException(
>          "Can't continue with getBlockLocalPathInfo() "
>              + "authorization. The user " + currentUser
>              + " is not allowed to call getBlockLocalPathInfo");
>    }
>  }
>
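
The quoted check boils down to a string comparison between the caller's short user name and one configured user. The sketch below is a plain-Java stand-in for that logic with no Hadoop classes involved. The property name `dfs.block.local-path-access.user` is, as far as I know, the datanode setting behind `userWithLocalPathAccess` in Hadoop 1.0.0, but treat it as an assumption and verify it against your release. The class name, method name, and the user names are hypothetical.

```java
import java.util.Properties;

// Stand-in for the datanode-side check quoted above; plain Java, no Hadoop classes.
final class LocalPathAccessSketch {

    // ASSUMPTION: property name as understood for Hadoop 1.0.0; verify for your release.
    static final String ALLOWED_USER_KEY = "dfs.block.local-path-access.user";

    /** True only when the caller's short name equals the single configured user. */
    static boolean isAllowed(Properties dataNodeConf, String callerShortName) {
        String allowedUser = dataNodeConf.getProperty(ALLOWED_USER_KEY);
        // Mirrors currentUser.equals(userWithLocalPathAccess): an unset property denies everyone.
        return callerShortName != null && callerShortName.equals(allowedUser);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(ALLOWED_USER_KEY, "hbase"); // hypothetical daemon user name
        System.out.println("hbase allowed:  " + isAllowed(conf, "hbase"));
        System.out.println("mapred allowed: " + isAllowed(conf, "mapred"));
    }
}
```

This is why the configured user has to match the user the region server runs as: any other caller fails the comparison and is refused the local path.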

Re: Is hadoop 1.0.0 + HBase 0.90.5 the best combination for production cluster?

Posted by Jean-Daniel Cryans <jd...@apache.org>.

The gist of the answer is that, unlike random reads, the blocks we
read sequentially from the fs are wholly consumed, so you end up making
fewer fs calls and a smaller proportion of the total time is spent talking
to datanodes (which is the part that local reads help with).

Also, the dfs client keeps a block reader open so that every time you
read from the same hdfs block it doesn't have to set up the socket to
the datanode again (which is what random reading does if you don't
set up local reads).

J-D
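
The argument above can be made concrete with a toy model: compare the share of time spent on per-read socket setup when one setup covers a whole hdfs block (sequential) with the share when every small read needs its own setup (random, without local reads). This is only a sketch under stated assumptions. The 1 ms setup cost is an invented figure, the 64 MB and 64 KB sizes are the usual hdfs and HBase block size defaults, the 100 MB/sec rate is the hdparm number from this thread, and disk seeks are ignored entirely.

```java
// Toy model of time spent setting up datanode connections versus moving bytes.
// Hypothetical class; the numbers in main are illustrative assumptions, not measurements.
final class DatanodeOverheadModel {

    /** Share of total time that goes to connection setup rather than data transfer. */
    static double setupShare(long setups, double setupSeconds,
                             double totalBytes, double bytesPerSecond) {
        double setupTime = setups * setupSeconds;
        double transferTime = totalBytes / bytesPerSecond;
        return setupTime / (setupTime + transferTime);
    }

    public static void main(String[] args) {
        double throughput = 100e6;                 // 100 MB/sec, from the hdparm test
        double setupSeconds = 0.001;               // ASSUMPTION: 1 ms per socket setup
        double hdfsBlock = 64.0 * 1024 * 1024;     // 64 MB hdfs block
        long randomReads = 1024;                   // 1024 reads of 64 KB cover the same bytes

        // Sequential: the block reader stays open, one setup for the whole hdfs block.
        double sequential = setupShare(1, setupSeconds, hdfsBlock, throughput);
        // Random without local reads: one setup per 64 KB read.
        double random = setupShare(randomReads, setupSeconds, hdfsBlock, throughput);

        System.out.printf("sequential setup share: %.4f%n", sequential);
        System.out.printf("random setup share:     %.4f%n", random);
    }
}
```

Under these assumptions the setup cost is a negligible slice of a sequential scan but a large slice of random reads, which is consistent with local reads helping the latter and not the former.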

On Fri, Feb 17, 2012 at 1:48 PM, Bryan Keller <br...@gmail.com> wrote:
> I was thinking (wrongly it seems) that having the region server read directly from the local file system would be faster than going through the data node, even with sequential access.
>
> On Feb 17, 2012, at 1:28 PM, Jean-Daniel Cryans wrote:
>
>> On Fri, Feb 17, 2012 at 1:21 PM, Bryan Keller <br...@gmail.com> wrote:
>>> I have been experimenting with local reads. For me, enabling did not help improve read performance at all, I get the same performance either way. I can see in the data node logs it is passing back the local path, so it is enabled properly.
>>
>> I was surprised when I read this until I saw this:
>>
>>>
>>> Perhaps the benefits of local reads are dependent on the type of data and the workload? In my test I'm scanning through the entire table via a map reduce job. It's a wide table with maybe 20k columns per row on average. I have scanner caching set to 10.
>>
>> It's definitely not going to help make sequential reads faster.
>>
>>>
>>> My read performance is about 10% of the disk max read throughput, i.e. my disks can get 100 mb/sec tested with hdparm and scan performance is about 10 mb/sec. Not too bad I suppose.
>>
>> Maybe you're not pushing it enough?
>>
>> J-D
>

Re: Is hadoop 1.0.0 + HBase 0.90.5 the best combination for production cluster?

Posted by Bryan Keller <br...@gmail.com>.

I was thinking (wrongly it seems) that having the region server read directly from the local file system would be faster than going through the data node, even with sequential access.

On Feb 17, 2012, at 1:28 PM, Jean-Daniel Cryans wrote:

> On Fri, Feb 17, 2012 at 1:21 PM, Bryan Keller <br...@gmail.com> wrote:
>> I have been experimenting with local reads. For me, enabling did not help improve read performance at all, I get the same performance either way. I can see in the data node logs it is passing back the local path, so it is enabled properly.
> 
> I was surprised when I read this until I saw this:
> 
>> 
>> Perhaps the benefits of local reads are dependent on the type of data and the workload? In my test I'm scanning through the entire table via a map reduce job. It's a wide table with maybe 20k columns per row on average. I have scanner caching set to 10.
> 
> It's definitely not going to help make sequential reads faster.
> 
>> 
>> My read performance is about 10% of the disk max read throughput, i.e. my disks can get 100 mb/sec tested with hdparm and scan performance is about 10 mb/sec. Not too bad I suppose.
> 
> Maybe you're not pushing it enough?
> 
> J-D

Re: Is hadoop 1.0.0 + HBase 0.90.5 the best combination for production cluster?

Posted by Jean-Daniel Cryans <jd...@apache.org>.

On Fri, Feb 17, 2012 at 1:21 PM, Bryan Keller <br...@gmail.com> wrote:
> I have been experimenting with local reads. For me, enabling did not help improve read performance at all, I get the same performance either way. I can see in the data node logs it is passing back the local path, so it is enabled properly.

I was surprised when I read this until I saw this:

>
> Perhaps the benefits of local reads are dependent on the type of data and the workload? In my test I'm scanning through the entire table via a map reduce job. It's a wide table with maybe 20k columns per row on average. I have scanner caching set to 10.

It's definitely not going to help make sequential reads faster.

>
> My read performance is about 10% of the disk max read throughput, i.e. my disks can get 100 mb/sec tested with hdparm and scan performance is about 10 mb/sec. Not too bad I suppose.

Maybe you're not pushing it enough?

J-D