Posted to user@hbase.apache.org by Marc Sturlese <ma...@gmail.com> on 2012/08/24 21:26:48 UTC

RS, TT, shared DN and good performance on random HBase reads.

Hey there,
I am wondering whether this is good practice:
I have a 10-node cluster running datanodes and tasktrackers, with MR jobs
running continuously.
My replication factor is 3.
I need to put the results of a couple of jobs into HBase tables to be able
to do random-seek searches. The HBase tables would be almost read-only,
with just a few additions. They would essentially act as a view and would
be rebuilt every 5 hours. I want to minimize the impact of the cluster's
MR jobs on the random HBase reads. My idea is:
-Keep 10 nodes with datanodes and tasktrackers
-Add 2 nodes (the data to store in HBase is small compared to all the data
in the cluster) with a datanode and an RS
-Run a bulk import creating HFiles (for a pre-split table) and then
manually trigger a major compaction (automatic compactions would be
disabled); see the sketch after this list

The reasons for this plan would be:
-After running a full compaction, the HFiles end up on the RS nodes, so I
would achieve data locality.
-As I have a replication factor of 3 and just 2 HBase nodes, I know that
no map task would try to read from the RS nodes. The reduce tasks write
first to the node where they run (which would never be an RS node).
-So, on the RS nodes I would end up with the HBase tables plus block
replicas from the MR jobs that will never be read (since maps use data
locality and at least one replica of each block will be on an MR node)

In case this works, if I add more nodes with an RS and a datanode, can I
guarantee that no map task will ever read from them? (This assumes that a
reduce task always writes first to the node where it runs; please correct
me if I'm wrong, as I'm not sure about this.)

I've probably made some wrong assumptions here. Would this be a good way
to achieve my goal? If not, any advice would be welcome (short of
splitting into 2 different clusters).


Re: RS, TT, shared DN and good performance on random HBase reads.

Posted by Harsh J <ha...@cloudera.com>.
Yes. What I meant was a low number of slots on these TTs alone (those
co-located with an RS, if you want to do that), by configuring a low
maximum number of map and reduce slots specifically on them. Or, if you
use MR2 over YARN, you would instead limit the NodeManager's maximum
memory usage.
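
A minimal sketch of the knobs involved (the values are illustrative
assumptions, not recommendations; in practice these properties go into
mapred-site.xml and yarn-site.xml on the co-located nodes rather than
into application code):

    import org.apache.hadoop.conf.Configuration;

    public class LowSlotTaskTrackerConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // MR1: cap the per-TT slot counts on the nodes that also run an
        // RS, so MR tasks cannot starve the RegionServer of CPU or memory.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1);
        // MR2 over YARN: cap the NodeManager's total allocatable memory
        // instead of slot counts.
        conf.setInt("yarn.nodemanager.resource.memory-mb", 4096);
      }
    }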

On Sat, Aug 25, 2012 at 10:49 PM, Adrien Mogenet
<ad...@gmail.com> wrote:
> How would you define "low slotted"?
> A reduced scheduling capacity, to avoid a high number of mappers?
>
> On Sat, Aug 25, 2012 at 3:32 PM, Harsh J <ha...@cloudera.com> wrote:
>> Hi Marc,
>>
>> On Sat, Aug 25, 2012 at 12:56 AM, Marc Sturlese <ma...@gmail.com> wrote:
>>> The reasons for this plan would be:
>>> -After running a full compaction, the HFiles end up on the RS nodes, so I
>>> would achieve data locality.
>>> -As I have a replication factor of 3 and just 2 HBase nodes, I know that
>>> no map task would try to read from the RS nodes. The reduce tasks write
>>> first to the node where they run (which would never be an RS node).
>>> -So, on the RS nodes I would end up with the HBase tables plus block
>>> replicas from the MR jobs that will never be read (since maps use data
>>> locality and at least one replica of each block will be on an MR node)
>>
>> Just to keep in mind: all HBase read/write requests go through the
>> RS. The HDFS blocks held by an RS are never read directly by any
>> client (the RS is THE data server for HBase clients).
>>
>>> In case this works, if I add more nodes with an RS and a datanode, can I
>>> guarantee that no map task will ever read from them? (This assumes that a
>>> reduce task always writes first to the node where it runs; please correct
>>> me if I'm wrong, as I'm not sure about this.)
>>
>> Yes, you can guarantee this to a certain extent. If data locality is
>> absent for some tasks (due to scheduling constraints), a few blocks
>> may be read from the RS nodes' DNs, but that shouldn't have a big
>> impact, since a good MR scheduler usually helps avoid it.
>>
>> Alternatively, you can also consider running TTs with a low slot
>> count on the RS machines, to use them in a safer way.
>>
>> --
>> Harsh J
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me



-- 
Harsh J

Re: RS, TT, shared DN and good performance on random HBase reads.

Posted by Adrien Mogenet <ad...@gmail.com>.
How would you define "low slotted"?
A reduced scheduling capacity, to avoid a high number of mappers?

On Sat, Aug 25, 2012 at 3:32 PM, Harsh J <ha...@cloudera.com> wrote:
> Hi Marc,
>
> On Sat, Aug 25, 2012 at 12:56 AM, Marc Sturlese <ma...@gmail.com> wrote:
>> The reasons for this plan would be:
>> -After running a full compaction, the HFiles end up on the RS nodes, so I
>> would achieve data locality.
>> -As I have a replication factor of 3 and just 2 HBase nodes, I know that
>> no map task would try to read from the RS nodes. The reduce tasks write
>> first to the node where they run (which would never be an RS node).
>> -So, on the RS nodes I would end up with the HBase tables plus block
>> replicas from the MR jobs that will never be read (since maps use data
>> locality and at least one replica of each block will be on an MR node)
>
> Just to keep in mind: all HBase read/write requests go through the
> RS. The HDFS blocks held by an RS are never read directly by any
> client (the RS is THE data server for HBase clients).
>
>> In case this works, if I add more nodes with an RS and a datanode, can I
>> guarantee that no map task will ever read from them? (This assumes that a
>> reduce task always writes first to the node where it runs; please correct
>> me if I'm wrong, as I'm not sure about this.)
>
> Yes, you can guarantee this to a certain extent. If data locality is
> absent for some tasks (due to scheduling constraints), a few blocks
> may be read from the RS nodes' DNs, but that shouldn't have a big
> impact, since a good MR scheduler usually helps avoid it.
>
> Alternatively, you can also consider running TTs with a low slot
> count on the RS machines, to use them in a safer way.
>
> --
> Harsh J



-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me

Re: RS, TT, shared DN and good performance on random HBase reads.

Posted by Harsh J <ha...@cloudera.com>.
Hi Marc,

On Sat, Aug 25, 2012 at 12:56 AM, Marc Sturlese <ma...@gmail.com> wrote:
> The reasons for this plan would be:
> -After running a full compaction, the HFiles end up on the RS nodes, so I
> would achieve data locality.
> -As I have a replication factor of 3 and just 2 HBase nodes, I know that
> no map task would try to read from the RS nodes. The reduce tasks write
> first to the node where they run (which would never be an RS node).
> -So, on the RS nodes I would end up with the HBase tables plus block
> replicas from the MR jobs that will never be read (since maps use data
> locality and at least one replica of each block will be on an MR node)

Just to keep in mind: all HBase read/write requests go through the
RS. The HDFS blocks held by an RS are never read directly by any
client (the RS is THE data server for HBase clients).
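
For instance, a random read from the Java client always goes to the
owning RS, never to HDFS directly. A sketch with the 0.9x-era client API
(the table and row names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client locates the owning region via META and sends the Get
        // to that RegionServer; it never reads HFile blocks from HDFS.
        HTable table = new HTable(conf, "results_view");
        Result r = table.get(new Get(Bytes.toBytes("some-row")));
        System.out.println(r);
        table.close();
      }
    }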

> In case this works, if I add more nodes with an RS and a datanode, can I
> guarantee that no map task will ever read from them? (This assumes that a
> reduce task always writes first to the node where it runs; please correct
> me if I'm wrong, as I'm not sure about this.)

Yes, you can guarantee this to a certain extent. If data locality is
absent for some tasks (due to scheduling constraints), a few blocks
may be read from the RS nodes' DNs, but that shouldn't have a big
impact, since a good MR scheduler usually helps avoid it.

Alternatively, you can also consider running TTs with a low slot
count on the RS machines, to use them in a safer way.

-- 
Harsh J