Posted to common-user@hadoop.apache.org by Yossi Ittach <yo...@gmail.com> on 2008/12/14 09:18:40 UTC

Building an Hadoop & Hbase Cluster

Hi

I want to build a little cluster of HBase & Hadoop. Strangely, I couldn't
find any recommendations on the web.
If you could share your experience, it would be great, especially with
regard to these questions:

1. How many RegionServers do I need for X DataNodes?
2. Can I (and should I?) run a RegionServer and a DataNode on the same
machine?
3. Can I run the NameNode and the HBaseMaster on the same machine?
4. Can I run a Secondary Name Node and a DataNode on the same machine?

Thanks!


Vale et me ama
Yossi

Re: Suggestions of proper usage of "key" parameter ?

Posted by Owen O'Malley <om...@apache.org>.
On Dec 15, 2008, at 9:06 AM, Ricky Ho wrote:

> Choice 1:  Emit one entry in the reduce(), using doc_name as key
> ==================================================================
> In this case, the output will still be a single entry per invocation  
> of reduce() ...
>
> {
>  "key":"file1",
>  "value":[["apple", 25],
>           ["orange", 16]]
> }

In general, you don't want to gather the inputs in memory, because that
limits the scalability of your application to files whose word counts fit
in memory. For word counts it isn't a problem, but I've seen applications
that take this approach end up with single records as large as 500 MB.

> Choice 2:  Emit multiple entries in the reduce(), using [doc_name, word] as key
> ================================================================================
> In this case, the output will have multiple entries per invocation  
> of reduce() ...
>
> {
>  "key":["file1", "apple"],
>  "value":25
> }
>
> {
>  "key":["file1", "orange"],
>  "value":16
> }

This looks pretty reasonable to me, especially if the partitioning is on
both the filename and the word, so that the load across the reduces is
relatively even.
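
A minimal sketch of the approach Owen is describing, using the old
org.apache.hadoop.mapred API this thread is built on (the class name and
the "doc_name\tword" composite key layout are hypothetical, not from the
original posts): if the map output key already contains both the document
name and the word, the default HashPartitioner hashes both parts, a large
document's words spread across all the reduces, and the reducer only has
to sum -- nothing is buffered in memory.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: the map output key is already "doc_name\tword",
// so the framework groups (and, with the default HashPartitioner,
// partitions) on both parts. The reducer just sums the partial counts;
// no per-document hash is held in memory.
public class DocWordSumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text docAndWord, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();   // partial counts for this (doc, word) pair
    }
    output.collect(docAndWord, new IntWritable(sum));
  }
}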

> Choice 3:  Emit one entry in the reduce(), using null key
> ============================================================
> In this case, the output will be a single entry per invocation of  
> reduce() ...
>
> {
>  "key":null,
>  "value":[["file1", "apple", 25],
>           ["file1", "orange", 16]]
> }

I don't see any advantage to this one. It has all of the memory-limiting
problems of option 1 and will of course do bad things if the downstream
user isn't expecting null keys.

-- Owen

RE: Suggestions of proper usage of "key" parameter ?

Posted by Ricky Ho <rh...@adobe.com>.
Thanks for the detailed explanations in Aaron's and Owen's responses.

One key takeaway is that the "OutputFormat" of a previous job needs to be compatible with the "InputFormat" of its subsequent job.  Otherwise, the key/value demarcation will get screwed up.

It is still not very clear to me when to emit multiple entries versus a single entry from the reduce() function.  Here is a typical case where a reduce() function counts the number of times a particular word appears in a particular document.  There are multiple possible ways the reduce() function can be structured, and which one to choose is not clear to me ...

Choice 1:  Emit one entry in the reduce(), using doc_name as key
==================================================================

# k2 is doc_name,  v2_list is a list of [word, count]

def reduce(k2, v2_list)
    count_holder = Hash.new(0)     # default each count to 0 so += works

    v2_list.each do |word, count|
        count_holder[word] += count
    end

    arr = []
    count_holder.each do |word, count|
        arr << [word, count]
    end

    doc_name = k2
    emit(doc_name, arr)            # one record per document
end

In this case, the output will still be a single entry per invocation of reduce() ...

{
  "key":"file1",
  "value":[["apple", 25],
           ["orange", 16]]
}

Choice 2:  Emit multiple entries in the reduce(), using [doc_name, word] as key
================================================================================

def reduce(k2, v2_list)
    count_holder = Hash.new(0)     # default each count to 0 so += works

    v2_list.each do |word, count|
        count_holder[word] += count
    end

    doc_name = k2
    count_holder.each do |word, count|
        emit([doc_name, word], count)    # one record per (document, word) pair
    end
end

In this case, the output will have multiple entries per invocation of reduce() ...

{
  "key":["file1", "apple"],
  "value":25
}

{
  "key":["file1", "orange"],
  "value":16
}


Choice 3:  Emit one entry in the reduce(), using null key
============================================================

def reduce(k2, v2_list)
    doc_name = k2
    count_holder = Hash.new(0)     # default each count to 0 so += works

    v2_list.each do |word, count|
        count_holder[word] += count
    end

    arr = []
    count_holder.each do |word, count|
        arr << [doc_name, word, count]
    end

    emit(nil, arr)                 # single record with a nil key
end

In this case, the output will be a single entry per invocation of reduce() ...

{
  "key":null,
  "value":[["file1", "apple", 25],
           ["file1", "orange", 16]]
}

Can you compare the above options and shed some light ?

Rgds, Ricky


-----Original Message-----
From: Aaron Kimball [mailto:aaron@cloudera.com]
Sent: Monday, December 15, 2008 4:13 AM
To: core-user@hadoop.apache.org
Subject: Re: Suggestions of proper usage of "key" parameter ?

To expand a bit on Owen's remarks:

It should be pointed out that in the case of a single MapReduce pass from an
input dataset to an output dataset, the keys into the mapper and out of the
reducer may not be particularly interesting to you. However, more
complicated algorithms often involve multiple MapReduce jobs, where an input
dataset has an MR pass over it, yielding an intermediate dataset which
undergoes yet another MR pass, yielding a final dataset.

In such cases, this provides continuity of interface, where every stage has
both "key" and "value" components. Very often the dataset gets partitioned
or organized in such a way that it makes sense to stamp a key on to values
early on in the process, and continue to allow the keys to flow through the
system between passes. This interface makes that much more convenient. For
example, you may have an input record which is joined against multiple other
datasets. Each other data set join may involve a separate mapreduce pass,
but the primary key will be the same the whole time.

As for determining the input key types: The default TextInputFormat gives
you a line offset and a line of text. This offset may not be particularly
useful to your application.  On the other hand, the KeyValueTextInputFormat
will read each line of text from the input file, and split this into a key
Text object and a value Text object based on the first tab char in the line.
This matches the formatting of output files done by the default
TextOutputFormat.  Chained MRs should set the input format to this one, as
Hadoop won't "know" that this is your intended use case.

If your intermediate reducer outputs more complicated datatypes, you may
want to use SequenceFileOutputFormat, which marshals your data types into a
binary file format. The SequenceFileInputFormat will then read in the data,
and demarshal it into the same data types that you had already encoded.
(Your final pass may want to use TextOutputFormat to get everything back to
a human-readable form)

- Aaron


On Sun, Dec 14, 2008 at 11:39 PM, Owen O'Malley <om...@apache.org> wrote:

>
> On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:
>
>  Yes, I am referring to the "key" INPUT INTO the map() function and the
>> "key" EMITTED FROM the reduce() function.  Can someone explain why do we
>> need a "key" in these cases and what is the proper use of it ?
>>
>
> It was a design choice and could have been done as:
>
> R1 -> map -> K,V -> reduce -> R2
>
> instead of
>
> K1,V1 -> map -> K2,V2 -> reduce -> K3,V3
>
> but since the input of the reduce is sorted on K2, the output of the reduce
> is also typically sorted and therefore keyed. Since jobs are often chained
> together, it makes sense to make the reduce input match the map input. Of
> course everything you could do with the first option is possible with the
> second using either K1 = R1 or V1 = R1. Having the keys is often
> convenient...
>
>  Who determines what the "key" should be ?  (by the corresponding
>> "InputFormat" implementation class) ?
>>
>
> The InputFormat makes the choice.
>
>  In this case, what is the key in the map() call ?  (name of the input
>> file) ?
>>
>
> TextInputFormat uses the byte offset as the key and the line as the value.
>
>  What if the reduce() function emits multiple <key, value> entries or not
>> emitting any entry at all ?  Is this considered OK ?
>>
>
> Yes.
>
>  What if the reduce() function emits a <key, value> entry whose key is not
>> the same as the input key parameter to the reduce() function ?  Is this OK ?
>>
>
> Yes, although the reduce output is not re-sorted, so the results won't be
> sorted unless K3 is a subset of K2.
>
>  If there is a two Map/Reduce cycle chained together.  Is the "key" input
>> into the 2nd round map() function determined by the "key" emitted from the
>> 1st round reduce() function ?
>>
>
> Yes.
>
> -- Owen
>

Re: Suggestions of proper usage of "key" parameter ?

Posted by Aaron Kimball <aa...@cloudera.com>.
To expand a bit on Owen's remarks:

It should be pointed out that in the case of a single MapReduce pass from an
input dataset to an output dataset, the keys into the mapper and out of the
reducer may not be particularly interesting to you. However, more
complicated algorithms often involve multiple MapReduce jobs, where an input
dataset has an MR pass over it, yielding an intermediate dataset which
undergoes yet another MR pass, yielding a final dataset.

In such cases, this provides continuity of interface, where every stage has
both "key" and "value" components. Very often the dataset gets partitioned
or organized in such a way that it makes sense to stamp a key on to values
early on in the process, and continue to allow the keys to flow through the
system between passes. This interface makes that much more convenient. For
example, you may have an input record which is joined against multiple other
datasets. Each other data set join may involve a separate mapreduce pass,
but the primary key will be the same the whole time.

As for determining the input key types: The default TextInputFormat gives
you a line offset and a line of text. This offset may not be particularly
useful to your application.  On the other hand, the KeyValueTextInputFormat
will read each line of text from the input file, and split this into a key
Text object and a value Text object based on the first tab char in the line.
This matches the formatting of output files done by the default
TextOutputFormat.  Chained MRs should set the input format to this one, as
Hadoop won't "know" that this is your intended use case.

If your intermediate reducer outputs more complicated datatypes, you may
want to use SequenceFileOutputFormat, which marshals your data types into a
binary file format. The SequenceFileInputFormat will then read in the data,
and demarshal it into the same data types that you had already encoded.
(Your final pass may want to use TextOutputFormat to get everything back to
a human-readable form)
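
A hedged sketch of the wiring Aaron describes, in the old
org.apache.hadoop.mapred API of this era (paths and job names are
hypothetical placeholders, and the mapper/reducer classes are left at the
identity defaults so the snippet stays self-contained): pass 1 reads
tab-separated text with KeyValueTextInputFormat and writes its output as a
SequenceFile; pass 2 reads that SequenceFile back with the same key/value
types and writes human-readable text.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // Pass 1: tab-separated text in, binary SequenceFile out. With no
    // mapper/reducer set, the identity classes pass <Text, Text> through.
    JobConf pass1 = new JobConf(ChainedJobs.class);
    pass1.setJobName("pass1");
    pass1.setInputFormat(KeyValueTextInputFormat.class);   // key = text before first tab
    FileInputFormat.setInputPaths(pass1, new Path("/data/input"));   // hypothetical path
    pass1.setOutputKeyClass(Text.class);
    pass1.setOutputValueClass(Text.class);
    pass1.setOutputFormat(SequenceFileOutputFormat.class); // keep types binary between passes
    FileOutputFormat.setOutputPath(pass1, new Path("/data/intermediate"));
    JobClient.runJob(pass1);

    // Pass 2: read the SequenceFile back with the same types, and write
    // the final result as plain text for human consumption.
    JobConf pass2 = new JobConf(ChainedJobs.class);
    pass2.setJobName("pass2");
    pass2.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(pass2, new Path("/data/intermediate"));
    pass2.setOutputKeyClass(Text.class);
    pass2.setOutputValueClass(Text.class);
    pass2.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(pass2, new Path("/data/output"));
    JobClient.runJob(pass2);
  }
}

A real chain would add setMapperClass()/setReducerClass() calls on each
JobConf and could use IntWritable values as in the word-count discussion
above; the point here is only that the OutputFormat of one job and the
InputFormat of the next have to agree.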

- Aaron


On Sun, Dec 14, 2008 at 11:39 PM, Owen O'Malley <om...@apache.org> wrote:

>
> On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:
>
>  Yes, I am referring to the "key" INPUT INTO the map() function and the
>> "key" EMITTED FROM the reduce() function.  Can someone explain why do we
>> need a "key" in these cases and what is the proper use of it ?
>>
>
> It was a design choice and could have been done as:
>
> R1 -> map -> K,V -> reduce -> R2
>
> instead of
>
> K1,V1 -> map -> K2,V2 -> reduce -> K3,V3
>
> but since the input of the reduce is sorted on K2, the output of the reduce
> is also typically sorted and therefore keyed. Since jobs are often chained
> together, it makes sense to make the reduce input match the map input. Of
> course everything you could do with the first option is possible with the
> second using either K1 = R1 or V1 = R1. Having the keys is often
> convenient...
>
>  Who determines what the "key" should be ?  (by the corresponding
>> "InputFormat" implementation class) ?
>>
>
> The InputFormat makes the choice.
>
>  In this case, what is the key in the map() call ?  (name of the input
>> file) ?
>>
>
> TextInputFormat uses the byte offset as the key and the line as the value.
>
>  What if the reduce() function emits multiple <key, value> entries or not
>> emitting any entry at all ?  Is this considered OK ?
>>
>
> Yes.
>
>  What if the reduce() function emits a <key, value> entry whose key is not
>> the same as the input key parameter to the reduce() function ?  Is this OK ?
>>
>
> Yes, although the reduce output is not re-sorted, so the results won't be
> sorted unless K3 is a subset of K2.
>
>  If there is a two Map/Reduce cycle chained together.  Is the "key" input
>> into the 2nd round map() function determined by the "key" emitted from the
>> 1st round reduce() function ?
>>
>
> Yes.
>
> -- Owen
>

Re: Suggestions of proper usage of "key" parameter ?

Posted by Owen O'Malley <om...@apache.org>.
On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:

> Yes, I am referring to the "key" INPUT INTO the map() function and  
> the "key" EMITTED FROM the reduce() function.  Can someone explain  
> why do we need a "key" in these cases and what is the proper use of  
> it ?

It was a design choice and could have been done as:

R1 -> map -> K,V -> reduce -> R2

instead of

K1,V1 -> map -> K2,V2 -> reduce -> K3,V3

but since the input of the reduce is sorted on K2, the output of the  
reduce is also typically sorted and therefore keyed. Since jobs are  
often chained together, it makes sense to make the reduce input match  
the map input. Of course everything you could do with the first option  
is possible with the second using either K1 = R1 or V1 = R1. Having  
the keys is often convenient...

> Who determines what the "key" should be ?  (by the corresponding  
> "InputFormat" implementation class) ?

The InputFormat makes the choice.

> In this case, what is the key in the map() call ?  (name of the  
> input file) ?

TextInputFormat uses the byte offset as the key and the line as the  
value.
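
For concreteness, a hedged sketch of the mapper signature this implies in
the old org.apache.hadoop.mapred API (the class name is hypothetical): the
key handed to map() is the byte offset of the line within the file, not
the file name, and a typical word-count mapper simply ignores it.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: TextInputFormat supplies <LongWritable, Text> pairs,
// i.e. (byte offset of the line, contents of the line).
public class LineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (token.length() == 0) {
        continue;                 // skip empty tokens from leading whitespace
      }
      word.set(token);
      output.collect(word, ONE);  // the offset key is simply dropped
    }
  }
}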

> What if the reduce() function emits multiple <key, value> entries or  
> not emitting any entry at all ?  Is this considered OK ?

Yes.

> What if the reduce() function emits a <key, value> entry whose key  
> is not the same as the input key parameter to the reduce()  
> function ?  Is this OK ?

Yes, although the reduce output is not re-sorted, so the results won't  
be sorted unless K3 is a subset of K2.

> If there is a two Map/Reduce cycle chained together.  Is the "key"  
> input into the 2nd round map() function determined by the "key"  
> emitted from the 1st round reduce() function ?

Yes.

-- Owen

Suggestions of proper usage of "key" parameter ?

Posted by Ricky Ho <rh...@adobe.com>.
While the "key" EMITTED FROM the map() function and the "key" INPUT INTO the reduce() function has a clear meaning to me, the opposite is not so clear to me.

Yes, I am referring to the "key" INPUT INTO the map() function and the "key" EMITTED FROM the reduce() function.  Can someone explain why we need a "key" in these cases and what the proper use of it is ?


Map Phase
==========
Why do we need a "key" parameter in the map() function ?  Why isn't it possible to store all the information in just the "value" parameter ?

Who determines what the "key" should be ?  (by the corresponding "InputFormat" implementation class) ?

E.g.  If I am using the "TextInputFormat" and "FileInputFormat" :
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, filename);

In this case, what is the key in the map() call ?  (name of the input file) ?


Reduce Phase
=============
Why do we need the reduce() function to emit a <key, value> rather than just the value ?

What if the reduce() function emits multiple <key, value> entries, or does not emit any entry at all ?  Is this considered OK ?

What if the reduce() function emits a <key, value> entry whose key is not the same as the input key parameter to the reduce() function ?  Is this OK ?

If two Map/Reduce cycles are chained together, is the "key" input into the 2nd round map() function determined by the "key" emitted from the 1st round reduce() function ?


Rgds,
Ricky



Re: Building an Hadoop & Hbase Cluster

Posted by Yossi Ittach <yo...@gmail.com>.
A bit too late, but thank you all.  I will tell you about my experience
when it's over.


Vale et me ama
Yossi


On Sun, Dec 14, 2008 at 7:22 PM, Jean-Daniel Cryans <jd...@gmail.com> wrote:

> It would be interesting to see how HBase would react to a loss of the NN
> when trying to shutdown HBase in the following minutes, but I think it
> would
> be better to try than just losing the two at the same time.
>
> J-D
>
> On Sun, Dec 14, 2008 at 12:13 PM, Wesley Chow <we...@s7labs.com> wrote:
>
> >
> >  3. Can I run the NameNode and the HBaseMaster on the same machine?
> >>>
> >>
> >>
> >  3. No, that would make 2 SPOFs at the same place. (but we are currently
> >> integrating Zookeeper in HBase so the HMaster SPOF will be fixed
> hopefully
> >> in 0.20)
> >>
> >
> > Isn't it the case that if the NN goes down then HBase doesn't function
> > properly? If so, then from the perspective of HBase reliability, you'd
> want
> > to have the HBaseMaster on the same machine as NN, because that would
> make a
> > single machine a point of failure, whereas having them on separate
> machines
> > would mean you'd have 2 machines that are each single point of failures.
> >
> >
> > Wes
> >
> >
>

Re: Building an Hadoop & Hbase Cluster

Posted by Jean-Daniel Cryans <jd...@gmail.com>.
It would be interesting to see how HBase would react to a loss of the NN
when you then try to shut down HBase in the following minutes, but I think
it would be better to be able to try that than to just lose the two at the
same time.

J-D

On Sun, Dec 14, 2008 at 12:13 PM, Wesley Chow <we...@s7labs.com> wrote:

>
>  3. Can I run the NameNode and the HBaseMaster on the same machine?
>>>
>>
>>
>  3. No, that would make 2 SPOFs at the same place. (but we are currently
>> integrating Zookeeper in HBase so the HMaster SPOF will be fixed hopefully
>> in 0.20)
>>
>
> Isn't it the case that if the NN goes down then HBase doesn't function
> properly? If so, then from the perspective of HBase reliability, you'd want
> to have the HBaseMaster on the same machine as NN, because that would make a
> single machine a point of failure, whereas having them on separate machines
> would mean you'd have 2 machines that are each single point of failures.
>
>
> Wes
>
>

Re: Building an Hadoop & Hbase Cluster

Posted by Andrew Purtell <ap...@yahoo.com>.
Hi Wesley,

I follow your logic and would agree except that the NameNode
on a busy cluster can really become overloaded dealing with
replication, etc. so it is best to run as few services on the
NameNode as possible.

   - Andy

> From: Wesley Chow <we...@s7labs.com>
> Subject: Re: Building an Hadoop & Hbase Cluster
> To: hbase-user@hadoop.apache.org
> Date: Sunday, December 14, 2008, 9:13 AM
> >> 3. Can I run the NameNode and the HBaseMaster on
> the same machine?
> > 
> 
> > 3. No, that would make 2 SPOFs at the same place. (but
> we are currently
> > integrating Zookeeper in HBase so the HMaster SPOF
> will be fixed hopefully
> > in 0.20)
> 
> Isn't it the case that if the NN goes down then HBase
> doesn't function properly? If so, then from the
> perspective of HBase reliability, you'd want to have the
> HBaseMaster on the same machine as NN, because that would
> make a single machine a point of failure, whereas having
> them on separate machines would mean you'd have 2
> machines that are each single point of failures.
> 
> Wes


      

Re: Building an Hadoop & Hbase Cluster

Posted by Wesley Chow <we...@s7labs.com>.
>> 3. Can I run the NameNode and the HBaseMaster on the same machine?
>

> 3. No, that would make 2 SPOFs at the same place. (but we are  
> currently
> integrating Zookeeper in HBase so the HMaster SPOF will be fixed  
> hopefully
> in 0.20)

Isn't it the case that if the NN goes down then HBase doesn't function
properly? If so, then from the perspective of HBase reliability, you'd
want to have the HBaseMaster on the same machine as the NN, because that
would make a single machine a point of failure, whereas having them on
separate machines would mean you'd have 2 machines that are each a
single point of failure.


Wes


Re: Building an Hadoop & Hbase Cluster

Posted by Jean-Daniel Cryans <jd...@gmail.com>.
Yossi,

I will answer according to what I think are the best practices, but this
will require many machines, and for a little cluster, as you will see, it
may be hard to apply.

1 and 2. You should have X RS for X DN on X machines, i.e. one RegionServer
and one DataNode per machine. This is to benefit from data locality (when
it does happen).

3. No, that would make 2 SPOFs at the same place. (but we are currently
integrating Zookeeper in HBase so the HMaster SPOF will be fixed hopefully
in 0.20)

4. IIRC, the SNN has the same memory requirements as the NN, and those can
get very big depending on your number of files. So I would say that you
shouldn't do that, especially since a RS will also live on that node.

I also recommend looking at Jeff Dean's slides presented at the
Data-Intensive Computing Symposium since what you want to do is similar to
what Google already does: http://research.yahoo.com/files/6DeanGoogle.pdf
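
To make the layout concrete, here is a hedged sketch of how a small
cluster following these points might look with the stock conf files (host
names are hypothetical):

# hadoop/conf/masters  -- host(s) for the SecondaryNameNode; per point 4,
# keep it off the DataNode/RegionServer boxes if you can
snn.example.com

# hadoop/conf/slaves   -- DataNodes (and TaskTrackers)
node1.example.com
node2.example.com
node3.example.com

# hbase/conf/regionservers  -- same hosts as the DataNodes (points 1 and 2)
node1.example.com
node2.example.com
node3.example.com

The NameNode runs on whichever box you invoke bin/start-dfs.sh, and the
HBase master on whichever box you invoke bin/start-hbase.sh, so per point
3 those would be two different machines.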

J-D

On Sun, Dec 14, 2008 at 3:18 AM, Yossi Ittach <yo...@gmail.com> wrote:

> Hi
>
> I want to build a little cluster of Hbase & Hadoop. Starngely , I couldn't
> find any recommendations on the web.
> If you could share your experience , it would be great , specially in
> concern with these questions :
>
> 1. How many RegionServers I need for X DataNodes?
> 2. Can I (and should I?) run a RegionServer and a DataNode on the same
> machine?
> 3. Can I run the NameNode and the HBaseMaster on the same machine?
> 4. Can I run a Secondary Name Node and a DataNode on the same machine?
>
> Thanks!
>
>
> Vale et me ama
> Yossi
>
