Posted to user@hbase.apache.org by Wilm Schumacher <wi...@cawoom.com> on 2014/07/31 17:17:52 UTC

hbase and hadoop (for "normal" hdfs) cluster together?

Hi,

I have a "conceptual" question and would appreciate hints.

My task is to save files to HDFS, maintain some information about them
in an HBase database, and serve both to the application.

Per file I have around 50 rows with 10 columns (in 2 column families) in
the tables; the values are strings of roughly 100 characters.

The files are of ordinary size (perhaps between a few kB and 100 MB).
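
For concreteness, a minimal sketch of that layout, assuming the HBase
1.0 client API (the paths, table name and family names are made up for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FileMetaStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // 1) The file itself goes to HDFS.
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/data/files/file-0001");
        fs.copyFromLocalFile(new Path("/tmp/file-0001"), target);

        // 2) The ~50 metadata rows go to HBase (two column families,
        //    short string values, as described above).
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table meta = conn.getTable(TableName.valueOf("file_meta"))) {
            Put put = new Put(Bytes.toBytes("file-0001#row-01"));
            put.addColumn(Bytes.toBytes("a"), Bytes.toBytes("path"),
                    Bytes.toBytes(target.toString()));
            put.addColumn(Bytes.toBytes("b"), Bytes.toBytes("note"),
                    Bytes.toBytes("a string value of roughly 100 chars"));
            meta.put(put);
        }
    }
}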

By this estimate the number of files is much smaller than the number of
rows (times columns), but the files take up far more disk space than the
HBase data. I would further estimate that for every get on a file there
are hundreds of gets against HBase.

For the files I want to run a Hadoop cluster (obviously). The question
now arises: should I run HBase on the same Hadoop cluster?

The pro of running them together is obvious: I would only have to run
one Hadoop cluster, which would save time, money and nerves.

On the other hand, it wouldn't be possible to tune the cluster
specifically for one task or the other. E.g., if I want to make HBase
more "distributed" by raising the replication factor (to, let's say, 6),
I would have to double the amount of disk for the "normal" files, too.

So: what should I do?

Do you have any comments or hints on this question?

Best wishes,

wilm

Re: hbase and hadoop (for "normal" hdfs) cluster together?

Posted by Wilm Schumacher <wi...@cawoom.com>.
Hi,

On 31.07.2014 at 20:28, Nick Dimiduk wrote:
> What else will this cluster do? Are you planning to run MR against the data
> here?
The cluster does nothing other than serve this application, which
consists of the "hdfs part" and the "hbase part".

And yes, I plan to run some MR jobs against the data, a sort of health
check of the data. But these jobs are low priority and can run when the
other parts are not needed.
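
One way to keep such jobs out of the way, assuming plain MapReduce and a
scheduler that honors job priorities (the job name and the omitted
mapper/reducer setup are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobPriority;

public class HealthCheckJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "data-health-check");
        // Run the health check at low priority so the online HDFS and
        // HBase traffic wins whenever the cluster is busy.
        job.setPriority(JobPriority.LOW);
        // ... set jar, mapper, reducer, input/output paths, then:
        // job.waitForCompletion(true);
    }
}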

> If this cluster is dedicated to your application and you have enough
> IO capacity to support all application needs on the cluster, I see no
> reason to run two clusters.
okay.

> The reason we recommend against running mixed-workload clusters is that
> those additional tasks compete for the resources HBase needs to meet its
> online SLAs.
okay.

So, if I understand correctly, you recommend putting it together. I will
run checks to test the IO capacity (e.g. with the stock benchmarks
sketched below). If HDFS gets do not kill the HBase performance, I will
go with a combined cluster.
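
Two stock benchmarks could serve as such checks: TestDFSIO from the
Hadoop MapReduce test jar for raw HDFS throughput, and HBase's
PerformanceEvaluation for concurrent gets. A sketch (the test jar's
exact name and path vary by version):

# raw HDFS write throughput; run a matching -read pass afterwards
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO \
    -write -nrFiles 16 -size 1GB

# random reads against HBase with 10 concurrent clients
hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 10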

Thanks a lot,

Wilm

Re: hbase and hadoop (for "normal" hdfs) cluster together?

Posted by Nick Dimiduk <nd...@gmail.com>.
Hi Wilm,

What else will this cluster do? Are you planning to run MR against the data
here? If this cluster is dedicated to your application and you have enough
IO capacity to support all application needs on the cluster, I see no
reason to run two clusters.

The reason we recommend against running mixed-workload clusters is that
those additional tasks compete for the resources HBase needs to meet its
online SLAs.

-n



Re: hbase and hadoop (for "normal" hdfs) cluster together?

Posted by Ted Yu <yu...@gmail.com>.
HBASE-11339 'HBase MOB' may be of interest to you - it is still in
development.
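
For context: MOB keeps medium-sized values in separate MOB files instead
of the regular store files, above a per-family threshold. A sketch of
such a column family with the admin API from that work, which eventually
shipped in HBase 2.0 (table and family names are hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class MobTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(
                HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("file_blobs"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes("f"))
                        // values above ~100 KB are written as MOBs
                        .setMobEnabled(true)
                        .setMobThreshold(100 * 1024L)
                        .build())
                    .build());
        }
    }
}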

Cheers



Re: hbase and hadoop (for "normal" hdfs) cluster together?

Posted by Wilm Schumacher <wi...@cawoom.com>.

On 31.07.2014 at 18:08, Ted Yu wrote:
> What's the read/write mix in your workload?
I would think roughly:

1 put to 2-5 reads for the "hdfs files" (estimated)

and

1 put to hundreds of reads in the hbase table

So in short form:

= for the files
* number of puts ~ gets
* "small" number of puts and gets
* "small" number of files
* "large" amount of disk space

= for the rows
* number of puts << gets
* "large" number of gets (and puts)
* "large" number of rows (and rows*columns)
* "small" amount of disk space

"Small" and "large" are in quotes because they are only meaningful in
comparison to the other task.

So, basically, these are two completely different tasks. That is the
origin of my question.

> Have you looked at HBASE-10070 'HBase read high-availability using
> timeline-consistent region replicas' (phase 1 has been merged for the
> upcoming 1.0 release)?
No, but I will look now ;).

Best wishes and thanks for the reply

Wilm

Re: hbase and hadoop (for "normal" hdfs) cluster together?

Posted by Ted Yu <yu...@gmail.com>.
What's the read/write mix in your workload?

Have you looked at HBASE-10070 'HBase read high-availability using
timeline-consistent region replicas' (phase 1 has been merged for the
upcoming 1.0 release)?
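
For reference, with the 1.0 client API a timeline-consistent read looks
roughly like this (table and row key are hypothetical; the table must be
created with region replication greater than 1 for replicas to exist):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(
                HBaseConfiguration.create());
             Table meta = conn.getTable(TableName.valueOf("file_meta"))) {
            Get get = new Get(Bytes.toBytes("file-0001#row-01"));
            // allow a (possibly stale) region replica to answer instead
            // of only the primary region
            get.setConsistency(Consistency.TIMELINE);
            Result result = meta.get(get);
            System.out.println("served by a stale replica: " + result.isStale());
        }
    }
}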

Cheers

