Posted to user@hbase.apache.org by Peter Veentjer <al...@gmail.com> on 2011/01/05 14:41:47 UTC

Using HBase in combination with HDFS directly

Hi Guys,

I'm currently writing a POC based on HBase, and I'm spending more time on
writing a UI than on writing the HBase functionality. So I'm very excited about
exploring HBase further, doing some serious performance and scalability
tests, and seeing if we can use it as a core technology instead of the
time/resource-intensive GigaSpaces.

My question:

I'm currently using HBase and I also want to use HDFS directly to store
files. If the HBase server(s) are installed, can I directly access the HDFS
of these servers, or is it better to set up a separate Hadoop server for
running HDFS?

Re: Using HBase in combination with HDFS directly

Posted by Peter Veentjer <al...@gmail.com>.

I just replaced the native-filesystem-based solution with HDFS without
introducing any additional servers, and it works perfectly in combination
with encryption of files. For the POC this is sufficient.

I think I have spent more time typing emails today than switching to
HDFS.

Thanks!


Re: Using HBase in combination with HDFS directly

Posted by Peter Veentjer <al...@gmail.com>.

On Wed, Jan 5, 2011 at 5:00 PM, Friso van Vollenhoven <
fvanvollenhoven@xebia.com> wrote:

> I guess so.
>
> HBase actually has quite a strong consistency model.


It depends on how consistency is defined. HBase does not support repeatable
reads because it has no concept of a transaction, so each time you read the
same data you can get a different result. In STM terms this would be called
extremely low consistency. There are higher levels of consistency, like
'snapshot' consistency, where your reads are not only repeatable but also
causally consistent. And then of course there is the serializable isolation
level, where even write skews are prevented.
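
To make the difference between a plain read and a snapshot read concrete,
here is a minimal sketch. It is a toy in-memory model, not the HBase client
API: the class VersionedStore and its methods (put, get_latest, get_as_of,
snapshot) are made-up names for illustration only. It assumes every write
gets an increasing logical timestamp and that old versions are kept.

```python
class VersionedStore:
    """Toy in-memory store that keeps every written version of a key."""

    def __init__(self):
        self._last_ts = 0       # logical clock, bumped on every write
        self._versions = {}     # key -> list of (timestamp, value), oldest first

    def put(self, key, value):
        self._last_ts += 1
        self._versions.setdefault(key, []).append((self._last_ts, value))
        return self._last_ts

    def snapshot(self):
        """Return a timestamp that later reads can pin themselves to."""
        return self._last_ts

    def get_latest(self, key):
        """Plain read: always sees the newest version."""
        versions = self._versions.get(key)
        return versions[-1][1] if versions else None

    def get_as_of(self, key, snapshot_ts):
        """Snapshot read: newest version written at or before snapshot_ts."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.put("doc-1", "draft")

first_read = store.get_latest("doc-1")
snapshot_ts = store.snapshot()

store.put("doc-1", "final")            # another writer sneaks in here

second_read = store.get_latest("doc-1")
pinned_read = store.get_as_of("doc-1", snapshot_ts)

print(first_read, second_read)         # the two plain reads disagree
print(pinned_read)                     # the snapshot read still sees the old value
```

The two plain reads see different values because nothing ties them together,
while the read pinned to snapshot_ts keeps returning what was visible when the
snapshot was taken. Preventing write skew would need more than this, since it
involves checking what concurrent writers did, which the sketch does not try
to model.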


> Thing is, that it is just row level. Multi row transactions would require
> multiple locks and some kind of commit / roll back solution. Have you had a
> look at Google's percolator paper?
>

Not yet. I'll check it out.



Re: Using HBase in combination with HDFS directly

Posted by Friso van Vollenhoven <fv...@xebia.com>.

I guess so.

HBase actually has quite a strong consistency model. The thing is that it applies only at the row level. Multi-row transactions would require multiple locks and some kind of commit/rollback solution. Have you had a look at Google's Percolator paper?


Friso
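
As a rough illustration of the "multiple locks plus commit/rollback" idea
mentioned above, here is a minimal sketch. It is a toy in-memory table, not
an HBase API and not the protocol from the Percolator paper; RowStore,
update_rows, withdraw and deposit are hypothetical names. It assumes one lock
per row, takes the locks in sorted key order to avoid deadlock, and restores
the old values if any update fails.

```python
import threading


class RowStore:
    """Toy table with one lock per row, so single-row updates are atomic."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._locks = {key: threading.Lock() for key in self._rows}

    def get(self, key):
        with self._locks[key]:
            return self._rows[key]

    def update_rows(self, updates):
        """Apply every update or none. `updates` maps row key -> f(old) -> new."""
        keys = sorted(updates)                       # fixed order avoids deadlock
        locks = [self._locks[key] for key in keys]   # KeyError before locking anything
        for lock in locks:
            lock.acquire()
        undo = {}                                    # old values, kept for rollback
        try:
            for key in keys:
                undo[key] = self._rows[key]
                self._rows[key] = updates[key](self._rows[key])
        except Exception:
            self._rows.update(undo)                  # roll back rows changed so far
            raise
        finally:
            for lock in reversed(locks):
                lock.release()


def withdraw(amount):
    def apply(balance):
        if balance < amount:
            raise ValueError("insufficient funds")
        return balance - amount
    return apply


def deposit(amount):
    return lambda balance: balance + amount


accounts = RowStore({"alice": 100, "bob": 20})

# Both rows change together: this one commits.
accounts.update_rows({"alice": withdraw(30), "bob": deposit(30)})

# The second row fails, so the change to the first row is rolled back.
try:
    accounts.update_rows({"alice": deposit(500), "bob": withdraw(500)})
except ValueError:
    pass

print(accounts.get("alice"), accounts.get("bob"))
```

This only works inside a single process. Doing the same across region servers
would need the locks and the undo information to survive failures, which is
the hard part that a real multi-row transaction layer has to solve.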



On 5 jan 2011, at 16:49, Peter Veentjer wrote:

> I also want to see if an STM like Multiverse can be aligned with NoSQL
> solutions like HBase. But to do that, I first need to get more hands on
> experience with NoSQL solutions.

Re: Using HBase in combination with HDFS directly

Posted by Peter Veentjer <al...@gmail.com>.

I also want to see if an STM like Multiverse can be aligned with NoSQL
solutions like HBase. But to do that, I first need to get more hands-on
experience with NoSQL solutions.


Re: Using HBase in combination with HDFS directly

Posted by Peter Veentjer <al...@gmail.com>.

On Wed, Jan 5, 2011 at 4:03 PM, Friso van Vollenhoven <
fvanvollenhoven@xebia.com> wrote:

> Hi Peter,
>
> Do you mean you want to use the HDFS that HBase relies on for other things
> and not just exclusively HBase? That should be just fine. We do it all the
> time.
>
>
Ok thanks.

> Are you worried about putting too much load on it?

For the POC it won't matter that much. I can get my stuff up and running.

> I guess that depends on the type of workload that you have and what you do
> with it. But generally I think it is nice to have all nodes be the same (so
> all workers are datanode and region server), such that you don't have to
> scale them out separately.
>

> Peter, are you based in The Netherlands by any chance? There is a NoSQL
> meetup group in NL (http://www.meetup.com/nosql-nl/) with meetups every
> now and then. The next one is on January 24 and is all about HBase. We're
> doing an on-the-spot install on a number of the laptops present to create a
> temporary cluster and play around with it. I have been working with Hadoop
> and HBase for the past couple of months, so if you care to come by, I'd be
> happy to share some experiences.

Yes, I live in Holland. I'm a former Xebia employee :) I think I'll visit one
of the NoSQL meetups.

We are building a kind of application server where, instead of providing
services like JMS, Servlet, EJBs etc., we are providing services for secured
document storage, message exchange, semantic analysis of documents etc. It
is all based on GigaSpaces, but I have the impression (after working more
than a year with it) that it is very time consuming to get right. Apart from
all the correctness issues (and there were/are many, based on bad usage of
GigaSpaces and architectural choices), there are also some
performance/scalability issues that need solving.

So I decided to rewrite the main use cases using HBase. I had most of the
functionality up and running in a few days, and most of the 'bad
architectural choices' we are going to remove in the next 6 months are not
there from the beginning (e.g. using streams instead of byte arrays for
document processing; how stupid can you be). It was also a nice exercise to
play with HBase and less consistent solutions.

I normally work on realizing very high consistency for Multiverse:

http://multiverse.codehaus.org

So I want to get some hands-on experience with less consistent
solutions.


Re: Using HBase in combination with HDFS directly

Posted by Eric <er...@gmail.com>.

You can install both the datanode and the HBase region server on your cluster
machines. This is common practice. Obviously you should use decent hardware,
such as a multi-core processor, more than enough memory, and multiple hard
disks.


Re: Using HBase in combination with HDFS directly

Posted by Friso van Vollenhoven <fv...@xebia.com>.

Hi Peter,

Do you mean you want to use the HDFS that HBase relies on for other things and not just exclusively HBase? That should be just fine. We do it all the time.

Are you worried about putting too much load on it? I guess that depends on the type of workload that you have and what you do with it. But generally I think it is nice to have all nodes be the same (so all workers are datanode and region server), such that you don't have to scale them out separately.

Friso
