Posted to user@hbase.apache.org by Nitin Gupta <ni...@gmail.com> on 2009/06/07 15:44:17 UTC

Help needed - Adding HBase to architecture

Hi All,
I am working on an application that is a kind of social network for mobile
WAP. We recently added support for files and attachments. Since we are not
in production yet, we currently keep all the files in the RDBMS that our
application uses, but I am more than convinced that this will not work once
we are in production.

I have been reading about HBase and am trying to convince myself that it
fits our file storage, search, and retrieval operations. I would like expert
HBase users/developers to confirm or correct my opinion. To clarify, here is
what I am planning to do:

1. Use an RDBMS for the relational data in the application.
2. Save all the file/blob data in HBase.
3. When required, query app data from the RDBMS and retrieve the files from
   the HBase data store.
4. Keep the metadata of the files in my RDBMS so that files can be
   associated with my app's entities.

Please help me decide whether this is the right approach. My app is supposed
to support images as well, so I would also appreciate advice on whether
HBase is the right solution for me in conjunction with an imaging tool.

Since my team is predominantly Windows based, I would also like to know
whether it is possible to run HBase on a Windows machine in standalone and
in clustered mode.

Thanks for all your help.

nitin
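
A minimal sketch of the split described in the plan above, assuming SQLite as a stand-in for the RDBMS and a hypothetical in-memory `InMemoryBlobStore` in place of HBase (or HDFS). The table layout and all function names are illustrative only and are not taken from any HBase API.

```python
import sqlite3
import uuid


class InMemoryBlobStore:
    """Hypothetical stand-in for the blob side (HBase, HDFS, ...)."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def create_schema(db: sqlite3.Connection) -> None:
    # Metadata only: the file bytes never enter the relational database.
    db.execute(
        "CREATE TABLE attachments ("
        " blob_key TEXT PRIMARY KEY,"
        " owner_id INTEGER NOT NULL,"
        " filename TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )


def save_attachment(db, blobs, owner_id: int, filename: str, data: bytes) -> str:
    blob_key = uuid.uuid4().hex
    blobs.put(blob_key, data)  # write the bytes first
    db.execute(
        "INSERT INTO attachments VALUES (?, ?, ?, ?)",
        (blob_key, owner_id, filename, len(data)),
    )  # then the row that links them to an app entity
    db.commit()
    return blob_key


def load_attachments(db, blobs, owner_id: int):
    # Relational query picks the files; the blob store supplies the bytes.
    rows = db.execute(
        "SELECT blob_key, filename FROM attachments"
        " WHERE owner_id = ? ORDER BY filename",
        (owner_id,),
    ).fetchall()
    return [(filename, blobs.get(blob_key)) for blob_key, filename in rows]


db = sqlite3.connect(":memory:")
blobs = InMemoryBlobStore()
create_schema(db)
save_attachment(db, blobs, owner_id=7, filename="avatar.png", data=b"\x89PNG...")
print(load_attachments(db, blobs, owner_id=7))
```

Writing the blob before the metadata row means a failed write can leave an orphaned blob but never a row that points at missing bytes; whether that trade-off suits a given application is a design choice, not something the thread prescribes.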

Re: Help needed - Adding HBase to architecture

Posted by zsongbo <zs...@gmail.com>.
Thanks, Andy. I believe that 0.20+ will be greatly improved.




Re: Help needed - Adding HBase to architecture

Posted by Andrew Purtell <ap...@apache.org>.
Hi Schubert,

I have 2TB and 1TB storage densities, respectively, on my test environments,
so I very much understand your point of view.

I think 0.20 will be able to come a lot closer to the goal of utilizing all
of that space than 0.19 can. If you can wait for the release of 0.20, it may
be worth experimenting to try and achieve 2000 regions with 1GB store files.
I personally am planning to run such an experiment. 

  - Andy




________________________________
From: zsongbo <zs...@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Sunday, June 14, 2009 9:34:44 AM
Subject: Re: Help needed - Adding HBase to architecture

Thanks, Andy and stack, for sharing your experience. In my data management
system with HBase, I want to store a HUGE amount of data.
But assuming each node serves 1000 regions (250MB each), only 250GB of
storage is used.  We have 2TB of disk on each node.
So, for now, we store the data in files in HDFS and create a simple index to
query and locate these files.

Schubert

On Sun, Jun 14, 2009 at 3:18 AM, Andrew Purtell <ap...@apache.org> wrote:

> 0.19 will have trouble compacting regions with large store files (> 1GB),
> especially if they are compressed.
>
> 0.20 is such a game changer that all the old experience and assumptions
> will have to be thrown out and all of this testing redone. That is a very
> good thing! :-) Kudos to all those who rebuilt the region server for this
> release.
>
>  - Andy
>
>
>
>
> ________________________________
> From: stack <st...@duboce.net>
> To: hbase-user@hadoop.apache.org
> Sent: Saturday, June 13, 2009 12:13:58 PM
> Subject: Re: Help needed - Adding HBase to architecture
>
> At Powerset, we have ~80 regions per node on > 100 nodes.
>
> I've seen other clusters with hundreds and in testing have come close to a
> thousand per node.
>
> When a node has this many regions on board and it crashes, it's going to
> take a while to recover.
>
> We've not played with it in a while but regions could be fatter.  By
> default, the biggest store file in a region is < 256M.  Depending on the
> type of your data and your access patterns, we should probably look at
> doubling or quadrupling this size.  Then we could carry low hundreds of
> regions but they'd have more heft to them.
>
> St.Ack
>
> On Sat, Jun 13, 2009 at 11:59 AM, zsongbo <zs...@gmail.com> wrote:
>
> > Hi Billy,
> >
> > I very much agree that "Hbase would be better suited to store the meta
> > data in place of the images", with the files stored in HDFS or another
> > storage system such as S3. But for small files, an S3-like object storage
> > system will be better.
> >
> > Another issue to discuss with you:
> > How many tablets/regions does each of your HBase region servers serve in
> > practice? The Bigtable paper suggests at most hundreds.
> >
> > Schubert
> >
> > On Mon, Jun 8, 2009 at 2:28 PM, Billy Pearson <sales@pearsonwholesale.com>
> > wrote:
> >
> > > If I were going to use an RDBMS to store the meta data, then I would
> > > just use Hadoop HDFS to store the images/video.
> > > I know that Hadoop has a Thrift API now:
> > > http://wiki.apache.org/hadoop/HDFS-APIs
> > >
> > > Hbase would be better suited to store the meta data in place of the
> > > images.
> > > The biggest benefit of HBase is that you can scale both the reads and
> > > the writes to the db, not just the reads as in most RDBMSs.
> > >
> > > So you should be able to work with the files in Hadoop in any language,
> > > as long as you can get Hadoop working correctly on Windows.
> > > The benefit of this is that you can scale Hadoop as needed to hold more
> > > data.
> > > The downside is the memory that will be required for the namenode.
> > > I think it's something like 3M files per GB of memory.
> > >
> > >
> > >
> > >
> > >
> > > "Nitin Gupta" <ni...@gmail.com> wrote in message
> > > news:003c01c9e7fc$2087df00$61979d00$@com...
> > >
> > >  Jonathan,
> > >>
> > >> Thanks for the detailed explanation. Very helpful.
> > >>
> > >> As far as file size is concerned, we may even be required to save
> > >> videos in the future, so we will definitely go above the HBase size
> > >> limit at some point. Is there any other solution or key-value database
> > >> that you can recommend for our case?
> > >>
> > >> I am not very knowledgeable about HDFS either. I think that if we go
> > >> with pure HDFS, all the required DB operations would have to be custom
> > >> developed on top of HDFS. For our needs, do you think HDFS already has
> > >> enough support that we will not need any major custom development? We
> > >> are just saving the files/attachments and retrieving them with some
> > >> basic search.
> > >>
> > >> Regards,
> > >> Nitin
> > >>
> > >> -----Original Message-----
> > >> From: Jonathan Gray [mailto:jlist@streamy.com]
> > >> Sent: Sunday, June 07, 2009 9:30 PM
> > >> To: hbase-user@hadoop.apache.org
> > >> Subject: Re: Help needed - Adding HBase to architecture
> > >>
> > >> Nitin,
> > >>
> > >> HBase stores arbitrary binary values (row keys, column qualifiers, and
> > >> column values), so it is certainly capable of storing and serving files
> > >> and images.
> > >>
> > >> My only real question before I would give you a +1 on your idea is what
> > >> you expect the range of file sizes to be.  While HBase allows you to
> > >> store values up to length Integer.MAX_VALUE, that is not recommended and
> > >> in past versions has led to memory issues (OOME and such).
> > >>
> > >> Images, text, Word/Excel docs, etc. should be no problem.  But I don't
> > >> recommend storing things in the upper 10s or 100s of MB, though it's
> > >> probably possible with a little work adjusting some configuration
> > >> parameters.  In general, if you are approaching HDFS block size, then
> > >> you really just want HDFS and not HBase :)
> > >>
> > >> We are not currently running this in production, but we have had an
> > >> experimental version of our media server that runs on top of HBase
> > >> rather than the file system.  It has a series of Python scripts
> > >> (connected to HBase through our custom interface; you could use Java
> > >> directly or Thrift/REST/etc.) that are responsible for generating
> > >> various thumbnail sizes.  The originals are stored in HBase, and then a
> > >> special query is run to grab the thumbnail of a certain size.  If it
> > >> exists in HBase already, it is just fetched and returned.  Otherwise, it
> > >> is generated (via PIL, the Python Imaging Library, and some other custom
> > >> tools), stored in HBase, and then returned to the client.
> > >>
> > >> As far as HBase on Windows goes... It's currently not possible, but
> > >> there has been some effort from Powerset/Microsoft to make it happen.  I
> > >> will yield to those more familiar with it.
> > >>
> > >> Personally, I run Windows on my primary work desktop and spend a good
> > >> chunk of my time on HBase development.  When I've wanted to spin up
> > >> pseudo-distributed local clusters, I usually use a cheap Linux node or
> > >> local virtual machine.  In both cases, I use a Windows X server and
> > >> redirect output to my local Windows machine so I can run Eclipse and
> > >> unit tests from my Windows GUI.  Others have used Cygwin with some
> > >> success, I believe.
> > >>
> > >> Hope that sheds some light for you.
> > >>
> > >> You are almost certainly right about not wanting to store this in an
> > >> RDBMS.  And a hybrid approach seems to make sense, especially as a
> > >> first step.
> > >>
> > >> Jonathan Gray
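
The fetch-or-generate thumbnail flow Jonathan describes can be sketched roughly as follows. This is only an illustration under stated assumptions: plain dictionaries stand in for the HBase tables, `ThumbnailCache` is a hypothetical name, and `fake_resize` is a placeholder for a real imaging call (the thread mentions PIL), so nothing here reflects the actual media server.

```python
from typing import Callable, Dict, Tuple

Size = Tuple[int, int]


class ThumbnailCache:
    """Serve a stored thumbnail if present, otherwise generate and store it."""

    def __init__(self, resize: Callable[[bytes, Size], bytes]):
        self._resize = resize
        self._originals: Dict[str, bytes] = {}            # stand-in for the originals
        self._thumbs: Dict[Tuple[str, Size], bytes] = {}  # stand-in for thumbnails
        self.generated = 0                                # how often we had to resize

    def put_original(self, image_id: str, data: bytes) -> None:
        self._originals[image_id] = data

    def get_thumbnail(self, image_id: str, size: Size) -> bytes:
        key = (image_id, size)
        cached = self._thumbs.get(key)
        if cached is not None:
            return cached                                 # already stored: just return it
        thumb = self._resize(self._originals[image_id], size)
        self._thumbs[key] = thumb                         # store before returning
        self.generated += 1
        return thumb


def fake_resize(data: bytes, size: Size) -> bytes:
    # Placeholder only: tags the first bytes with the requested size.
    return b"%dx%d:" % size + data[:8]


cache = ThumbnailCache(fake_resize)
cache.put_original("img1", b"raw-image-bytes")
first = cache.get_thumbnail("img1", (64, 64))   # generated and stored
second = cache.get_thumbnail("img1", (64, 64))  # served from the store
print(first == second, cache.generated)
```

Keying the stored thumbnail on both the image id and the requested size is one way to let several sizes of the same original coexist; the real schema is not described in the thread.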



      

Re: Help needed - Adding HBase to architecture

Posted by zsongbo <zs...@gmail.com>.
Thanks, Andy and stack, for sharing your experience. In my data management
system with HBase, I want to store a HUGE amount of data.
But assuming each node serves 1000 regions (250MB each), only 250GB of
storage is used.  We have 2TB of disk on each node.
So, for now, we store the data in files in HDFS and create a simple index to
query and locate these files.

Schubert
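
The capacity arithmetic in this message, and the 2000 regions with 1GB store files that Andy mentions, can be checked with a few lines. This is a back-of-the-envelope sketch only: the function names are made up for illustration, 1 GB is taken as 1000 MB, and HDFS replication and compression are ignored.

```python
def node_capacity_gb(regions_per_node: int, region_size_mb: int) -> float:
    """Approximate data one region server can hold, in GB (1 GB = 1000 MB)."""
    return regions_per_node * region_size_mb / 1000


def regions_needed(disk_gb: int, region_size_mb: int) -> int:
    """Regions required to fill disk_gb with regions of the given size."""
    return -(-disk_gb * 1000 // region_size_mb)  # ceiling division


print(node_capacity_gb(1000, 250))   # 1000 regions of 250MB each
print(node_capacity_gb(2000, 1000))  # 2000 regions of 1GB each
print(regions_needed(2000, 250))     # regions of 250MB needed to fill a 2TB disk
```

Under these assumptions, filling a 2TB disk with 250MB regions would take several thousand regions per node, far more than the per-node counts reported elsewhere in the thread, which is why larger regions come up as the alternative.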


Re: Help needed - Adding HBase to architecture

Posted by Andrew Purtell <ap...@apache.org>.
0.19 will have trouble compacting regions with large store files (> 1GB),
especially if they are compressed. 

0.20 is such a game changer that all the old experience and assumptions
will have to be thrown out and all of this testing redone. That is a very
good thing! :-) Kudos to all those who rebuilt the region server for this
release.

  - Andy




________________________________
From: stack <st...@duboce.net>
To: hbase-user@hadoop.apache.org
Sent: Saturday, June 13, 2009 12:13:58 PM
Subject: Re: Help needed - Adding HBase to architecture

At Powerset, we have ~80 regions per node on > 100 nodes.

I've seen other clusters with hundreds and in testing have come close to a
thousand per node.

When a node has this many regions on board and it crashes, it's going to take
a while to recover.

We've not played with it in a while but regions could be fatter.  By
default, the biggest store file in a region is < 256M.  Depending on the type
of your data and your access patterns, we should probably look at doubling or
quadrupling this size.  Then we could carry low hundreds of regions but they'd
have more heft to them.

St.Ack

On Sat, Jun 13, 2009 at 11:59 AM, zsongbo <zs...@gmail.com> wrote:

> Hi Billy,
>
> I very much agree that "Hbase would be better suited to store the meta data
> in place of the images", with the files stored in HDFS or another storage
> system such as S3. But for small files, an S3-like object storage system
> will be better.
>
> Another issue to discuss with you:
> How many tablets/regions does each of your HBase region servers serve in
> practice? The Bigtable paper suggests at most hundreds.
>
> Schubert
>
> On Mon, Jun 8, 2009 at 2:28 PM, Billy Pearson <sales@pearsonwholesale.com>
> wrote:
>
> > If I were going to use an RDBMS to store the meta data, then I would just
> > use Hadoop HDFS to store the images/video.
> > I know that Hadoop has a Thrift API now:
> > http://wiki.apache.org/hadoop/HDFS-APIs
> >
> > Hbase would be better suited to store the meta data in place of the
> > images.
> > The biggest benefit of HBase is that you can scale both the reads and the
> > writes to the db, not just the reads as in most RDBMSs.
> >
> > So you should be able to work with the files in Hadoop in any language, as
> > long as you can get Hadoop working correctly on Windows.
> > The benefit of this is that you can scale Hadoop as needed to hold more
> > data.
> > The downside is the memory that will be required for the namenode.
> > I think it's something like 3M files per GB of memory.
> >
> >
> >
> >
> >
> > "Nitin Gupta" <ni...@gmail.com> wrote in message
> > news:003c01c9e7fc$2087df00$61979d00$@com...
> >
> >  Jonathan,
> >>
> >> Thanks for detailed explanation. Much helpful.
> >>
> >> As far as file size is concerned, we may be even required to save Videos
> >> in
> >> future. So we shall def go above the HBase size limit at some point in
> >> time.
> >> Any other solution or key-value database that you can recommend for our
> >> case?
> >>
> >> I am not much knowledgeable about the HDFS either. I think if we go with
> >> pure HDFS, then all the required DB operations would have to be custom
> >> developed on top of HDFS. For our needs, do you think that HDFS already
> >> has
> >> enough support that we will not need any major custom development. We
> are
> >> just saving the files/attachements and retrieving them with some basic
> >> search.
> >>
> >> Regards,
> >> Nitin
> >>
> >> -----Original Message-----
> >> From: Jonathan Gray [mailto:jlist@streamy.com]
> >> Sent: Sunday, June 07, 2009 9:30 PM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Re: Help needed - Adding HBase to architecture
> >>
> >> Nitin,
> >>
> >> HBase stores arbitrary binary values (row keys, column qualifiers, and
> >> column values), so it is certainly capable of storing and serving files
> >> and images.
> >>
> >> My only real question before I would give you a +1 on your idea is what
> >> you expect the range of file sizes to be.  While HBase allows you to
> store
> >> values up to length Integer.MAX_VALUE, that is not recommended and in
> past
> >> versions has lead to memory issues (OOME and such).
> >>
> >> Images, text, word/excel docs, etc... should be no problem.  But I don't
> >> recommend storing things in the upper 10s or 100s of MB, though it's
> >> probably possible with a little work adjusting some configuration
> >> parameters.  In general, if you are approaching HDFS block size, then
> you
> >> really just want HDFS and not HBase :)
> >>
> >> We are not currently running this in production, but we have had an
> >> experimental version of our media server that runs on top of HBase
> rather
> >> than the file system.  It has a series of Python scripts (connected to
> >> HBase through our custom interface, you could use Java directly or
> >> Thrift/REST/etc) that are responsible for generating various thumbnail
> >> sizes.  The originals are stored in HBase, and then a special query is
> run
> >> to grab the thumbnail of a certain size.  If it exists in HBase already,
> >> it is just fetched and returned.  Otherwise, it is generated (via PIL,
> >> Python Imaging Library, and some other custom tools), stored in HBase,
> and
> >> then returned to the client.
> >>
> >> As far as HBase on Windows goes... It's currently not possible but there
> >> has been some effort from Powerset/Microsoft to make it happen.  I will
> >> yield to those more familiar with it.
> >>
> >> Personally, I run Windows on my primary work desktop and spend a good
> >> chunk of my time on HBase development.  When I've wanted to spin up
> >> pseudo-distributed local clusters, I usually use a cheap Linux node or
> >> local Virtual Machine.  In both cases, I use a Windows X Server and
> >> redirect output to my local Windows machine so I can run Eclipse and
> unit
> >> tests from my Windows GUI.  Others have used Cygwin with some success, I
> >> believe.
> >>
> >> Hope that sheds some light for you.
> >>
> >> You are almost certainly right about not wanting to store this in an
> >> RDBMS.  And a hybrid approach seems to make sense, especially as a first
> >> step.
> >>
> >> Jonathan Gray
> >>
> >>
> >>
> >> On Sun, June 7, 2009 6:44 am, Nitin Gupta wrote:
> >>
> >>> Hi All,
> >>> I am working on an application which is kind of a social network on
> >>> mobile
> >>>  WAP. Recently, we have incorporated the files or attachments support
> in
> >>> our application. Right now, since we are not in production yet, we are
> >>> keeping all the files in the RDBMS which our application is using. But
> I
> >>> am more than convinvced that this is not going to work once we are in
> >>> production mode.
> >>>
> >>> I got to know about HBase and I am making myself convice about its
> usage
> >>> for the file storage, search and retrieval operations. I would like my
> >>> opinion to be endorsed by expert HBase users/developers. Just for the
> >>> clarification, here is what I am planning to do:
> >>>
> >>> Make use of a RDBMS for relational data in the application.
> >>> All the files/blob data to be saved in the HBase.
> >>> When required, my application can query app data from the RDBMS and the
> >>> files can be retrieved from the HBase data store I will keep the meta
> >>> data
> >>> of the files in my rdbms so that files can be associated with my apps
> >>> entities
> >>>
> >>> Please help me decide if this is the right approach. My app is supposed
> >>> to provide support for images as well. So if anyone can advice if HBase
> >>> is
> >>> the right solution for me, in conjuction with an imaging tool.
> >>>
> >>> Since my team is predominantly Windows based, I would like to know is
> it
> >>> possible to run HBase on a windows machine in stand alone and in
> >>> clustered
> >>>  mode.
> >>>
> >>> Thanks for all your help.
> >>>
> >>>
> >>> nitin
> >>>
> >>>
> >>
> >>
> >
> >
>



      

Re: Help needed - Adding HBase to architecture

Posted by stack <st...@duboce.net>.
At Powerset, we have ~80 regions per node on > 100 nodes.

I've seen other clusters with hundreds and in testing have come close to a
thousand per node.

When a node has this many regions on board and it crashes, it's going to take
a while to recover.

We've not played with it in a while, but regions could be fatter.  By
default, the biggest store file in a region is < 256 MB.  Depending on the
type of your data and your access patterns, we should probably look at
doubling or quadrupling this size.  A node could then carry low hundreds of
regions, but they'd have more heft to them.
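
As a rough illustration of the trade-off described above, the sketch below
estimates how many regions a node would carry at different maximum store file
sizes. It assumes, simplistically, that every region has grown to the size
limit and that a node holds about 20 GB of table data (roughly 80 regions at
256 MB); both the helper name and the data volume are hypothetical, and real
region sizes will vary.

```python
import math

MB = 1024 * 1024


def regions_per_node(data_bytes_per_node, max_store_file_bytes):
    """Rough region count, assuming every region has reached the size limit."""
    return math.ceil(data_bytes_per_node / max_store_file_bytes)


# Hypothetical node: about 20 GB of table data (80 regions at 256 MB each).
data_per_node = 80 * 256 * MB

for limit_mb in (256, 512, 1024):
    count = regions_per_node(data_per_node, limit_mb * MB)
    print(f"{limit_mb} MB limit -> ~{count} regions per node")
```

Under these assumptions, each doubling of the limit halves the number of
regions a node has to reopen after a crash, at the cost of larger store files
to compact.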

St.Ack


Re: Help needed - Adding HBase to architecture

Posted by Andrew Purtell <ap...@apache.org>.
I once had each region server in a 25-node cluster serving over 1000 regions
(HBase 0.19.0, 2GB heap for each region server). This setup was able to
sustain some scanner and insert load simultaneously. Based on my personal
experience, I can claim that HBase can support more than hundreds of regions
per region server. I can't say how many thousands; such a table would be far
larger than any I generated in my testing.

The trick is to keep HDFS healthy under the extreme load of having HBase
sitting on top with thousands of store files open and scanners iterating over
them. I had some trouble with this. I have since moved on from that test
cluster, so I unfortunately did not have the opportunity to test whether
HADOOP-4681 (https://issues.apache.org/jira/browse/HADOOP-4681) would have
helped. I suspect it would have, as "three strikes and you are out" was quite
a bad, if unintentional, error-handling strategy on the part of DFSClient.

The upcoming HBase 0.20 release cuts the load on HDFS roughly in half, which
will obviously positively impact the stability of the cluster at large scale.
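
To give a feel for where "thousands of store files open" comes from, here is
a back-of-the-envelope sketch. The function and all of its default values
(column families per table, store files per family, HDFS files per store
file) are hypothetical illustrations, not measurements from the cluster
described above.

```python
def open_hdfs_files(regions, families_per_table=1, store_files_per_family=3,
                    hdfs_files_per_store_file=1):
    """Estimate HDFS files one region server keeps open (illustrative only)."""
    return (regions * families_per_table * store_files_per_family
            * hdfs_files_per_store_file)


# 1000 regions on one region server, using the assumed defaults above.
print(open_hdfs_files(1000))
# Same server if each store file were backed by two HDFS files instead of one.
print(open_hdfs_files(1000, hdfs_files_per_store_file=2))
```

The count grows multiplicatively, so halving any one factor halves the number
of files HDFS has to serve. Whether that is how 0.20 achieves its reduction
is not stated in this thread.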

  - Andy






Re: Help needed - Adding HBase to architecture

Posted by zsongbo <zs...@gmail.com>.
Hi Billy,

I very much agree that "Hbase would be better suited to store the meta data
in place of the images", with the files stored in HDFS or another storage
system such as S3. But for small files, an S3-like object storage system will
be better.

Another issue to discuss with you:
How many tablets/regions does each of your HBase region servers serve in
practice? The Bigtable paper suggests at most hundreds.

Schubert


Re: Help needed - Adding HBase to architecture

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
If I were going to use an RDBMS to store the metadata, then I would just use
Hadoop HDFS to store the images/video.
I know that Hadoop has a Thrift API now:
http://wiki.apache.org/hadoop/HDFS-APIs

Hbase would be better suited to store the meta data in place of the images.
The biggest benefit of HBase is that you can scale both the reads and the
writes to the db, not just the reads as in most RDBMSs.

So you should be able to work with the files in Hadoop in any language, as
long as you can get Hadoop working correctly on Windows.
The benefit of this is that you can scale Hadoop as needed to hold more data.
The downside is the memory that will be required for the namenode;
I think it's something like 3 million files per GB of memory.
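
To make the namenode memory concern concrete, the sketch below turns the
rough figure quoted above (about 3 million files per GB of namenode memory)
into an estimate for a given file count. The ratio is only the recollection
given in this message, not a measured value, and the function name is
hypothetical.

```python
# Rough ratio from the message above; treat it as an assumption.
FILES_PER_GB = 3_000_000


def namenode_heap_gb(file_count, files_per_gb=FILES_PER_GB):
    """Estimate namenode memory (GB) needed to track file_count files."""
    return file_count / files_per_gb


for files in (1_000_000, 30_000_000, 300_000_000):
    print(f"{files:,} files -> ~{namenode_heap_gb(files):.1f} GB")
```

If the ratio holds, tens of millions of small attachments would need on the
order of 10 GB of namenode memory, which is why many small files are a
concern for plain HDFS.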







RE: Help needed - Adding HBase to architecture

Posted by Nitin Gupta <ni...@gmail.com>.
Jonathan,

Thanks for the detailed explanation. It was very helpful.

As far as file size is concerned, we may even be required to save videos in
the future, so we will definitely go above the sizes recommended for HBase
at some point. Is there any other solution or key-value database that you
can recommend for our case?

I am not very knowledgeable about HDFS either. I think if we go with
pure HDFS, then all the required DB operations would have to be custom
developed on top of HDFS. For our needs, do you think that HDFS already has
enough support that we will not need any major custom development? We are
just saving the files/attachments and retrieving them with some basic
search.

Regards,
Nitin

-----Original Message-----
From: Jonathan Gray [mailto:jlist@streamy.com] 
Sent: Sunday, June 07, 2009 9:30 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Help needed - Adding HBase to architecture

Nitin,

HBase stores arbitrary binary values (row keys, column qualifiers, and
column values), so it is certainly capable of storing and serving files
and images.

My only real question before I would give you a +1 on your idea is what
you expect the range of file sizes to be.  While HBase allows you to store
values up to length Integer.MAX_VALUE, that is not recommended and in past
versions has led to memory issues (OOME and such).

Images, text, Word/Excel docs, etc. should be no problem.  But I don't
recommend storing things in the upper tens or hundreds of MB, though it's
probably possible with a little work adjusting some configuration
parameters.  In general, if you are approaching HDFS block size, then you
really just want HDFS and not HBase :)

We are not currently running this in production, but we have had an
experimental version of our media server that runs on top of HBase rather
than the file system.  It has a series of Python scripts (connected to
HBase through our custom interface; you could use Java directly or
Thrift/REST/etc.) that are responsible for generating various thumbnail
sizes.  The originals are stored in HBase, and then a special query is run
to grab the thumbnail of a certain size.  If it exists in HBase already,
it is just fetched and returned.  Otherwise, it is generated (via PIL,
Python Imaging Library, and some other custom tools), stored in HBase, and
then returned to the client.

As far as HBase on Windows goes... It's currently not possible but there
has been some effort from Powerset/Microsoft to make it happen.  I will
yield to those more familiar with it.

Personally, I run Windows on my primary work desktop and spend a good
chunk of my time on HBase development.  When I've wanted to spin up
pseudo-distributed local clusters, I usually use a cheap Linux node or
local Virtual Machine.  In both cases, I use a Windows X Server and
redirect output to my local Windows machine so I can run Eclipse and unit
tests from my Windows GUI.  Others have used Cygwin with some success, I
believe.

Hope that sheds some light for you.

You are almost certainly right about not wanting to store this in an
RDBMS.  And a hybrid approach seems to make sense, especially as a first
step.

Jonathan Gray



On Sun, June 7, 2009 6:44 am, Nitin Gupta wrote:
> Hi All,
> I am working on an application which is kind of a social network on mobile
> WAP. Recently, we have incorporated file and attachment support in our
> application. Right now, since we are not in production yet, we are keeping
> all the files in the RDBMS which our application is using. But I am more
> than convinced that this is not going to work once we are in production
> mode.
>
> I got to know about HBase and I am trying to convince myself about its usage
> for file storage, search and retrieval operations. I would like my opinion
> to be endorsed by expert HBase users/developers. Just for clarification,
> here is what I am planning to do:
>
> Make use of an RDBMS for relational data in the application.
> All the files/blob data to be saved in HBase.
> When required, my application can query app data from the RDBMS and the
> files can be retrieved from the HBase data store.
> I will keep the metadata of the files in my RDBMS so that files can be
> associated with my app's entities.
>
> Please help me decide if this is the right approach. My app is supposed to
> provide support for images as well, so I would appreciate advice on whether
> HBase is the right solution for me, in conjunction with an imaging tool.
>
> Since my team is predominantly Windows based, I would like to know whether
> it is possible to run HBase on a Windows machine in standalone and in
> clustered mode.
>
> Thanks for all your help.
>
>
> nitin
>


Re: Help needed - Adding HBase to architecture

Posted by Jonathan Gray <jl...@streamy.com>.
Nitin,

HBase stores arbitrary binary values (row keys, column qualifiers, and
column values), so it is certainly capable of storing and serving files
and images.

My only real question before I would give you a +1 on your idea is what
you expect the range of file sizes to be.  While HBase allows you to store
values up to length Integer.MAX_VALUE, that is not recommended and in past
versions has led to memory issues (OOME and such).

Images, text, Word/Excel docs, etc. should be no problem.  But I don't
recommend storing things in the upper tens or hundreds of MB, though it's
probably possible with a little work adjusting some configuration
parameters.  In general, if you are approaching HDFS block size, then you
really just want HDFS and not HBase :)
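
As a rough illustration of that rule of thumb, here is a minimal Python
sketch that routes a file to HBase or HDFS by size. The 10 MB cutoff and
the names choose_store and HBASE_MAX_VALUE_BYTES are assumptions made for
the example, not HBase settings; the right threshold depends on your
cluster and configuration.

```python
# Assumed cutoff for illustration only; this is not an HBase constant.
HBASE_MAX_VALUE_BYTES = 10 * 1024 * 1024


def choose_store(size_bytes, hbase_limit=HBASE_MAX_VALUE_BYTES):
    """Return "hbase" for small values and "hdfs" for large ones."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "hbase" if size_bytes <= hbase_limit else "hdfs"
```

With this cutoff a 200 KB image would go to HBase, while a 700 MB video
would go straight to HDFS.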

We are not currently running this in production, but we have had an
experimental version of our media server that runs on top of HBase rather
than the file system.  It has a series of Python scripts (connected to
HBase through our custom interface; you could use Java directly or
Thrift/REST/etc.) that are responsible for generating various thumbnail
sizes.  The originals are stored in HBase, and then a special query is run
to grab the thumbnail of a certain size.  If it exists in HBase already,
it is just fetched and returned.  Otherwise, it is generated (via PIL,
Python Imaging Library, and some other custom tools), stored in HBase, and
then returned to the client.
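
The fetch-or-generate flow described above could be sketched roughly as
follows. This is not our actual code: the store is any dict-like object
standing in for an HBase table, resize is a caller-supplied function (for
example a wrapper around PIL), and the class name and the "id/WxH" key
layout are hypothetical choices for the example.

```python
class ThumbnailCache:
    """Get-or-generate thumbnails on top of a key-value store.

    `store` is a dict-like stand-in for an HBase table.
    `resize` is a function (image_bytes, (width, height)) -> bytes.
    """

    def __init__(self, store, resize):
        self.store = store
        self.resize = resize

    @staticmethod
    def _key(image_id, size=None):
        # Originals live under "id", thumbnails under "id/WxH".
        if size is None:
            return image_id
        return "%s/%dx%d" % (image_id, size[0], size[1])

    def put_original(self, image_id, data):
        self.store[self._key(image_id)] = data

    def get_thumbnail(self, image_id, size):
        key = self._key(image_id, size)
        cached = self.store.get(key)
        if cached is not None:
            return cached  # already generated: just fetch and return
        # Raises KeyError if the original was never stored.
        original = self.store[self._key(image_id)]
        thumb = self.resize(original, size)
        self.store[key] = thumb  # keep it for the next request
        return thumb
```

The resize function is passed in so the caching logic stays independent
of the imaging library; the first request for a given size pays the
generation cost and later requests are plain reads.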

As far as HBase on Windows goes... It's currently not possible but there
has been some effort from Powerset/Microsoft to make it happen.  I will
yield to those more familiar with it.

Personally, I run Windows on my primary work desktop and spend a good
chunk of my time on HBase development.  When I've wanted to spin up
pseudo-distributed local clusters, I usually use a cheap Linux node or
local Virtual Machine.  In both cases, I use a Windows X Server and
redirect output to my local Windows machine so I can run Eclipse and unit
tests from my Windows GUI.  Others have used Cygwin with some success, I
believe.

Hope that sheds some light for you.

You are almost certainly right about not wanting to store this in an
RDBMS.  And a hybrid approach seems to make sense, especially as a first
step.
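
To make the hybrid idea concrete, here is a minimal sketch, assuming
SQLite as a stand-in for your RDBMS and a plain dict as a stand-in for the
HBase table. The table layout, column names, and function names are all
hypothetical; the point is only that the RDBMS holds the metadata plus a
key, and the blob store holds the bytes under that key.

```python
import sqlite3
import uuid


def open_metadata_db(path=":memory:"):
    """Create the (hypothetical) attachment metadata table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS attachments ("
        " blob_key TEXT PRIMARY KEY,"
        " owner_id INTEGER NOT NULL,"
        " filename TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return conn


def save_attachment(conn, blob_store, owner_id, filename, data):
    """Write the bytes to the blob store, then record metadata in the RDBMS."""
    blob_key = uuid.uuid4().hex
    blob_store[blob_key] = data
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO attachments VALUES (?, ?, ?, ?)",
            (blob_key, owner_id, filename, len(data)),
        )
    return blob_key


def find_attachments(conn, blob_store, owner_id):
    """Search metadata in the RDBMS, then fetch each blob by its key."""
    rows = conn.execute(
        "SELECT blob_key, filename FROM attachments"
        " WHERE owner_id = ? ORDER BY filename",
        (owner_id,),
    ).fetchall()
    return [(filename, blob_store[key]) for key, filename in rows]
```

Searching stays in SQL, where it is easy, and the blob store only ever
sees lookups by key.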

Jonathan Gray



On Sun, June 7, 2009 6:44 am, Nitin Gupta wrote:
> Hi All,
> I am working on an application which is kind of a social network on mobile
> WAP. Recently, we have incorporated file and attachment support in our
> application. Right now, since we are not in production yet, we are keeping
> all the files in the RDBMS which our application is using. But I am more
> than convinced that this is not going to work once we are in production
> mode.
>
> I got to know about HBase and I am trying to convince myself about its usage
> for file storage, search and retrieval operations. I would like my opinion
> to be endorsed by expert HBase users/developers. Just for clarification,
> here is what I am planning to do:
>
> Make use of an RDBMS for relational data in the application.
> All the files/blob data to be saved in HBase.
> When required, my application can query app data from the RDBMS and the
> files can be retrieved from the HBase data store.
> I will keep the metadata of the files in my RDBMS so that files can be
> associated with my app's entities.
>
> Please help me decide if this is the right approach. My app is supposed to
> provide support for images as well, so I would appreciate advice on whether
> HBase is the right solution for me, in conjunction with an imaging tool.
>
> Since my team is predominantly Windows based, I would like to know whether
> it is possible to run HBase on a Windows machine in standalone and in
> clustered mode.
>
> Thanks for all your help.
>
>
> nitin
>