Posted to user@hbase.apache.org by "Jim R. Wilson" <wi...@gmail.com> on 2008/05/08 17:19:52 UTC

[hbase-user] about to give up on hbase/ec2

Unfortunately, I'm about to give up on hbase over ec2.

In my application, the hbase storage is very simple, write-once text
storage.  To get this to work on ec2, I've concluded I need the
following:

1. A cluster of hadoop machines running an appropriate version of
hadoop (0.16.3 at the time of this writing)

2. Hbase running on the same cluster, either connected to S3, which
I've been warned is slow, or to HDFS on top of PersistentFS, which may
or may not fare better.

3. Thrift service running atop hbase for interaction from remote
(outside ec2) Python and PHP scripts.

4. Static IPs for any hadoop nodes running data-transfer jobs due to
firewall restrictions on the MySQL end (outside ec2), and also so that
the Python/PHP scripts know where to find Thrift.

5. A mechanism to force all hbase nodes to write any memory-resident
changes to disk for backup purposes (Java).

Now, my particular problem is very simple - just numeric key to text
storage.  Ex: { "1":"Hello", "2":"World" }.  I've (nearly) come to the
conclusion that I would be much better off either:

a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
hierarchical dir structure to keep the directories small - I've got
about 4 million such number/text pairs at the moment).

b. Using SimpleDB (which I've yet to learn, but expect to be similar
to hbase/BigTable)

c. Running an hbase/hadoop cluster somewhere else (I already have a
single-node cluster working great on our hosting provider's internal
network).
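
For option (a), the whole trick would be a key scheme that keeps each
"directory" prefix small. Something like this sketch (the padding
width and two-level fan-out are arbitrary choices on my part, not
anything S3 requires):

```python
def s3_key(num_id, ext="txt"):
    """Map a numeric id to a hierarchical key, e.g. 1234567 -> '01/23/1234567.txt'.

    Zero-padding to 8 digits and using the two leading digit pairs as
    "directories" caps each leaf prefix at 10,000 objects, which is
    comfortable for ~4 million number/text pairs.
    """
    padded = "%08d" % num_id
    return "%s/%s/%d.%s" % (padded[0:2], padded[2:4], num_id, ext)
```

Fetching item 1234567 back would then just be a GET of that key - no
cluster required.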

So unless the process is drastically simpler than I've estimated, I
think my next stop is going to be a SimpleDB tutorial, keeping my
hbase work handy as another alternative.

-- Jim R. Wilson (jimbojw)

Re: [hbase-user] about to give up on hbase/ec2

Posted by "Jim R. Wilson" <wi...@gmail.com>.
>  So unless the process is drastically simpler than I've estimated, I
>  think my next stop is going to be a SimpleDB tutorial, keeping my
>  hbase work handy as another alternative.

Well, SimpleDB is out - the limited beta is closed. That leaves me with just S3.

-- Jim


On Thu, May 8, 2008 at 10:19 AM, Jim R. Wilson <wi...@gmail.com> wrote:
> Unfortunately, I'm about to give up on hbase over ec2.
>
>  In my application, the hbase storage is very simple, write-once text
>  storage.  To get this to work on ec2, I've concluded I need the
>  following:
>
>  1. A cluster of hadoop machines running an appropriate version of
>  hadoop (0.16.3 at the time of this writing)
>
>  2. Hbase running on the same cluster, either connected to S3, which
>  I've been warned is slow, or to HDFS on top of PersistentFS, which may
>  or may not fare better.
>
>  3. Thrift service running atop hbase for interaction from remote
>  (outside ec2) Python and PHP scripts.
>
>  4. Static IPs for any hadoop nodes running data-transfer jobs due to
>  firewall restrictions on the MySQL end (outside ec2), and also so that
>  the Python/PHP scripts know where to find Thrift.
>
>  5. A mechanism to force all hbase nodes to write any memory-resident
>  changes to disk for backup purposes (Java).
>
>  Now, my particular problem is very simple - just numeric key to text
>  storage.  Ex: { "1":"Hello", "2":"World" }.  I've (nearly) come to the
>  conclusion that I would be much better off either:
>
>  a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
>  hierarchical dir structure to keep the directories small - I've got
>  about 4 million such number/text pairs at the moment).
>
>  b. Using SimpleDB (which I've yet to learn, but expect to be similar
>  to hbase/BigTable)
>
>  c. Running an hbase/hadoop cluster somewhere else (I already have a
>  single-node cluster working great on our hosting provider's internal
>  network).
>
>  So unless the process is drastically simpler than I've estimated, I
>  think my next stop is going to be a SimpleDB tutorial, keeping my
>  hbase work handy as another alternative.
>
>  -- Jim R. Wilson (jimbojw)
>

Re: [hbase-user] about to give up on hbase/ec2

Posted by "Jim R. Wilson" <wi...@gmail.com>.
> how large are the text values to the numeric keys?

The vast majority (75%+) are less than 100 bytes; the other 25% go up
to around 130kb.  However, there is some clustering available -
there's a good chance I'll be able to batch many small items
together into one gzip bundle.
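
That batching could be as simple as newline-framed records in one gzip
blob per bundle. A rough sketch (the framing format here is my own
invention, nothing standard):

```python
import gzip
import io
import json


def pack_bundle(items):
    """Compress a dict of {key: text} into a single gzip blob (bytes).

    Each record is one line of JSON, so small items amortize the
    per-object overhead of storing them individually.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for key, text in sorted(items.items()):
            gz.write((json.dumps([key, text]) + "\n").encode("utf-8"))
    return buf.getvalue()


def unpack_bundle(blob):
    """Inverse of pack_bundle: gzip blob -> {key: text}."""
    items = {}
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as gz:
        for line in gz:
            key, text = json.loads(line.decode("utf-8"))
            items[key] = text
    return items
```

One bundle per key range would land as a single S3 object, so reads
only pay the decompression cost for the bundle they touch.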

-- Jim

On Thu, May 8, 2008 at 10:56 AM, Chris K Wensel <ch...@wensel.net> wrote:
> how large are the text values to the numeric keys?
>
>  I'm running a >40 node Hadoop cluster that launches ~40 mr jobs to do
> nothing more than bin event streams by symbol, apply some math, and stuff
> them into S3 (as zip files, ugh) for pickup. These zips are in the few-megs
> size range, and I have about 20k symbols (currently; the next app will have
> 200k symbols). (cascading makes and manages all the mr jobs for me.)
>
>  Once we have validated all the result data sets, we will probably start
> mirroring a subset of the data (it's daily) in Hbase for further ad hoc
> query support.
>
>  Point being, we are tackling one piece at a time: get hadoop/ec2
> up and stable, run larger and larger jobs/clusters, validate data, improve
> data accessibility, etc. We would have gone mad trying to get it all
> up and going in one shot.
>
>  Also, keep in mind EC2 will have permanent local storage soon. So backing
> up incrementally to S3 may not be necessary depending on the SLA for the
> storage. So a long lived Hadoop cluster can be as permanent as any local
> cluster in a datacenter.
>
>  ckw
>
>
>
>  On May 8, 2008, at 8:19 AM, Jim R. Wilson wrote:
>
>
> > Unfortunately, I'm about to give up on hbase over ec2.
> >
> > In my application, the hbase storage is very simple, write-once text
> > storage.  To get this to work on ec2, I've concluded I need the
> > following:
> >
> > 1. A cluster of hadoop machines running an appropriate version of
> > hadoop (0.16.3 at the time of this writing)
> >
> > 2. Hbase running on the same cluster, either connected to S3, which
> > I've been warned is slow, or to HDFS on top of PersistentFS, which may
> > or may not fare better.
> >
> > 3. Thrift service running atop hbase for interaction from remote
> > (outside ec2) Python and PHP scripts.
> >
> > 4. Static IPs for any hadoop nodes running data-transfer jobs due to
> > firewall restrictions on the MySQL end (outside ec2), and also so that
> > the Python/PHP scripts know where to find Thrift.
> >
> > 5. A mechanism to force all hbase nodes to write any memory-resident
> > changes to disk for backup purposes (Java).
> >
> > Now, my particular problem is very simple - just numeric key to text
> > storage.  Ex: { "1":"Hello", "2":"World" }.  I've (nearly) come to the
> > conclusion that I would be much better off either:
> >
> > a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
> > hierarchical dir structure to keep the directories small - I've got
> > about 4 million such number/text pairs at the moment).
> >
> > b. Using SimpleDB (which I've yet to learn, but expect to be similar
> > to hbase/BigTable)
> >
> > c. Running an hbase/hadoop cluster somewhere else (I already have a
> > single-node cluster working great on our hosting provider's internal
> > network).
> >
> > So unless the process is drastically simpler than I've estimated, I
> > think my next stop is going to be a SimpleDB tutorial, keeping my
> > hbase work handy as another alternative.
> >
> > -- Jim R. Wilson (jimbojw)
> >
>
>  Chris K Wensel
>  chris@wensel.net
>  http://chris.wensel.net/
>  http://www.cascading.org/
>

Re: [hbase-user] about to give up on hbase/ec2

Posted by Chris K Wensel <ch...@wensel.net>.
how large are the text values to the numeric keys?

I'm running a >40 node Hadoop cluster that launches ~40 mr jobs to do
nothing more than bin event streams by symbol, apply some math, and
stuff them into S3 (as zip files, ugh) for pickup. These zips are in
the few-megs size range, and I have about 20k symbols (currently; the
next app will have 200k symbols). (cascading makes and manages all the
mr jobs for me.)
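
A toy version of that binning step, minus all the MapReduce machinery
(record shape and file naming here are just illustrative; cascading
drives the real thing across the cluster):

```python
import io
import zipfile
from collections import defaultdict


def bin_and_zip(events):
    """Group (symbol, line) event records by symbol, then pack each
    symbol's stream into its own in-memory zip, ready for upload.

    Returns {"<symbol>.zip": zip_bytes} for each symbol seen.
    """
    bins = defaultdict(list)
    for symbol, line in events:
        bins[symbol].append(line)

    zips = {}
    for symbol, lines in bins.items():
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("%s.txt" % symbol, "\n".join(lines))
        zips["%s.zip" % symbol] = buf.getvalue()
    return zips
```

Each resulting blob maps one-to-one onto an S3 object, which is what
makes the pickup side trivial.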

Once we have validated all the result data sets, we will probably
start mirroring a subset of the data (it's daily) in Hbase for further
ad hoc query support.

Point being, we are tackling one piece at a time: get hadoop/ec2 up
and stable, run larger and larger jobs/clusters, validate data,
improve data accessibility, etc. We would have gone mad trying to get
it all up and going in one shot.

Also, keep in mind EC2 will have permanent local storage soon, so
backing up incrementally to S3 may not be necessary, depending on the
SLA for that storage. A long-lived Hadoop cluster can then be as
permanent as any local cluster in a datacenter.

ckw

On May 8, 2008, at 8:19 AM, Jim R. Wilson wrote:

> Unfortunately, I'm about to give up on hbase over ec2.
>
> In my application, the hbase storage is very simple, write-once text
> storage.  To get this to work on ec2, I've concluded I need the
> following:
>
> 1. A cluster of hadoop machines running an appropriate version of
> hadoop (0.16.3 at the time of this writing)
>
> 2. Hbase running on the same cluster, either connected to S3, which
> I've been warned as slow, or HDFS on top of PersistentFS which may or
> may not fair better.
>
> 3. Thrift service running atop hbase for interaction from remote
> (outside ec2) Python and PHP scripts.
>
> 4. Static IP's for any hadoop nodes running data-transfer jobs due to
> firewall restrictions on the MySQL end (outside ec2), and also so that
> the Python/PHP scripts know where to find Thrift.
>
> 5. Mechanism to force all hbase nodes to write any memory-resident
> changes to disc for backup purposes (Java).
>
> Now, my particular problem is very simple - just numeric key to text
> storage.  Ex: { "1":"Hello", "2":"World" }.  I've (nearly) come to the
> conclusion that I would be much better off either:
>
> a. Using an S3 bucket to store 1.txt, 2.txt etc (probably with a
> heirarchical dir structure to keep the directories small - I've got
> about 4 million such number/text pairs at the moment).
>
> b. Using SimpleDB (which I've yet to learn, but expect to be similar
> to hbase/BigTable)
>
> c. Running an hbase/hadoop cluster somewhere else (I already have a
> single-node cluster working great on our hosting provider's internal
> network).
>
> So unless the process is drastically simpler than I've estimated, I
> think my next stop is going to be a SimpleDB tutorial, keeping my
> hbase work handy as another alternative.
>
> -- Jim R. Wilson (jimbojw)

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/