Posted to user@hbase.apache.org by Jack Levin <ma...@gmail.com> on 2010/09/20 20:00:31 UTC

Millions of photos into Hbase

Greetings all.  My name is Jack and I work for an image hosting
company, Image Shack; we also have a property that's widely used as a
Twitter app called yfrog (yfrog.com).

Image-Shack gets close to two million image uploads per day, which are
usually stored on regular servers (we have about 700) as regular
files, and each server has its own host name, such as img55.   I've
been researching how to improve our backend design in terms of data
safety and stumbled onto the HBase project.

We have been running Hadoop for data access log analysis for a while
now, quite successfully.  We receive about 2 billion hits per
day and store all of that data in RCFiles (attribution to Facebook
applies here), which are loadable into Hive (thanks to FB again).  So
we know how to manage HDFS and run MapReduce jobs.

Now, I think HBase is the most beautiful thing that has happened to
the distributed DB world :).   The idea is to store image files (about
400 KB on average) in HBase.  The setup will include the following
configuration:

50 servers total (2 datacenters), with 8 GB RAM, dual-core CPU, and 6 x
2TB disks each.
3 to 5 ZooKeepers
2 Masters (one in each datacenter)
10 to 20 Stargate REST instances (one per server, hash load-balanced)
40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
2 NameNode servers (one a backup, highly available; will also do fsimage and
edits snapshots)

So far I have about 13 servers running and am doing about 20 insertions
per second (file sizes ranging from a few KB to 2-3 MB, averaging 400 KB)
via the Stargate API.  Our frontend servers receive files, and I just
fork-insert them into Stargate via HTTP (curl).
The inserts are humming along nicely, without any noticeable load on
the regionservers; so far I have inserted about 2 TB worth of images.
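
For reference, a minimal sketch of what one of those fork-inserts might look like in Java instead of curl, assuming the Stargate endpoint accepts a raw cell value at /<table>/<row>/<column> when the body is application/octet-stream (which matches the curl and nginx setup described further down the thread); the host name and the img15 / att:data names are illustrative assumptions:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StargatePut {
    public static void main(String[] args) throws Exception {
        String filename = args[0];                        // e.g. normal052q.jpg
        byte[] image = Files.readAllBytes(Paths.get(filename));

        // PUT the raw bytes as the cell value for row <filename>, column att:data.
        URL url = new URL("http://stargate01:8080/img15/" + filename + "/att:data");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        OutputStream out = conn.getOutputStream();
        out.write(image);
        out.close();
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
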
I have adjusted the region file size to 512 MB, and the table block size
to about 400 KB, trying to match the average access size and limit HDFS
round trips.   So far read performance has been more than adequate, and of
course write performance is nowhere near capacity.
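
As a concrete illustration of those two knobs, a sketch of how they could be set per-table with the Java client of that era (the same thing can be done from the shell); the table and family names are assumed, and the exact constructors and factory methods shifted a bit between 0.20 and 0.90, so treat this as approximate:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateImageTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("img15");
        // Split regions at roughly 512 MB.
        table.setMaxFileSize(512L * 1024 * 1024);

        HColumnDescriptor family = new HColumnDescriptor("att");
        // ~400 KB HFile blocks, so one average image is roughly one block read.
        family.setBlocksize(400 * 1024);
        table.addFamily(family);

        admin.createTable(table);
    }
}
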
So right now, all newly uploaded images go to HBase.  But we do plan
to insert about 170 million images (about 100 days' worth), which is
only about 64 TB, or roughly 10% of the planned cluster size of 600 TB.
The end goal is to have a storage system that provides data safety,
i.e. the system may go down but data cannot be lost.   Our front-end
servers will continue to serve images from their own file systems (we
are serving about 16 Gbit/s at peak).  However, should we need to bring
any of those down for maintenance, we will redirect all of their traffic
to HBase (should be no more than a few hundred Mbps) while the front-end
server is repaired (for example, having its disk replaced).  After the
repairs, we quickly repopulate it with the missing files, while serving
whatever is still missing off HBase.
All in all it should be a very interesting project, and I am hoping not to
run into any snags; however, should that happen, I am pleased to know
that such a great and vibrant tech group exists that supports and uses
HBase :).

-Jack

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
I don't have empty rows, does that make a difference?  E.g. when a row
is inserted it is always accompanied by the image data.

-Jack

On Mon, Sep 20, 2010 at 2:06 PM, Todd Lipcon <to...@cloudera.com> wrote:
> On Mon, Sep 20, 2010 at 1:13 PM, Jack Levin <ma...@gmail.com> wrote:
>> Todd, I could not get stargate to work on 0.89 for some reason, thats
>> why we are running 0.20.6.  Also in regards to bloom filters, I
>> thought they were mainly for column seeking, in our case we have this
>> schema:
>>
>> row           att:data
>> filename    file_data
>>
>
> The bloom filters work either in a ROW basis or a ROW_COL basis. If
> you turn on the row key blooms, then your get of a particular filename
> will avoid looking in the storefiles that don't have any data for the
> row.
>
> Regarding stargate in 0.89, it's been renamed to "rest" since the old
> rest server got removed. I haven't used it much but hopefully someone
> can give you a pointer (or even better, update the wiki/docs!)
>
> -Todd

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
you'll need a major compaction to generate the blooms on the existing data.
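
To tie this together with Alexey's question quoted below, a rough sketch of the whole sequence with the 0.89-era Java admin API; the shell's alter / enable / major_compact commands do the same thing, and the exact modifyColumn signature varied between releases, so treat the calls below as assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class EnableRowBlooms {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Redefine the 'att' family with a row-key bloom filter
        // (re-apply any other non-default settings, e.g. BLOCKSIZE, here too).
        HColumnDescriptor family = new HColumnDescriptor("att");
        family.setBloomFilterType(StoreFile.BloomType.ROW);

        admin.disableTable("img15");
        admin.modifyColumn("img15", family);   // assumed signature
        admin.enableTable("img15");

        // Blooms only exist in newly written storefiles, so rewrite the old ones.
        admin.majorCompact("img15");
    }
}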



On Mon, Sep 20, 2010 at 2:15 PM, Alexey Kovyrin <al...@kovyrin.net> wrote:
> When one enables blooms for rows, is major compaction or something
> else required (aside from enabling the table after the alter)?

Re: Millions of photos into Hbase

Posted by Alexey Kovyrin <al...@kovyrin.net>.
When one enables blooms for rows, is major compaction or something
else required (aside from enabling the table after the alter)?




-- 
Alexey Kovyrin
http://kovyrin.net/

Re: Millions of photos into Hbase

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Sep 20, 2010 at 1:13 PM, Jack Levin <ma...@gmail.com> wrote:
> Todd, I could not get stargate to work on 0.89 for some reason, thats
> why we are running 0.20.6.  Also in regards to bloom filters, I
> thought they were mainly for column seeking, in our case we have this
> schema:
>
> row           att:data
> filename    file_data
>

The bloom filters work on either a ROW basis or a ROW_COL basis. If
you turn on row-key blooms, then your get of a particular filename
will avoid looking in the storefiles that don't have any data for that
row.

Regarding stargate in 0.89, it's been renamed to "rest" since the old
rest server got removed. I haven't used it much but hopefully someone
can give you a pointer (or even better, update the wiki/docs!)

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Todd, I could not get Stargate to work on 0.89 for some reason; that's
why we are running 0.20.6.  Also, in regards to bloom filters, I
thought they were mainly for column seeking; in our case we have this
schema:

row           att:data
filename    file_data


-Jack
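
For readers following the schema above, a minimal sketch of the same pattern through the native Java client rather than Stargate, using the 0.20/0.89-era client API; the table name and example filename are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageStoreExample {
    private static final byte[] FAMILY = Bytes.toBytes("att");
    private static final byte[] QUALIFIER = Bytes.toBytes("data");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "img15");

        // Row key is the image filename, the cell value is the raw image bytes.
        byte[] imageBytes = new byte[] { /* ... jpeg bytes ... */ };
        Put put = new Put(Bytes.toBytes("normal052q.jpg"));
        put.add(FAMILY, QUALIFIER, imageBytes);
        table.put(put);

        // Random-key read: with a ROW bloom filter this skips storefiles
        // that hold nothing for this filename.
        Get get = new Get(Bytes.toBytes("normal052q.jpg"));
        Result result = table.get(get);
        byte[] stored = result.getValue(FAMILY, QUALIFIER);
        System.out.println("read " + (stored == null ? 0 : stored.length) + " bytes");
    }
}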


Re: Millions of photos into Hbase

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Jack,

This sounds like a very exciting project! A few thoughts that might help you:
- Check out the Bloom filter support that is in the 0.89 series. It
sounds like all of your access is going to be random key gets - adding
blooms will save you lots of disk seeks.
- I might even bump the region size up to 1G or more given the planned capacity.
- The "HA" setup will be tricky - we don't have a great HA story yet.
Given you have two DCs, you may want to consider running separate
HBase clusters, one in each, and either using the new replication
support, or simply doing "client replication" by writing all images to
both.

Good luck with the project, and keep us posted how it goes.

Thanks
-Todd
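
A rough sketch of the 'client replication' option mentioned above, assuming one Stargate/HAProxy endpoint per datacenter (the host names are made up); real code would also need a retry or queueing path for the case where one side is temporarily down:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DualDatacenterWrite {
    // One REST endpoint per datacenter (hypothetical host names).
    private static final String[] ENDPOINTS = {
        "http://stargate-dc1:8080", "http://stargate-dc2:8080"
    };

    /** PUTs the image into both clusters; returns true only if both succeed. */
    public static boolean putBoth(String filename, byte[] image) throws Exception {
        boolean ok = true;
        for (String endpoint : ENDPOINTS) {
            URL url = new URL(endpoint + "/img15/" + filename + "/att:data");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/octet-stream");
            OutputStream out = conn.getOutputStream();
            out.write(image);
            out.close();
            ok &= (conn.getResponseCode() / 100 == 2);
            conn.disconnect();
        }
        return ok;
    }
}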




--
Todd Lipcon
Software Engineer, Cloudera

Re: Millions of photos into Hbase

Posted by Leen Toelen <to...@gmail.com>.
Hi,

I suppose you have read about HayStack as well? This gives an explanation on
how Facebook stores its photos.

Needle in a haystack: efficient storage of billions of photos
http://www.facebook.com/note.php?note_id=76191543919

Regards,
Leen


Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
On Wed, Sep 22, 2010 at 10:39 AM, Sujee Maniyam <su...@sujee.net> wrote:
> Jack,
> sounds like a cool project indeed.  Few questions for you...
>
>
> 1) how do you setup 50+ servers.  What I mean is, installing OS,
> installing all software.  Setting up user accounts.  Setting up SSH
> keys ..etc
>     DO you use any software for this?

Just ssh keys, using cfengine, and other things.


>
> 2) Is Hbase going to be the 'primary' storage for images?  Meaning,
> your front-end reads/writes to Hbase?
> Do you also maintain a 'file storage' as a backup / alternative?

There will be front end servers that will cache and serve files off
their disks, while hbase is going to keep the images highly available
for some products as well as keeping them safe.


> 3) Do you only store the 'main image' in Hbase?  How about
> thumbnails, medium size, large size cousins?

Those will be generated dynamically.

>
> 4) Is this a dedicated Hbase cluster, or you are building this on top
> of your existing Hadoop cluster.  Will this be sharing  resources with
> MR jobs that you already run?

It's dedicated.

>
> 5) I notice you have 8G RAM for region servers.  Hbase is very memory
> hungry and specially dealing with large data sizes, I'd imagine you'd
> need 24G-32G (as it was previously mentioned)

This is not going to happen, for two reasons: a) it's too expensive and the
motherboard does not support it; b) we are aiming for a large dataset
with 5 to 10% concentrated hits.

> 6) how long does it take for all regions to be available after a 'cold
> start' of Hbase?

6-7 mins with dual core, 1 minute with 8 core.

> 7) I'd be interested to know how do you do 'standby servers' for
> HMaster and Hadoop Namenode

Just two 8-core boxes, running the namenode, secondary namenode and masters between them.

-Jack


> have fun
>
> regards
> Sujee
>
> http://sujee.net
>

Re: Millions of photos into Hbase

Posted by Sujee Maniyam <su...@sujee.net>.
Jack,
sounds like a cool project indeed.  Few questions for you...


1) How do you set up 50+ servers?  What I mean is: installing the OS,
installing all the software, setting up user accounts, setting up SSH
keys, etc.
     Do you use any software for this?


2) Is Hbase going to be the 'primary' storage for images?  Meaning,
your front-end reads/writes to Hbase?
Do you also maintain a 'file storage' as a backup / alternative?

3) Do you only store the 'main image' in Hbase?  How about
thumbnails, medium size, large size cousins?

4) Is this a dedicated Hbase cluster, or you are building this on top
of your existing Hadoop cluster.  Will this be sharing  resources with
MR jobs that you already run?


5) I notice you have 8G RAM for region servers.  HBase is very memory
hungry, especially when dealing with large data sizes; I'd imagine you'd
need 24G-32G (as was previously mentioned)

6) how long does it take for all regions to be available after a 'cold
start' of Hbase?

7) I'd be interested to know how you do 'standby servers' for the
HMaster and the Hadoop NameNode

have fun

regards
Sujee

http://sujee.net

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
The speed of the master isn't a big deal, since it isn't involved in the
read/write path at all.

3000 regions? It depends on the size, but with small regions many people
run multi-thousand per server, and with large ones, hundreds.

-ryan

On Tue, Sep 21, 2010 at 12:15 PM, Jack Levin <ma...@gmail.com> wrote:
> Quick question, if you have say 3000 records, on 13 servers, with each
> region being 1GB, how long do we expect those regions to load? (Master
> is dual core, with 8 GB RAM)
>
> Also, what does this line mean:
>
> ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeDeleted
> state:SyncConnected path:/hbase/UNASSIGNED/1735906890
>
> -Jack
>
> On Mon, Sep 20, 2010 at 10:51 PM, Jack Levin <ma...@gmail.com> wrote:
>> Awesome, thanks!... I will give it a whirl on our test cluster.
>>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 10:15 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> So we are running this code in production:
>>>
>>> http://github.com/stumbleupon/hbase
>>>
>>> The branch off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and
>>> everything past that is our rebase and cherry-picked changes.
>>>
>>> We use git to manage this internally, and don't use svn.  Included is
>>> the LZO libraries we use checked directly into the code, and the
>>> assembly changes to publish those.
>>>
>>> So when we are ready to do a deploy, we do this:
>>> mvn install assembly:assembly
>>> (or include the -DskipTests to make it go faster)
>>>
>>> and then we have a new tarball to deploy.
>>>
>>> Note there is absolutely NO warranty here, not even that it will run
>>> for a microsecond... furthermore this is NOT an ASF release, just a
>>> courtesy.  If there ever were to be a release it would look
>>> different, because ASF releases can't include GPL code (this does)
>>> or depend on commercial releases of Hadoop.
>>>
>>> Enjoy,
>>> -ryan
>>>
>>> On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>> no no, 20 GB heap per node.  each node with 24-32gb ram, etc.
>>>>
>>>> we cant rely on the linux buffer cache to save us, so we have to cache
>>>> in hbase ram.
>>>>
>>>> :-)
>>>>
>>>> -ryan
>>>>
>>>> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with
>>>>> 3 GB Heap likely, this should be plenty to rip through say, 350TB of
>>>>> data.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>> yes that is the new ZK based coordination.  when i publish the SU code
>>>>>> we have a patch which limits that and is faster.  2GB is a little
>>>>>> small for a regionserver memory... in my ideal world we'll be putting
>>>>>> 20GB+ of ram to regionserver.
>>>>>>
>>>>>> I just figured you were using the DEB/RPMs because your files were in
>>>>>> /usr/local... I usually run everything out of /home/hadoop b/c it
>>>>>> allows me to easily rsync as user hadoop.
>>>>>>
>>>>>> but you are on the right track yes :-)
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> Who said anything about deb :). I do use tarballs.... Yes, so what did
>>>>>>> it is the copy of that jar to under hbase/lib, and then full restart.
>>>>>>>  Now here is a funny thing, the master shuddered for about 10 minutes,
>>>>>>> spewing those messages:
>>>>>>>
>>>>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>>>>> Event NodeCreated with state SyncConnected with path
>>>>>>> /hbase/UNASSIGNED/97999366
>>>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>>>>> NodeCreated with path /hbase/UNASSIGNED/97999366
>>>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>>>>> Got zkEvent NodeCreated state:SyncConnected
>>>>>>> path:/hbase/UNASSIGNED/97999366
>>>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>>>>>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>>>>>>> M2ZK_REGION_OFFLINE
>>>>>>> 2010-09-20 21:23:45,828 INFO
>>>>>>> org.apache.hadoop.hbase.master.RegionServerOperation:
>>>>>>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>>>>>>> 10.103.2.3,60020,1285042333293
>>>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>>>>>> M2ZK_REGION_OFFLINE ] for region 97999366
>>>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>>>>> Event NodeChildrenChanged with state SyncConnected with path
>>>>>>> /hbase/UNASSIGNED
>>>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>>>>> NodeChildrenChanged with path /hbase/UNASSIGNED
>>>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>>>>> Got zkEvent NodeChildrenChanged state:SyncConnected
>>>>>>> path:/hbase/UNASSIGNED
>>>>>>> 2010-09-20 21:23:45,830 DEBUG
>>>>>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>>>>>> img150,,1284859678248.3116007 is not valid;
>>>>>>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>>>>
>>>>>>>
>>>>>>> Does anyone know what they mean?   At first it would kill one of my
>>>>>>> datanodes.  But what helped is when I changed to heap size to 4GB for
>>>>>>> master and 2GB for datanode that was dying, and after 10 minutes I got
>>>>>>> into a clean state.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>> yes, on every single machine as well, and restart.
>>>>>>>>
>>>>>>>> again, not sure how how you'd do this in a scalable manner with your
>>>>>>>> deb packages... on the source tarball you can just replace it, rsync
>>>>>>>> it out and done.
>>>>>>>>
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>>>>>> Then restart, etc?  All regionservers too?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>>>>>> software...
>>>>>>>>>>
>>>>>>>>>> So normally installing from tarball, the jar is in
>>>>>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>>>>
>>>>>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>>>>>
>>>>>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>>>>>> comes pre-packaged.
>>>>>>>>>>
>>>>>>>>>> Stay tuned :-)
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just to to be
>>>>>>>>>>> sure, and where do I put it?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>> you need 2 more things:
>>>>>>>>>>>>
>>>>>>>>>>>> - restart hdfs
>>>>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>>>>>>>> details.
>>>>>>>>>>>>> Master Attributes
>>>>>>>>>>>>> Attribute Name  Value   Description
>>>>>>>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>>>>>>>> was compiled and by whom
>>>>>>>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>>>>>>>> version was compiled and by whom
>>>>>>>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>>>>>>>> of HBase home directory
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any ideas whats wrong?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them our and
>>>>>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>>>>>> is running prod on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>>>>>> a few minor patches of our own concoction brought in.  It currently
>>>>>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>>>>>> top of that.  I'll poke about and see if it's possible to publish to a
>>>>>>>>>>>>>> github branch or something.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>> Sounds, good, only reason I ask is because of this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>>>>> For reading, its a nginx proxy that does Content-type modification
>>>>>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>>>>>>> Why REST, it was the simplest thing to run, given that its supports
>>>>>>>>>>>>>>>>> HTTP, potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>>>>>> can use http still to send and receive data (anyone wrote anything
>>>>>>>>>>>>>>>>> like that say in python, C or java?)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you are
>>>>>>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Quick question: if you have, say, 3000 regions on 13 servers, with each
region being 1GB, how long should we expect those regions to take to load?
(The master is dual core, with 8 GB RAM.)

Also, what does this line mean:

ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeDeleted
state:SyncConnected path:/hbase/UNASSIGNED/1735906890

-Jack
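(One way to watch what the master is doing with those znodes, assuming a stock ZooKeeper install with its command-line client on the path, is something like:

    # connect to one of the quorum members and inspect the in-flight region znodes
    zkCli.sh -server zk-host:2181
    ls /hbase/UNASSIGNED
    get /hbase/UNASSIGNED/1735906890

where zk-host stands in for one of your quorum servers.  The NodeDeleted event itself just means the watched znode under /hbase/UNASSIGNED was removed.)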

On Mon, Sep 20, 2010 at 10:51 PM, Jack Levin <ma...@gmail.com> wrote:
> Awesome, thanks!... I will give it a whirl on our test cluster.
>
> -Jack
>
> On Mon, Sep 20, 2010 at 10:15 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> So we are running this code in production:
>>
>> http://github.com/stumbleupon/hbase
>>
>> The branch off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and
>> everything past that is our rebase and cherry-picked changes.
>>
>> We use git to manage this internally, and don't use svn.  Included is
>> the LZO libraries we use checked directly into the code, and the
>> assembly changes to publish those.
>>
>> So when we are ready to do a deploy, we do this:
>> mvn install assembly:assembly
>> (or include the -DskipTests to make it go faster)
>>
>> and then we have a new tarball to deploy.
>>
>> Note there is absolutely NO warranty here, not even that it will run
>> for a microsecond... furthermore this is NOT an ASF release, just a
>> courtesy.  If there ever were to be a release it would look
>> different, because ASF releases can't include GPL code (this does)
>> or depend on commercial releases of Hadoop.
>>
>> Enjoy,
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> no no, 20 GB heap per node.  each node with 24-32gb ram, etc.
>>>
>>> we cant rely on the linux buffer cache to save us, so we have to cache
>>> in hbase ram.
>>>
>>> :-)
>>>
>>> -ryan
>>>
>>> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with
>>>> 3 GB Heap likely, this should be plenty to rip through say, 350TB of
>>>> data.
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>> yes that is the new ZK based coordination.  when i publish the SU code
>>>>> we have a patch which limits that and is faster.  2GB is a little
>>>>> small for a regionserver memory... in my ideal world we'll be putting
>>>>> 20GB+ of ram to regionserver.
>>>>>
>>>>> I just figured you were using the DEB/RPMs because your files were in
>>>>> /usr/local... I usually run everything out of /home/hadoop b/c it
>>>>> allows me to easily rsync as user hadoop.
>>>>>
>>>>> but you are on the right track yes :-)
>>>>>
>>>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> Who said anything about deb :). I do use tarballs.... Yes, what did
>>>>>> it was copying that jar to under hbase/lib, and then a full restart.
>>>>>>  Now here is a funny thing, the master shuddered for about 10 minutes,
>>>>>> spewing those messages:
>>>>>>
>>>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>>>> Event NodeCreated with state SyncConnected with path
>>>>>> /hbase/UNASSIGNED/97999366
>>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>>>> NodeCreated with path /hbase/UNASSIGNED/97999366
>>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>>>> Got zkEvent NodeCreated state:SyncConnected
>>>>>> path:/hbase/UNASSIGNED/97999366
>>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>>>>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>>>>>> M2ZK_REGION_OFFLINE
>>>>>> 2010-09-20 21:23:45,828 INFO
>>>>>> org.apache.hadoop.hbase.master.RegionServerOperation:
>>>>>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>>>>>> 10.103.2.3,60020,1285042333293
>>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>>>>> M2ZK_REGION_OFFLINE ] for region 97999366
>>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>>>> Event NodeChildrenChanged with state SyncConnected with path
>>>>>> /hbase/UNASSIGNED
>>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>>>> NodeChildrenChanged with path /hbase/UNASSIGNED
>>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>>>> Got zkEvent NodeChildrenChanged state:SyncConnected
>>>>>> path:/hbase/UNASSIGNED
>>>>>> 2010-09-20 21:23:45,830 DEBUG
>>>>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>>>>> img150,,1284859678248.3116007 is not valid;
>>>>>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>>>
>>>>>>
>>>>>> Does anyone know what they mean?   At first it would kill one of my
>>>>>> datanodes.  But what helped is when I changed the heap size to 4GB for
>>>>>> the master and 2GB for the datanode that was dying, and after 10 minutes I got
>>>>>> into a clean state.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>> yes, on every single machine as well, and restart.
>>>>>>>
>>>>>>> again, not sure how you'd do this in a scalable manner with your
>>>>>>> deb packages... on the source tarball you can just replace it, rsync
>>>>>>> it out and done.
>>>>>>>
>>>>>>> :-)
>>>>>>>
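(A rough sketch of that swap, with the paths below being assumptions based on the layout mentioned in this thread rather than a prescription:

    # on every node: replace the Hadoop jar bundled under hbase/lib with the
    # one the cluster actually runs, then restart HDFS and HBase
    rm /usr/lib/hbase/lib/hadoop-core-*.jar
    cp /usr/lib/hadoop/hadoop-core-0.20.2+320.jar /usr/lib/hbase/lib/

With a tarball install the same change can be made once and rsynced out to the rest of the nodes.)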
>>>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>>>>> Then restart, etc?  All regionservers too?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>>>>> software...
>>>>>>>>>
>>>>>>>>> So normally installing from tarball, the jar is in
>>>>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>>>
>>>>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>>>>
>>>>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>>>>> comes pre-packaged.
>>>>>>>>>
>>>>>>>>> Stay tuned :-)
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just want to be
>>>>>>>>>> sure, and where do I put it?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>> you need 2 more things:
>>>>>>>>>>>
>>>>>>>>>>> - restart hdfs
>>>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>>>>
>>>>>>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>>>>>>> details.
>>>>>>>>>>>> Master Attributes
>>>>>>>>>>>> Attribute Name  Value   Description
>>>>>>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>>>>>>> was compiled and by whom
>>>>>>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>>>>>>> version was compiled and by whom
>>>>>>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>>>>>>> of HBase home directory
>>>>>>>>>>>>
>>>>>>>>>>>> Any ideas whats wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>>
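(The stanza being described is the usual Hadoop-style property block; a minimal sketch of what would go into both hdfs-site.xml and hbase-site.xml:

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>

As noted elsewhere in this thread, the property alone is not enough: HDFS has to be restarted and the hadoop jar under hbase/lib has to match the running cluster before the master's append warning goes away.)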
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>>>>> is running prod on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>>>>> a few minor patches of our own concoction brought in.  It currently
>>>>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>>>>>>>> github branch or something.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>> Sounds good, the only reason I ask is because of this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-Type modification
>>>>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>>>>>> Why REST?  It was the simplest thing to run, given that it supports
>>>>>>>>>>>>>>>> HTTP; potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>>>>> can still use HTTP to send and receive data (has anyone written anything
>>>>>>>>>>>>>>>> like that, say in python, C or java?)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you are
>>>>>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Awesome, thanks!... I will give it a whirl on our test cluster.

-Jack

On Mon, Sep 20, 2010 at 10:15 PM, Ryan Rawson <ry...@gmail.com> wrote:
> So we are running this code in production:
>
> http://github.com/stumbleupon/hbase
>
> The branch off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and
> everything past that is our rebase and cherry-picked changes.
>
> We use git to manage this internally, and don't use svn.  Included is
> the LZO libraries we use checked directly into the code, and the
> assembly changes to publish those.
>
> So when we are ready to do a deploy, we do this:
> mvn install assembly:assembly
> (or include the -DskipTests to make it go faster)
>
> and then we have a new tarball to deploy.
>
> Note there is absolutely NO warranty here, not even that it will run
> for a microsecond... furthermore this is NOT an ASF release, just a
> courtesy.  If there ever were to be a release it would look
> different, because ASF releases can't include GPL code (this does)
> or depend on commercial releases of Hadoop.
>
> Enjoy,
> -ryan
>
> On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> no no, 20 GB heap per node.  each node with 24-32gb ram, etc.
>>
>> we cant rely on the linux buffer cache to save us, so we have to cache
>> in hbase ram.
>>
>> :-)
>>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <ma...@gmail.com> wrote:
>>> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with
>>> 3 GB Heap likely, this should be plenty to rip through say, 350TB of
>>> data.
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>> yes that is the new ZK based coordination.  when i publish the SU code
>>>> we have a patch which limits that and is faster.  2GB is a little
>>>> small for a regionserver memory... in my ideal world we'll be putting
>>>> 20GB+ of ram to regionserver.
>>>>
>>>> I just figured you were using the DEB/RPMs because your files were in
>>>> /usr/local... I usually run everything out of /home/hadoop b/c it
>>>> allows me to easily rsync as user hadoop.
>>>>
>>>> but you are on the right track yes :-)
>>>>
>>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> Who said anything about deb :). I do use tarballs.... Yes, what did
>>>>> it was copying that jar to under hbase/lib, and then a full restart.
>>>>>  Now here is a funny thing, the master shuddered for about 10 minutes,
>>>>> spewing those messages:
>>>>>
>>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>>> Event NodeCreated with state SyncConnected with path
>>>>> /hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>>> NodeCreated with path /hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>>> Got zkEvent NodeCreated state:SyncConnected
>>>>> path:/hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG
>>>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>>>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>>>>> M2ZK_REGION_OFFLINE
>>>>> 2010-09-20 21:23:45,828 INFO
>>>>> org.apache.hadoop.hbase.master.RegionServerOperation:
>>>>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>>>>> 10.103.2.3,60020,1285042333293
>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>>>> M2ZK_REGION_OFFLINE ] for region 97999366
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>>> Event NodeChildrenChanged with state SyncConnected with path
>>>>> /hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>>> NodeChildrenChanged with path /hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,828 DEBUG
>>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>>> Got zkEvent NodeChildrenChanged state:SyncConnected
>>>>> path:/hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,830 DEBUG
>>>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>>>> img150,,1284859678248.3116007 is not valid;
>>>>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>>
>>>>>
>>>>> Does anyone know what they mean?   At first it would kill one of my
>>>>> datanodes.  But what helped is when I changed the heap size to 4GB for
>>>>> the master and 2GB for the datanode that was dying, and after 10 minutes I got
>>>>> into a clean state.
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>> yes, on every single machine as well, and restart.
>>>>>>
>>>>>> again, not sure how you'd do this in a scalable manner with your
>>>>>> deb packages... on the source tarball you can just replace it, rsync
>>>>>> it out and done.
>>>>>>
>>>>>> :-)
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>>>> Then restart, etc?  All regionservers too?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>>>> software...
>>>>>>>>
>>>>>>>> So normally installing from tarball, the jar is in
>>>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>>
>>>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>>>
>>>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>>>> comes pre-packaged.
>>>>>>>>
>>>>>>>> Stay tuned :-)
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just want to be
>>>>>>>>> sure, and where do I put it?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>> you need 2 more things:
>>>>>>>>>>
>>>>>>>>>> - restart hdfs
>>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>>>
>>>>>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>>>>>> details.
>>>>>>>>>>> Master Attributes
>>>>>>>>>>> Attribute Name  Value   Description
>>>>>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>>>>>> was compiled and by whom
>>>>>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>>>>>> version was compiled and by whom
>>>>>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>>>>>> of HBase home directory
>>>>>>>>>>>
>>>>>>>>>>> Any ideas whats wrong?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>> Hey,
>>>>>>>>>>>>
>>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>>>> is running prod on.
>>>>>>>>>>>>
>>>>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>>>> a few minor patches of our own concoction brought in.  It currently
>>>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>>>>>>> github branch or something.
>>>>>>>>>>>>
>>>>>>>>>>>> -ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>> Sounds good, the only reason I ask is because of this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-Type modification
>>>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>>>>> Why REST?  It was the simplest thing to run, given that it supports
>>>>>>>>>>>>>>> HTTP; potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>>>> can still use HTTP to send and receive data (has anyone written anything
>>>>>>>>>>>>>>> like that, say in python, C or java?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you are
>>>>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
So we are running this code in production:

http://github.com/stumbleupon/hbase

The branch off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and
everything past that is our rebase and cherry-picked changes.

We use git to manage this internally, and don't use svn.  Included is
the LZO libraries we use checked directly into the code, and the
assembly changes to publish those.

So when we are ready to do a deploy, we do this:
mvn install assembly:assembly
(or include the -DskipTests to make it go faster)

and then we have a new tarball to deploy.
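A sketch of that build-and-push cycle, with the paths and host name below being assumptions rather than anything SU actually uses:

    # build the deployable tarball from the checked-out branch
    mvn install assembly:assembly -DskipTests
    # unpack it and push it out to the cluster as user hadoop
    tar xzf target/hbase-*-bin.tar.gz -C /home/hadoop/
    rsync -a /home/hadoop/hbase-*/ hadoop@region-host:/home/hadoop/hbase/

The exact tarball and directory names depend on the version set in pom.xml.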

Note there is absolutely NO warranty here, not even that it will run
for a microsecond... furthermore this is NOT an ASF release, just a
courtesy.  If there ever were to be a release it would look
different, because ASF releases can't include GPL code (this does)
or depend on commercial releases of Hadoop.

Enjoy,
-ryan

On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <ry...@gmail.com> wrote:
> no no, 20 GB heap per node.  each node with 24-32gb ram, etc.
>
> we cant rely on the linux buffer cache to save us, so we have to cache
> in hbase ram.
>
> :-)
>
> -ryan
>
> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <ma...@gmail.com> wrote:
>> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with
>> 3 GB Heap likely, this should be plenty to rip through say, 350TB of
>> data.
>>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> yes that is the new ZK based coordination.  when i publish the SU code
>>> we have a patch which limits that and is faster.  2GB is a little
>>> small for a regionserver memory... in my ideal world we'll be putting
>>> 20GB+ of ram to regionserver.
>>>
>>> I just figured you were using the DEB/RPMs because your files were in
>>> /usr/local... I usually run everything out of /home/hadoop b/c it
>>> allows me to easily rsync as user hadoop.
>>>
>>> but you are on the right track yes :-)
>>>
>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> Who said anything about deb :). I do use tarballs.... Yes, what did
>>>> it was copying that jar to under hbase/lib, and then a full restart.
>>>>  Now here is a funny thing, the master shuddered for about 10 minutes,
>>>> spewing those messages:
>>>>
>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>> Event NodeCreated with state SyncConnected with path
>>>> /hbase/UNASSIGNED/97999366
>>>> 2010-09-20 21:23:45,827 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>> NodeCreated with path /hbase/UNASSIGNED/97999366
>>>> 2010-09-20 21:23:45,827 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>> Got zkEvent NodeCreated state:SyncConnected
>>>> path:/hbase/UNASSIGNED/97999366
>>>> 2010-09-20 21:23:45,827 DEBUG
>>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>>>> M2ZK_REGION_OFFLINE
>>>> 2010-09-20 21:23:45,828 INFO
>>>> org.apache.hadoop.hbase.master.RegionServerOperation:
>>>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>>>> 10.103.2.3,60020,1285042333293
>>>> 2010-09-20 21:23:45,828 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>>> M2ZK_REGION_OFFLINE ] for region 97999366
>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>> Event NodeChildrenChanged with state SyncConnected with path
>>>> /hbase/UNASSIGNED
>>>> 2010-09-20 21:23:45,828 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>> NodeChildrenChanged with path /hbase/UNASSIGNED
>>>> 2010-09-20 21:23:45,828 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>> Got zkEvent NodeChildrenChanged state:SyncConnected
>>>> path:/hbase/UNASSIGNED
>>>> 2010-09-20 21:23:45,830 DEBUG
>>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>>> img150,,1284859678248.3116007 is not valid;
>>>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>
>>>>
>>>> Does anyone know what they mean?   At first it would kill one of my
>>>> datanodes.  But what helped is when I changed the heap size to 4GB for
>>>> the master and 2GB for the datanode that was dying, and after 10 minutes I got
>>>> into a clean state.
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>> yes, on every single machine as well, and restart.
>>>>>
>>>>> again, not sure how you'd do this in a scalable manner with your
>>>>> deb packages... on the source tarball you can just replace it, rsync
>>>>> it out and done.
>>>>>
>>>>> :-)
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>>> Then restart, etc?  All regionservers too?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>>> software...
>>>>>>>
>>>>>>> So normally installing from tarball, the jar is in
>>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>
>>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>>
>>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>>> comes pre-packaged.
>>>>>>>
>>>>>>> Stay tuned :-)
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just want to be
>>>>>>>> sure, and where do I put it?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>> you need 2 more things:
>>>>>>>>>
>>>>>>>>> - restart hdfs
>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>>
>>>>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>>>>> details.
>>>>>>>>>> Master Attributes
>>>>>>>>>> Attribute Name  Value   Description
>>>>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>>>>> was compiled and by whom
>>>>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>>>>> version was compiled and by whom
>>>>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>>>>> of HBase home directory
>>>>>>>>>>
>>>>>>>>>> Any ideas whats wrong?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>>> is running prod on.
>>>>>>>>>>>
>>>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>>> a few minor patches of our own concoction brought in.  It currently
>>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>>>>>> github branch or something.
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> Sounds good, the only reason I ask is because of this:
>>>>>>>>>>>>
>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>
>>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>>>>
>>>>>>>>>>>> -jack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-Type modification
>>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>>>> Why REST?  It was the simplest thing to run, given that it supports
>>>>>>>>>>>>>> HTTP; potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>>> can still use HTTP to send and receive data (has anyone written anything
>>>>>>>>>>>>>> like that, say in python, C or java?)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you are
>>>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
no no, 20 GB heap per node.  each node with 24-32gb ram, etc.

we cant rely on the linux buffer cache to save us, so we have to cache
in hbase ram.

:-)

-ryan
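A heap that size would be set in hbase-env.sh on each regionserver; a minimal sketch, where the 20000 MB figure is simply the number above and the GC flag is an illustrative assumption rather than a tuned recommendation:

    # conf/hbase-env.sh on each regionserver
    export HBASE_HEAPSIZE=20000
    export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

How much of that heap actually goes to the block cache is governed by hfile.block.cache.size in hbase-site.xml (roughly 20% of heap by default in this era).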

On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <ma...@gmail.com> wrote:
> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with
> 3 GB Heap likely, this should be plenty to rip through say, 350TB of
> data.
>
> -Jack
>
> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> yes that is the new ZK based coordination.  when i publish the SU code
>> we have a patch which limits that and is faster.  2GB is a little
>> small for a regionserver memory... in my ideal world we'll be putting
>> 20GB+ of ram to regionserver.
>>
>> I just figured you were using the DEB/RPMs because your files were in
>> /usr/local... I usually run everything out of /home/hadoop b/c it
>> allows me to easily rsync as user hadoop.
>>
>> but you are on the right track yes :-)
>>
>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
>>> Who said anything about deb :). I do use tarballs.... Yes, what did
>>> it was copying that jar to under hbase/lib, and then a full restart.
>>>  Now here is a funny thing, the master shuddered for about 10 minutes,
>>> spewing those messages:
>>>
>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>> Event NodeCreated with state SyncConnected with path
>>> /hbase/UNASSIGNED/97999366
>>> 2010-09-20 21:23:45,827 DEBUG
>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>> NodeCreated with path /hbase/UNASSIGNED/97999366
>>> 2010-09-20 21:23:45,827 DEBUG
>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>> Got zkEvent NodeCreated state:SyncConnected
>>> path:/hbase/UNASSIGNED/97999366
>>> 2010-09-20 21:23:45,827 DEBUG
>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>>> M2ZK_REGION_OFFLINE
>>> 2010-09-20 21:23:45,828 INFO
>>> org.apache.hadoop.hbase.master.RegionServerOperation:
>>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>>> 10.103.2.3,60020,1285042333293
>>> 2010-09-20 21:23:45,828 DEBUG
>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>> M2ZK_REGION_OFFLINE ] for region 97999366
>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>> Event NodeChildrenChanged with state SyncConnected with path
>>> /hbase/UNASSIGNED
>>> 2010-09-20 21:23:45,828 DEBUG
>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>> NodeChildrenChanged with path /hbase/UNASSIGNED
>>> 2010-09-20 21:23:45,828 DEBUG
>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>> Got zkEvent NodeChildrenChanged state:SyncConnected
>>> path:/hbase/UNASSIGNED
>>> 2010-09-20 21:23:45,830 DEBUG
>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>> img150,,1284859678248.3116007 is not valid;
>>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>
>>>
>>> Does anyone know what they mean?   At first it would kill one of my
>>> datanodes.  But what helped is when I changed to heap size to 4GB for
>>> master and 2GB for datanode that was dying, and after 10 minutes I got
>>> into a clean state.
>>>
>>> -Jack
>>>
>>>
>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>> yes, on every single machine as well, and restart.
>>>>
>>>> again, not sure how how you'd do this in a scalable manner with your
>>>> deb packages... on the source tarball you can just replace it, rsync
>>>> it out and done.
>>>>
>>>> :-)
>>>>
>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>> Then restart, etc?  All regionservers too?
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>> software...
>>>>>>
>>>>>> So normally installing from tarball, the jar is in
>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>
>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>
>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>> comes pre-packaged.
>>>>>>
>>>>>> Stay tuned :-)
>>>>>>
>>>>>> -ryan
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just to to be
>>>>>>> sure, and where do I put it?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>> you need 2 more things:
>>>>>>>>
>>>>>>>> - restart hdfs
>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>
>>>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>>>> details.
>>>>>>>>> Master Attributes
>>>>>>>>> Attribute Name  Value   Description
>>>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>>>> was compiled and by whom
>>>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>>>> version was compiled and by whom
>>>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>>>> of HBase home directory
>>>>>>>>>
>>>>>>>>> Any ideas whats wrong?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them our and
>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>> is running prod on.
>>>>>>>>>>
>>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' peach sets
>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>> a few minor patches of our own concoction brought in.  If current
>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>>>>> github branch or something.
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> Sounds, good, only reason I ask is because of this:
>>>>>>>>>>>
>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>
>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>>>
>>>>>>>>>>> -jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>
>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>
>>>>>>>>>>>> -ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>
>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>> For reading, its a nginx proxy that does Content-type modification
>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>>> Why REST, it was the simplest thing to run, given that its supports
>>>>>>>>>>>>> HTTP, potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>> can use http still to send and receive data (anyone wrote anything
>>>>>>>>>>>>> like that say in python, C or java?)
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're definetly interested in how your project progresses.  If you are
>>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
20GB+? Hmmm... I do plan to run 50 regionserver nodes though, likely with
a 3 GB heap each; that should be plenty to rip through, say, 350TB of
data.

-Jack
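
As a rough back-of-the-envelope on those figures, assuming the usual 3x
HDFS replication and the 1 GB region size discussed elsewhere in the
thread (my arithmetic, not numbers from the cluster itself):

# 50 regionservers x 6 disks x 2 TB each
echo $(( 50 * 6 * 2 ))      # 600 TB raw, matching the planned cluster size
echo $(( 600 / 3 ))         # ~200 TB of unique data once 3x replication is applied
# At ~1 GB per region, 350 TB of HBase data would be on the order of
# 350,000 regions, i.e. roughly 7,000 regions per regionserver.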

On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ry...@gmail.com> wrote:
> yes that is the new ZK based coordination.  when i publish the SU code
> we have a patch which limits that and is faster.  2GB is a little
> small for a regionserver memory... in my ideal world we'll be putting
> 20GB+ of ram to regionserver.
>
> I just figured you were using the DEB/RPMs because your files were in
> /usr/local... I usually run everything out of /home/hadoop b/c it
> allows me to easily rsync as user hadoop.
>
> but you are on the right track yes :-)
>
> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
>> Who said anything about deb :). I do use tarballs.... Yes, so what did
>> it is the copy of that jar to under hbase/lib, and then full restart.
>>  Now here is a funny thing, the master shuddered for about 10 minutes,
>> spewing those messages:
>>
>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>> Event NodeCreated with state SyncConnected with path
>> /hbase/UNASSIGNED/97999366
>> 2010-09-20 21:23:45,827 DEBUG
>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>> NodeCreated with path /hbase/UNASSIGNED/97999366
>> 2010-09-20 21:23:45,827 DEBUG
>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>> Got zkEvent NodeCreated state:SyncConnected
>> path:/hbase/UNASSIGNED/97999366
>> 2010-09-20 21:23:45,827 DEBUG
>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>> M2ZK_REGION_OFFLINE
>> 2010-09-20 21:23:45,828 INFO
>> org.apache.hadoop.hbase.master.RegionServerOperation:
>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>> 10.103.2.3,60020,1285042333293
>> 2010-09-20 21:23:45,828 DEBUG
>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>> M2ZK_REGION_OFFLINE ] for region 97999366
>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>> Event NodeChildrenChanged with state SyncConnected with path
>> /hbase/UNASSIGNED
>> 2010-09-20 21:23:45,828 DEBUG
>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>> NodeChildrenChanged with path /hbase/UNASSIGNED
>> 2010-09-20 21:23:45,828 DEBUG
>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>> Got zkEvent NodeChildrenChanged state:SyncConnected
>> path:/hbase/UNASSIGNED
>> 2010-09-20 21:23:45,830 DEBUG
>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>> img150,,1284859678248.3116007 is not valid;
>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>
>>
>> Does anyone know what they mean?   At first it would kill one of my
>> datanodes.  But what helped is when I changed to heap size to 4GB for
>> master and 2GB for datanode that was dying, and after 10 minutes I got
>> into a clean state.
>>
>> -Jack
>>
>>
>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> yes, on every single machine as well, and restart.
>>>
>>> again, not sure how how you'd do this in a scalable manner with your
>>> deb packages... on the source tarball you can just replace it, rsync
>>> it out and done.
>>>
>>> :-)
>>>
>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>> Then restart, etc?  All regionservers too?
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>> policies and I have to highly recommend not using DEBs to install
>>>>> software...
>>>>>
>>>>> So normally installing from tarball, the jar is in
>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>
>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>
>>>>> I'm working on a github publish of SU's production system, which uses
>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>> comes pre-packaged.
>>>>>
>>>>> Stay tuned :-)
>>>>>
>>>>> -ryan
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just to to be
>>>>>> sure, and where do I put it?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>> you need 2 more things:
>>>>>>>
>>>>>>> - restart hdfs
>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>
>>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>>> details.
>>>>>>>> Master Attributes
>>>>>>>> Attribute Name  Value   Description
>>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>>> was compiled and by whom
>>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>>> version was compiled and by whom
>>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>>> of HBase home directory
>>>>>>>>
>>>>>>>> Any ideas whats wrong?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>> 0.89 "developer releases" in hopes that people would try them our and
>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>> is running prod on.
>>>>>>>>>
>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' peach sets
>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>> a few minor patches of our own concoction brought in.  If current
>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>>>> github branch or something.
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> Sounds, good, only reason I ask is because of this:
>>>>>>>>>>
>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>
>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>>
>>>>>>>>>> -jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>
>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>
>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>
>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>> For reading, its a nginx proxy that does Content-type modification
>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>> Why REST, it was the simplest thing to run, given that its supports
>>>>>>>>>>>> HTTP, potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>> can use http still to send and receive data (anyone wrote anything
>>>>>>>>>>>> like that say in python, C or java?)
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>
>>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>
>>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're definetly interested in how your project progresses.  If you are
>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>
>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>
>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
Yes, that is the new ZK-based coordination. When I publish the SU code,
it includes a patch which limits that and is faster. 2GB is a little
small for regionserver memory... in my ideal world we'll be putting
20GB+ of RAM into each regionserver.

I just figured you were using the DEB/RPMs because your files were in
/usr/local... I usually run everything out of /home/hadoop because it
allows me to easily rsync as user hadoop.

But you are on the right track, yes :-)
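
For what it's worth, a minimal sketch of that rsync-style deploy from a
tarball install; the hostnames and the /home/hadoop/hbase path are
illustrative, not SU's actual layout:

# Push one unpacked install (config edits, swapped jars and all) to the
# rest of the cluster as the hadoop user, then restart daemons per node.
for host in rs01 rs02 rs03; do
  rsync -a --delete /home/hadoop/hbase/ hadoop@${host}:/home/hadoop/hbase/
done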

On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <ma...@gmail.com> wrote:
> Who said anything about deb :). I do use tarballs.... Yes, so what did
> it is the copy of that jar to under hbase/lib, and then full restart.
>  Now here is a funny thing, the master shuddered for about 10 minutes,
> spewing those messages:
>
> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Event NodeCreated with state SyncConnected with path
> /hbase/UNASSIGNED/97999366
> 2010-09-20 21:23:45,827 DEBUG
> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
> NodeCreated with path /hbase/UNASSIGNED/97999366
> 2010-09-20 21:23:45,827 DEBUG
> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
> Got zkEvent NodeCreated state:SyncConnected
> path:/hbase/UNASSIGNED/97999366
> 2010-09-20 21:23:45,827 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Created/updated
> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
> M2ZK_REGION_OFFLINE
> 2010-09-20 21:23:45,828 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> img13,p1000319tq.jpg,1284952655960.812544765 open on
> 10.103.2.3,60020,1285042333293
> 2010-09-20 21:23:45,828 DEBUG
> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
> M2ZK_REGION_OFFLINE ] for region 97999366
> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Event NodeChildrenChanged with state SyncConnected with path
> /hbase/UNASSIGNED
> 2010-09-20 21:23:45,828 DEBUG
> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
> NodeChildrenChanged with path /hbase/UNASSIGNED
> 2010-09-20 21:23:45,828 DEBUG
> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
> Got zkEvent NodeChildrenChanged state:SyncConnected
> path:/hbase/UNASSIGNED
> 2010-09-20 21:23:45,830 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
> img150,,1284859678248.3116007 is not valid;
> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>
>
> Does anyone know what they mean?   At first it would kill one of my
> datanodes.  But what helped is when I changed to heap size to 4GB for
> master and 2GB for datanode that was dying, and after 10 minutes I got
> into a clean state.
>
> -Jack
>
>
> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> yes, on every single machine as well, and restart.
>>
>> again, not sure how how you'd do this in a scalable manner with your
>> deb packages... on the source tarball you can just replace it, rsync
>> it out and done.
>>
>> :-)
>>
>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>> Then restart, etc?  All regionservers too?
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>> policies and I have to highly recommend not using DEBs to install
>>>> software...
>>>>
>>>> So normally installing from tarball, the jar is in
>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>
>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>
>>>> I'm working on a github publish of SU's production system, which uses
>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>> comes pre-packaged.
>>>>
>>>> Stay tuned :-)
>>>>
>>>> -ryan
>>>>
>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> Ryan, hadoop jar, what is the usual path to the file? I just to to be
>>>>> sure, and where do I put it?
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>> you need 2 more things:
>>>>>>
>>>>>> - restart hdfs
>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>
>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>> details.
>>>>>>> Master Attributes
>>>>>>> Attribute Name  Value   Description
>>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>>> was compiled and by whom
>>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>>> version was compiled and by whom
>>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>>> of HBase home directory
>>>>>>>
>>>>>>> Any ideas whats wrong?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>> 0.89 "developer releases" in hopes that people would try them our and
>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>> is running prod on.
>>>>>>>>
>>>>>>>> At this point tracking 0.89 and which ones are the 'best' peach sets
>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>> a few minor patches of our own concoction brought in.  If current
>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>>> github branch or something.
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> Sounds, good, only reason I ask is because of this:
>>>>>>>>>
>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>
>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>>
>>>>>>>>> -jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>
>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>
>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>
>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>> 400Kb on average into HBASE).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>
>>>>>>>>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>> src yourself).
>>>>>>>>>>>
>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>> For reading, its a nginx proxy that does Content-type modification
>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>> Why REST, it was the simplest thing to run, given that its supports
>>>>>>>>>>> HTTP, potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>> can use http still to send and receive data (anyone wrote anything
>>>>>>>>>>> like that say in python, C or java?)
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>
>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>>>>>>>>> trips.
>>>>>>>>>>>>
>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>
>>>>>>>>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>
>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>>>>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>> the missing remaining off Hbase.
>>>>>>>>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>>>>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> We're definetly interested in how your project progresses.  If you are
>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>
>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>
>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>
>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>
>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>>>>>>>>> and data?  e.g. cross compatible?
>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Who said anything about deb :) I do use tarballs... Yes, so what did
the trick was copying that jar under hbase/lib, and then a full restart.
Now here is a funny thing: the master shuddered for about 10 minutes,
spewing these messages:

2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
Event NodeCreated with state SyncConnected with path
/hbase/UNASSIGNED/97999366
2010-09-20 21:23:45,827 DEBUG
org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
NodeCreated with path /hbase/UNASSIGNED/97999366
2010-09-20 21:23:45,827 DEBUG
org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
Got zkEvent NodeCreated state:SyncConnected
path:/hbase/UNASSIGNED/97999366
2010-09-20 21:23:45,827 DEBUG
org.apache.hadoop.hbase.master.RegionManager: Created/updated
UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
M2ZK_REGION_OFFLINE
2010-09-20 21:23:45,828 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
img13,p1000319tq.jpg,1284952655960.812544765 open on
10.103.2.3,60020,1285042333293
2010-09-20 21:23:45,828 DEBUG
org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
M2ZK_REGION_OFFLINE ] for region 97999366
2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
Event NodeChildrenChanged with state SyncConnected with path
/hbase/UNASSIGNED
2010-09-20 21:23:45,828 DEBUG
org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
NodeChildrenChanged with path /hbase/UNASSIGNED
2010-09-20 21:23:45,828 DEBUG
org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
Got zkEvent NodeChildrenChanged state:SyncConnected
path:/hbase/UNASSIGNED
2010-09-20 21:23:45,830 DEBUG
org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
img150,,1284859678248.3116007 is not valid;
serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.


Does anyone know what they mean?  At first it would kill one of my
datanodes.  But what helped was changing the heap size to 4GB for the
master and 2GB for the datanode that was dying, and after 10 minutes I
got into a clean state.

-Jack
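
Those messages are the master walking regions through the
/hbase/UNASSIGNED queue in ZooKeeper after the restart. One rough way to
watch the backlog drain is to list those znodes directly; the zkCli.sh
location and quorum host below are illustrative:

# List the unassigned-region znodes; the count should fall toward zero as
# the master finishes reassigning regions after the restart.
echo "ls /hbase/UNASSIGNED" | /usr/lib/zookeeper/bin/zkCli.sh -server zk01:2181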


On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ry...@gmail.com> wrote:
> yes, on every single machine as well, and restart.
>
> again, not sure how how you'd do this in a scalable manner with your
> deb packages... on the source tarball you can just replace it, rsync
> it out and done.
>
> :-)
>
> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>> Then restart, etc?  All regionservers too?
>>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>> policies and I have to highly recommend not using DEBs to install
>>> software...
>>>
>>> So normally installing from tarball, the jar is in
>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>
>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>
>>> I'm working on a github publish of SU's production system, which uses
>>> the cloudera maven repo to install the correct JAR in hbase so when
>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>> comes pre-packaged.
>>>
>>> Stay tuned :-)
>>>
>>> -ryan
>>>
>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> Ryan, hadoop jar, what is the usual path to the file? I just to to be
>>>> sure, and where do I put it?
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>> you need 2 more things:
>>>>>
>>>>> - restart hdfs
>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>> hbase-site.xml, the master still reports this:
>>>>>>
>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>> details.
>>>>>> Master Attributes
>>>>>> Attribute Name  Value   Description
>>>>>> HBase Version   0.89.20100726, r979826  HBase version and svn revision
>>>>>> HBase Compiled  Sat Jul 31 02:01:58 PDT 2010, stack     When HBase version
>>>>>> was compiled and by whom
>>>>>> Hadoop Version  0.20.2, r911707 Hadoop version and svn revision
>>>>>> Hadoop Compiled Fri Feb 19 08:07:34 UTC 2010, chrisdo   When Hadoop
>>>>>> version was compiled and by whom
>>>>>> HBase Root Directory    hdfs://namenode-rd.imageshack.us:9000/hbase     Location
>>>>>> of HBase home directory
>>>>>>
>>>>>> Any ideas whats wrong?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>> Hey,
>>>>>>>
>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>> 0.89 "developer releases" in hopes that people would try them our and
>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>> is running prod on.
>>>>>>>
>>>>>>> At this point tracking 0.89 and which ones are the 'best' peach sets
>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>> a few minor patches of our own concoction brought in.  If current
>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>> top of that.  I'll poke about and see if its possible to publish to a
>>>>>>> github branch or something.
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Sounds, good, only reason I ask is because of this:
>>>>>>>>
>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>
>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> Are we talking about data loss in case of datanode going down while
>>>>>>>> being written to, or RegionServer going down?
>>>>>>>>
>>>>>>>> -jack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>
>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>> release you will lose data.  you must be on 0.89 and
>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>
>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>>>>>>>> been researching on how to improve our backend design in terms of data
>>>>>>>>>>>> safety and stumped onto the Hbase project.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>
>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>>>>>>>> would really put cache into good use.  Some other services however,
>>>>>>>>>> have about 25% cache hit ratio, in which case the latency should be
>>>>>>>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>>>>>>>> then its good enough.   Safely is supremely important, then its
>>>>>>>>>> availability, then speed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
Yes, on every single machine as well, and then restart.

Again, I'm not sure how you'd do this in a scalable manner with your
deb packages... with the source tarball you can just replace it, rsync
it out, and be done.

:-)

On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <ma...@gmail.com> wrote:
> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
> Then restart, etc?  All regionservers too?
>
> -Jack
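
For reference, a rough sketch of the swap described above, using the paths
mentioned in this thread. The exact jar names and the source location are
assumptions, so check your own layout first:

    # On each node: set aside the hadoop jar HBase ships with, then drop in
    # the jar from the running CDH3 install (its location varies by install).
    cd /usr/lib/hbase/lib
    mkdir -p replaced-jars && mv hadoop*core*.jar replaced-jars/
    cp /usr/lib/hadoop/hadoop-core-0.20.2+320.jar .

    # Push the same change to every regionserver ("rs01" is a hypothetical
    # hostname), then restart HDFS and HBase so the new jar is picked up.
    rsync -av /usr/lib/hbase/lib/ rs01:/usr/lib/hbase/lib/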

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
OK, I found that file. Do I replace hadoop-core.*.jar under /usr/lib/hbase/lib,
then restart, etc.?  All regionservers too?

-Jack


Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
Well, I don't really run CDH; I disagree with their rpm/deb packaging
policies, and I have to strongly recommend not using DEBs to install
this software...

When installing from the tarball, the jar is normally at
<installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar

On the CDH/DEB edition it's somewhere silly ... locate and find will be
your friend.  It should be called hadoop-core-0.20.2+320.jar, though!

I'm working on a github publish of SU's production setup, which uses
the Cloudera maven repo to pull the correct Hadoop JAR into HBase, so
when you run 'mvn assembly:assembly' to build your own
hbase-*-bin.tar.gz (the * being whatever version you specified in
pom.xml), the cdh3b2 jar comes pre-packaged.

Stay tuned :-)

-ryan
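
Acting on the "locate and find" suggestion is a one-liner; on a CDH package
install the jar usually ends up somewhere under /usr/lib/hadoop, though the
exact path is an assumption here:

    # Find the CDH3 jar on disk, and see which hadoop jar HBase currently bundles.
    find / -name 'hadoop-core*.jar' 2>/dev/null
    ls /usr/lib/hbase/lib/ | grep -i hadoop

The maven route does the same thing at build time: with the pom pointing at
the Cloudera repo, 'mvn assembly:assembly' produces a tarball that already
contains the cdh3b2 jar, so no manual swap is needed.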


Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Ryan, about the hadoop jar: what is the usual path to the file, just to
be sure?  And where do I put it?

-Jack


Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
you need 2 more things:

- restart hdfs
- make sure the hadoop jar from your install replaces the one we ship with
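
With the CDH3 packages named elsewhere in this thread (hadoop-0.20-*), the
restart half of this would look roughly like the sketch below; the init
script names are inferred from the package name and may differ on your
boxes:

    # Roll through the cluster:
    /etc/init.d/hadoop-0.20-namenode restart    # on the namenode
    /etc/init.d/hadoop-0.20-datanode restart    # on each datanode
    # Then restart the HBase master and regionservers so they come back up
    # against the replaced hadoop jar.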



Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
So, I switched to 0.89, and we already had CDH3
(hadoop-0.20-datanode-0.20.2+320-3.noarch). Even though I added
<name>dfs.support.append</name> set to true in both hdfs-site.xml and
hbase-site.xml, the master still reports this:

 You are currently running the HMaster without HDFS append support
enabled. This may result in data loss. Please see the HBase wiki for
details.

Master Attributes
Attribute Name        Value                                        Description
HBase Version         0.89.20100726, r979826                       HBase version and svn revision
HBase Compiled        Sat Jul 31 02:01:58 PDT 2010, stack          When HBase version was compiled and by whom
Hadoop Version        0.20.2, r911707                              Hadoop version and svn revision
Hadoop Compiled       Fri Feb 19 08:07:34 UTC 2010, chrisdo        When Hadoop version was compiled and by whom
HBase Root Directory  hdfs://namenode-rd.imageshack.us:9000/hbase  Location of HBase home directory

Any ideas what's wrong?

-Jack
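
The attributes table above is the tell: "Hadoop Version 0.20.2, r911707" is
the stock Apache build, which means the master is still loading the
hadoop-core jar bundled with HBase rather than the CDH3 one, and stock
0.20.2 has no working append/sync. That is exactly what the "replace the
jar we ship with" advice in this thread addresses. A sketch of what to
check, using the paths mentioned earlier:

    # The property being set in both hdfs-site.xml and hbase-site.xml
    # (standard Hadoop property stanza):
    #   <property>
    #     <name>dfs.support.append</name>
    #     <value>true</value>
    #   </property>
    #
    # Check which hadoop jar is actually on the HBase classpath:
    ls -l /usr/lib/hbase/lib/ | grep -i hadoop
    # Until that is the CDH3 jar (hadoop-core-0.20.2+320.jar), the master will
    # keep reporting stock 0.20.2, and the append warning will likely stay.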


On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
> Hey,
>
> There is actually only 1 active branch of hbase, that being the 0.89
> release, which is based on 'trunk'.  We have snapshotted a series of
> 0.89 "developer releases" in hopes that people would try them our and
> start thinking about the next major version.  One of these is what SU
> is running prod on.
>
> At this point tracking 0.89 and which ones are the 'best' peach sets
> to run is a bit of a contact sport, but if you are serious about not
> losing data it is worthwhile.  SU is based on the most recent DR with
> a few minor patches of our own concoction brought in.  If current
> works, but some Master ops are slow, and there are a few patches on
> top of that.  I'll poke about and see if its possible to publish to a
> github branch or something.
>
> -ryan
>
> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
>> Sounds, good, only reason I ask is because of this:
>>
>> There are currently two active branches of HBase:
>>
>>    * 0.20 - the current stable release series, being maintained with
>> patches for bug fixes only. This release series does not support HDFS
>> durability - edits may be lost in the case of node failure.
>>    * 0.89 - a development release series with active feature and
>> stability development, not currently recommended for production use.
>> This release does support HDFS durability - cases in which edits are
>> lost are considered serious bugs.
>>>>>>>>
>>
>> Are we talking about data loss in case of datanode going down while
>> being written to, or RegionServer going down?
>>
>> -jack
>>
>>
>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>
>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>> release you will lose data.  you must be on 0.89 and
>>> CDH3/append-branch to achieve data durability, and there really is no
>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>> stop and rebase those tests onto the latest DR announced on the list.
>>>
>>> -ryan
>>>
>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> Hi Stack, see inline:
>>>>
>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>>> Hey Jack:
>>>>>
>>>>> Thanks for writing.
>>>>>
>>>>> See below for some comments.
>>>>>
>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>
>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>>> been researching on how to improve our backend design in terms of data
>>>>>> safety and stumped onto the Hbase project.
>>>>>>
>>>>>
>>>>> Any other requirements other than data safety? (latency, etc).
>>>>
>>>> Latency is the second requirement.  We have some services that are
>>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>>> would really put cache into good use.  Some other services however,
>>>> have about 25% cache hit ratio, in which case the latency should be
>>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>>> then its good enough.   Safely is supremely important, then its
>>>> availability, then speed.
>>>>
>>>>
>>>>
>>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>> 400Kb on average into HBASE).
>>>>>
>>>>>
>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>> the size of images folks can upload to your service?
>>>>>
>>>>>
>>>>> The setup will include the following
>>>>>> configuration:
>>>>>>
>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>> 2TB disks each.
>>>>>> 3 to 5 Zookeepers
>>>>>> 2 Masters (in a datacenter each)
>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>
>>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>> src yourself).
>>>>
>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>> For reading, its a nginx proxy that does Content-type modification
>>>> from image/jpeg to octet-stream, and vice versa,
>>>> it then hits Haproxy again, which hits balanced REST.
>>>> Why REST, it was the simplest thing to run, given that its supports
>>>> HTTP, potentially we could rewrite something for thrift, as long as we
>>>> can use http still to send and receive data (anyone wrote anything
>>>> like that say in python, C or java?)
>>>>
>>>>>
>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>> edits snapshots also)
>>>>>>
>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>> fork-insert them into stargate via http (curl).
>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>>> trips.
>>>>>
>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>
>>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>> about HDFS block size?
>>>>
>>>>>  So far the read performance was more than adequate, and of
>>>>>> course write performance is nowhere near capacity.
>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>>> servers will continue to serve images from their own file system (we
>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>>> server is repaired (for example having its disk replaced), after the
>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>> the missing remaining off Hbase.
>>>>>> All in all should be very interesting project, and I am hoping not to
>>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>> HBASE :).
>>>>>>
>>>>>
>>>>> We're definetly interested in how your project progresses.  If you are
>>>>> ever up in the city, you should drop by for a chat.
>>>>
>>>> Cool.  I'd like that.
>>>>
>>>>> St.Ack
>>>>>
>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>> P.P.S I updated the wiki on stargate REST:
>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>
>>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>>> and data?  e.g. cross compatible?
>>>> Is 0.89 ready for production environment?
>>>>
>>>> -Jack
>>>>
>>>
>>
>

RE: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Jonathan Gray <jg...@facebook.com>.
Yes, there are way more changes/improvements made beyond the initial scope of working master failover / zk-based region transitions.  Lots of cleanup, removal of many of the nasty bits we've always hated, and the groundwork is laid for a lot of future improvements.

I'm not apprehensive.  I think many of the points raised are valid, but my concerns have mostly been addressed at this point.  I'm going to be spending the next 10 days or so helping stabilize and test.  We'll see where we get by then.

JG
 
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Thursday, September 23, 2010 3:47 AM
> To: dev@hbase.apache.org
> Subject: Re: old master to 0.90 and new master to 0.92? (was RE:
> Millions of photos into Hbase)
> 
> Thanks. I'm going to try it out right away.
> 
> > From: Ryan Rawson
> > I've been using the new master a bit, and the improvements are just
> > too dramatic to ignore or delay.  Instant table transitions, clusters
> > that actually shut down, etc, etc.
> >
> >
> > On Wed, Sep 22, 2010 at 3:11 PM, Andrew Purtell <ap...@apache.org>
> > wrote:
> > > If JG is apprehensive about putting the new master
> > into 0.90, that's good enough for me. But I'll vote +0. I'm
> > too unfamiliar with the details.
> 
> 
> 
> 


Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Andrew Purtell <ap...@apache.org>.
Thanks. I'm going to try it out right away.

> From: Ryan Rawson
> I've been using the new master a bit, and the improvements are just
> too dramatic to ignore or delay.  Instant table transitions, clusters
> that actually shut down, etc, etc.
> 
>
> On Wed, Sep 22, 2010 at 3:11 PM, Andrew Purtell <ap...@apache.org>
> wrote:
> > If JG is apprehensive about putting the new master
> into 0.90, that's good enough for me. But I'll vote +0. I'm
> too unfamiliar with the details.



      


Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Ryan Rawson <ry...@gmail.com>.
I've been using the new master a bit, and the improvements are just
too dramatic to ignore or delay.  Instant table transitions, clusters
that actually shut down, etc, etc.



On Wed, Sep 22, 2010 at 3:11 PM, Andrew Purtell <ap...@apache.org> wrote:
> If JG is apprehensive about putting the new master into 0.90, that's good enough for me. But I'll vote +0. I'm too unfamiliar with the details.
>
> Best regards,
>
>    - Andy
>
> Why is this email five sentences or less?
> http://five.sentenc.es/
>
>
>
>
>
>

Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Andrew Purtell <ap...@apache.org>.
If JG is apprehensive about putting the new master into 0.90, that's good enough for me. But I'll vote +0. I'm too unfamiliar with the details. 

Best regards,

    - Andy

Why is this email five sentences or less?
http://five.sentenc.es/



      


Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Sep 22, 2010 at 5:47 AM, Jonathan Gray <jg...@facebook.com> wrote:

> > > We also confuse the 0.90 'message' given we've been talking about new
> > > master at HUGs and here on the lists with a good while now.
> >
> > This might be a little naïve but what about calling the "SU"-build
> > 0.88 and release it (with a small disclaimer). It would make sense in
> > that the 0.89-dev releases have more stuff in them than 0.88 and it
> > would still allow for a 0.90 release with all the announced features.
>
> I like this too... if we go the direction of an old master release.
>

I'm -1 on an out-of-order release like this. I think it will really confuse
people, plus it won't work correctly with upgrades on packaging systems.

I know we've been promising "new master for 0.90" at meetups and stuff, but
outside the dev team, does anyone really care about the implementation
details? The feature here isn't a new master, it's more stable master
failover, better locality, etc. Whether we have that or say that it's going
to be in the next version, I don't think people will be too upset.

The argument that it's hell to maintain two branches is a fair one, but
unless we can get trunk usable very soon now, we're already going to be in
that situation. People are migrating production instances to 0.89.x now, and
once people go to production they are very hesitant to upgrade, regardless
of what we tell them.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

RE: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Jonathan Gray <jg...@facebook.com>.
> > We also confuse the 0.90 'message' given we've been talking about new
> > master at HUGs and here on the lists with a good while now.
> 
> This might be a little naïve but what about calling the "SU"-build
> 0.88 and release it (with a small disclaimer). It would make sense in
> that the 0.89-dev releases have more stuff in them than 0.88 and it
> would still allow for a 0.90 release with all the announced features.

I like this too... if we go the direction of an old master release.

It addresses most of the issues, besides slowing the release of the new master.  So not necessarily +1 on doing it, but this would be better than an old-master 0.90 release imo.

Good idea Lars.

JG


Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Stack <st...@duboce.net>.
On Tue, Sep 21, 2010 at 4:54 PM, Lars Francke <la...@gmail.com> wrote:
>> We also confuse the 0.90 'message' given we've been talking about new
>> master at HUGs and here on the lists with a good while now.
>
> This might be a little naïve but what about calling the "SU"-build
> 0.88 and release it (with a small disclaimer). It would make sense in
> that the 0.89-dev releases have more stuff in them than 0.88 and it
> would still allow for a 0.90 release with all the announced features.
>

That's an idea.

> As a side note: I plan to finish work on the Maven changes in time for
> HW but tests keep on failing (Hudson has the same problem...) which
> doesn't make it easy to completely test the build.

Sorry about that.  I've been picking away at them from time to time
but some require more than a glance to fix (for me at least, most of
what fails on Hudson does not fail here locally).

> I don't know if
> I'll be able to get Thrift done though, sorry. I'm doing this in my
> spare time and I didn't have a lot of that :( but with a bit of luck I
> can work on HBase in my new job. But I'll upload updated Thrift files
> in the next few days if someone wants to check what I've done (modeled
> it after Jeff's Avro work).
>

Thanks for the update and good luck in the new job.

St.Ack

Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Lars Francke <la...@gmail.com>.
> We also confuse the 0.90 'message' given we've been talking about new
> master at HUGs and here on the lists with a good while now.

This might be a little naïve but what about calling the "SU"-build
0.88 and release it (with a small disclaimer). It would make sense in
that the 0.89-dev releases have more stuff in them than 0.88 and it
would still allow for a 0.90 release with all the announced features.

As a side note: I plan to finish work on the Maven changes in time for
HW but tests keep on failing (Hudson has the same problem...) which
doesn't make it easy to completely test the build. I don't know if
I'll be able to get Thrift done though, sorry. I'm doing this in my
spare time and I didn't have a lot of that :( but with a bit of luck I
can work on HBase in my new job. But I'll upload updated Thrift files
in the next few days if someone wants to check what I've done (modeled
it after Jeff's Avro work).

Cheers,
Lars

Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Stack <st...@duboce.net>.
On Tue, Sep 21, 2010 at 10:19 AM, Todd Lipcon <to...@cloudera.com> wrote:
> CDH3b3 will be ready for Hadoop world, and we'd kind of like to freeze
> component versions at this point in the beta cycle. So if 0.90 is out,
> that would be great. We can certainly work with what we've got
> (20100830 minus ZK assignment) but if a production-worth new master
> isn't ready by the end of October or so we'll probably push that out
> til CDH4.
>

Ok.

It's my thought that we'd have a production-ready master way before the end
of October.


>> I'm thinking we can afford to take such a position
>> because if someone wants durable hbase now, they can run with the
>> SU-prod 0.89.x that J-D is about to put up.
>>
>
> This is going to be an official Apache release, right? I guess my
> question is: if it's stable and usable for production, why call it a
> "development release"? If we're recommending it for all new users over
> and above 0.20.6, then it seems like this should be deemed stable (ie
> even release number).
>

I was not talking of recommending the next 0.89.x to all new users
over 0.20.6.  I thought it plain that running a release in production
at SU does not mean the release is 'stable and usable' for others.  SU has
3 hbase committers aboard to fix and hand-hold the software over rough
patches.

I was thinking that the next 0.89.x comes w/ the same caveats as
previous 0.89.xs: that it's a 'developer release' for those willing to
put up w/ some rough edges, and that it's just another step on the way
to 0.90.0.


>> Putting off new master till 0.92 means it'll be maybe 6 months before
>> it appears.  During this time we'll be paying a high price keeping up
>> two disparate branches -- TRUNK w/ new master and the release branch
>> -- shoe-horning patches to fit both.
>>
>
> If you guys are running 20100830 in production, won't you be doing
> that anyway? Assumedly we'd treat this 0.90 as "no new features" and
> put the new features into 0.91.x leading up to 0.92?
>

Nope.  We'd move to the new master release.  If it doesn't work for us, we'd
feel a little awkward recommending it to others.


>> We also confuse the 0.90 'message' given we've been talking about new
>> master at HUGs and here on the lists with a good while now.
>
> True. The question is whether we prefer to slip time or slip scope. In
> my opinion slipping scope is better - it's open source and people
> understand that schedules slip. Keeping strong release momentum up
> helps adoption and will get people off 0.20.6 which no one really
> wants to support anymore.
>

This is the question.  I'm just suggesting that new master MAY not be
that far out.  I want to do another couple of days work on it and then
have us make a call; i.e. vote that we press on to get new master into
0.90 or punt on new master for 0.90.

St.Ack

Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Sep 21, 2010 at 10:11 AM, Stack <st...@duboce.net> wrote:
> On Tue, Sep 21, 2010 at 4:49 AM, Jonathan Gray <jg...@facebook.com> wrote:
>> My main concern is that this move will push back any release of the new master significantly.  There are countless improvements in the codebase that came along with the rewrite, well beyond just zk transitions.  But doing a production release in time for HW is probably the most important thing.
>>
>
> IMO, we're almost there w/ the new master.  I'd like to wait till
> Thursday or Friday before making a call.
>
> Regards HW, we might consider pressing on with new master even if it
> means we don't exactly line up a new master 0.89.x (or a 0.90.xRC)
> release w/ HW.  You fellas talking at HW can fudge it some if it comes
> to that (We won't be the only ones.  CDH3 will not be ready for HW
> apparently).

CDH3b3 will be ready for Hadoop world, and we'd kind of like to freeze
component versions at this point in the beta cycle. So if 0.90 is out,
that would be great. We can certainly work with what we've got
(20100830 minus ZK assignment) but if a production-worth new master
isn't ready by the end of October or so we'll probably push that out
til CDH4.

> I'm thinking we can afford to take such a position
> because if someone wants durable hbase now, they can run with the
> SU-prod 0.89.x that J-D is about to put up.
>

This is going to be an official Apache release, right? I guess my
question is: if it's stable and usable for production, why call it a
"development release"? If we're recommending it for all new users over
and above 0.20.6, then it seems like this should be deemed stable (i.e. an
even release number).

> Putting off new master till 0.92 means it'll be maybe 6 months before
> it appears.  During this time we'll be paying a high price keeping up
> two disparate branches -- TRUNK w/ new master and the release branch
> -- shoe-horning patches to fit both.
>

If you guys are running 20100830 in production, won't you be doing
that anyway? Assumedly we'd treat this 0.90 as "no new features" and
put the new features into 0.91.x leading up to 0.92?

> We also confuse the 0.90 'message' given we've been talking about new
> master at HUGs and here on the lists with a good while now.

True. The question is whether we prefer to slip time or slip scope. In
my opinion slipping scope is better - it's open source and people
understand that schedules slip. Keeping strong release momentum up
helps adoption and will get people off 0.20.6 which no one really
wants to support anymore.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Stack <st...@duboce.net>.
On Tue, Sep 21, 2010 at 4:49 AM, Jonathan Gray <jg...@facebook.com> wrote:
> My main concern is that this move will push back any release of the new master significantly.  There are countless improvements in the codebase that came along with the rewrite, well beyond just zk transitions.  But doing a production release in time for HW is probably the most important thing.
>

IMO, we're almost there w/ the new master.  I'd like to wait till
Thursday or Friday before making a call.

Regards HW, we might consider pressing on with new master even if it
means we don't exactly line up a new master 0.89.x (or a 0.90.xRC)
release w/ HW.  You fellas talking at HW can fudge it some if it comes
to that (We won't be the only ones.  CDH3 will not be ready for HW
apparently).  I'm thinking we can afford to take such a position
because if someone wants durable hbase now, they can run with the
SU-prod 0.89.x that J-D is about to put up.

Putting off new master till 0.92 means it'll be maybe 6 months before
it appears.  During this time we'll be paying a high price keeping up
two disparate branches -- TRUNK w/ new master and the release branch
-- shoe-horning patches to fit both.

We also confuse the 0.90 'message' given we've been talking about new
master at HUGs and here on the lists with a good while now.

St.Ack

Re: old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Jean-Daniel Cryans <jd...@apache.org>.
So while you were on the plane a few things were discussed :)

In the thread "[VOTE] Release 'development release' HBase 0.89.2010830
rc2", it was decided that we sink RC2 in favor of a new RC with a
rollback of the zk-based assignment. This is what we're running here,
and Ryan published our repo yesterday
http://github.com/stumbleupon/hbase  (see the top of the CHANGES file
for what we added)

Also some of your arguments correlate those of Todd, expressed in the
thread "Next release" that was posted on 09/15. Stack and Andrew
expressed their opinions in favor of Todd's option #1 (although Stack
said that he would decide tomorrow if he changes his opinion after
some more work on the master).

My personal opinion leans towards Todd's option #2. 0.20.0 is now more
than a year old, and we are able to release almost production-ready code
that is durable, so why not move forward with it (this is what we did
here)?  There are a few issues to fix, but that's small compared to
getting the new master in shape plus getting the rest working (like
the replication code's interaction with ZooKeeper, which needs to be
redone).

J-D

On Tue, Sep 21, 2010 at 4:49 AM, Jonathan Gray <jg...@facebook.com> wrote:
> Deep within my soul I do not want to do this.
>
> But it might be practical.
>
> FB is going into prod w/ the old master and we've been doing work stabilizing it in ways that do not apply at all to the new one (especially around zk).  That'll make having two different active branches a bit of a nightmare but right now some of these patches are not even available as we've been operating under the assumption that the release would be on the new master (and the patches don't apply on trunk).
>
> I guess SU is also putting an 0.89 old master release into prod.
>
> What's a little unfortunate is that the 0.89 releases include the unnecessary move of some transition communication into ZK.  That was put in as a first step towards new master before we decided to branch it off.
>
> The question for me would be whether we think it's at all feasible to get the new master 0.90 released in time for HW.  If not, maybe we should consider taking an 0.89 branch, making it an 0.90 branch, and focus on stabilizing for release in time for HW, as Todd suggests.
>
> There are other ramifications of this.  It would be very difficult for me to get my flush/compact/split improvements in to the old master as the new implementation I've been working on relies completely on the new stuff.  But maybe better to punt that for 0.92 as well so can really nail it?
>
> The other factor is the enormous amount of ZK improvements that were done in the new master branch.  It's a real fuckin mess in 0.89 releases, though I've done a bit of cleanup already towards making it at least tenable for production.
>
> My main concern is that this move will push back any release of the new master significantly.  There are countless improvements in the codebase that came along with the rewrite, well beyond just zk transitions.  But doing a production release in time for HW is probably the most important thing.
>
> JG
>
>> -----Original Message-----
>> From: Todd Lipcon [mailto:todd@cloudera.com]
>> Sent: Tuesday, September 21, 2010 8:10 AM
>> To: dev
>> Subject: Re: Millions of photos into Hbase
>>
>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com>
>> wrote:
>> > Hey,
>> >
>> > There is actually only 1 active branch of hbase, that being the 0.89
>> > release, which is based on 'trunk'.  We have snapshotted a series of
>> > 0.89 "developer releases" in hopes that people would try them our and
>> > start thinking about the next major version.  One of these is what SU
>> > is running prod on.
>> >
>> > At this point tracking 0.89 and which ones are the 'best' peach sets
>> > to run is a bit of a contact sport, but if you are serious about not
>> > losing data it is worthwhile.  SU is based on the most recent DR with
>> > a few minor patches of our own concoction brought in.  If current
>> > works, but some Master ops are slow, and there are a few patches on
>> > top of that.  I'll poke about and see if its possible to publish to a
>> > github branch or something.
>> >
>>
>> This is why I kind of want to release the latest 0.89 dev release as
>> 0.90, and push off the new master stuff as 0.92 :)
>>
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>

old master to 0.90 and new master to 0.92? (was RE: Millions of photos into Hbase)

Posted by Jonathan Gray <jg...@facebook.com>.
Deep within my soul I do not want to do this.

But it might be practical.

FB is going into prod w/ the old master and we've been doing work stabilizing it in ways that do not apply at all to the new one (especially around zk).  That'll make having two different active branches a bit of a nightmare but right now some of these patches are not even available as we've been operating under the assumption that the release would be on the new master (and the patches don't apply on trunk).

I guess SU is also putting an 0.89 old master release into prod.

What's a little unfortunate is that the 0.89 releases include the unnecessary move of some transition communication into ZK.  That was put in as a first step towards new master before we decided to branch it off.

The question for me would be whether we think it's at all feasible to get the new master 0.90 released in time for HW.  If not, maybe we should consider taking an 0.89 branch, making it an 0.90 branch, and focus on stabilizing for release in time for HW, as Todd suggests.

There are other ramifications of this.  It would be very difficult for me to get my flush/compact/split improvements into the old master, as the new implementation I've been working on relies completely on the new stuff.  But maybe it's better to punt that to 0.92 as well so we can really nail it?

The other factor is the enormous amount of ZK improvements that were done in the new master branch.  It's a real fuckin mess in 0.89 releases, though I've done a bit of cleanup already towards making it at least tenable for production.

My main concern is that this move will push back any release of the new master significantly.  There are countless improvements in the codebase that came along with the rewrite, well beyond just zk transitions.  But doing a production release in time for HW is probably the most important thing.

JG

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Tuesday, September 21, 2010 8:10 AM
> To: dev
> Subject: Re: Millions of photos into Hbase
> 
> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com>
> wrote:
> > Hey,
> >
> > There is actually only 1 active branch of hbase, that being the 0.89
> > release, which is based on 'trunk'.  We have snapshotted a series of
> > 0.89 "developer releases" in hopes that people would try them our and
> > start thinking about the next major version.  One of these is what SU
> > is running prod on.
> >
> > At this point tracking 0.89 and which ones are the 'best' peach sets
> > to run is a bit of a contact sport, but if you are serious about not
> > losing data it is worthwhile.  SU is based on the most recent DR with
> > a few minor patches of our own concoction brought in.  If current
> > works, but some Master ops are slow, and there are a few patches on
> > top of that.  I'll poke about and see if its possible to publish to a
> > github branch or something.
> >
> 
> This is why I kind of want to release the latest 0.89 dev release as
> 0.90, and push off the new master stuff as 0.92 :)
> 
> -Todd
> 
> --
> Todd Lipcon
> Software Engineer, Cloudera

Re: Millions of photos into Hbase

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ry...@gmail.com> wrote:
> Hey,
>
> There is actually only 1 active branch of hbase, that being the 0.89
> release, which is based on 'trunk'.  We have snapshotted a series of
> 0.89 "developer releases" in hopes that people would try them our and
> start thinking about the next major version.  One of these is what SU
> is running prod on.
>
> At this point tracking 0.89 and which ones are the 'best' peach sets
> to run is a bit of a contact sport, but if you are serious about not
> losing data it is worthwhile.  SU is based on the most recent DR with
> a few minor patches of our own concoction brought in.  If current
> works, but some Master ops are slow, and there are a few patches on
> top of that.  I'll poke about and see if its possible to publish to a
> github branch or something.
>

This is why I kind of want to release the latest 0.89 dev release as
0.90, and push off the new master stuff as 0.92 :)

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

There is actually only 1 active branch of hbase, that being the 0.89
release, which is based on 'trunk'.  We have snapshotted a series of
0.89 "developer releases" in hopes that people would try them our and
start thinking about the next major version.  One of these is what SU
is running prod on.

At this point tracking 0.89 and which ones are the 'best' patch sets
to run is a bit of a contact sport, but if you are serious about not
losing data it is worthwhile.  SU is based on the most recent DR with
a few minor patches of our own concoction brought in.  The current DR
works, but some Master ops are slow, and there are a few patches on
top of that.  I'll poke about and see if it's possible to publish to a
github branch or something.

-ryan

On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <ma...@gmail.com> wrote:
> Sounds, good, only reason I ask is because of this:
>
> There are currently two active branches of HBase:
>
>    * 0.20 - the current stable release series, being maintained with
> patches for bug fixes only. This release series does not support HDFS
> durability - edits may be lost in the case of node failure.
>    * 0.89 - a development release series with active feature and
> stability development, not currently recommended for production use.
> This release does support HDFS durability - cases in which edits are
> lost are considered serious bugs.
>>>>>>>
>
> Are we talking about data loss in case of datanode going down while
> being written to, or RegionServer going down?
>
> -jack
>
>
> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>
>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>> release you will lose data.  you must be on 0.89 and
>> CDH3/append-branch to achieve data durability, and there really is no
>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>> stop and rebase those tests onto the latest DR announced on the list.
>>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>>> Hi Stack, see inline:
>>>
>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>>> Hey Jack:
>>>>
>>>> Thanks for writing.
>>>>
>>>> See below for some comments.
>>>>
>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>>
>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>> usually stored on regular servers (we have about 700), as regular
>>>>> files, and each server has its own host name, such as (img55).   I've
>>>>> been researching on how to improve our backend design in terms of data
>>>>> safety and stumped onto the Hbase project.
>>>>>
>>>>
>>>> Any other requirements other than data safety? (latency, etc).
>>>
>>> Latency is the second requirement.  We have some services that are
>>> very short tail, and can produce 95% cache hit rate, so I assume this
>>> would really put cache into good use.  Some other services however,
>>> have about 25% cache hit ratio, in which case the latency should be
>>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>>> then its good enough.   Safely is supremely important, then its
>>> availability, then speed.
>>>
>>>
>>>
>>>>> Now, I think hbase is he most beautiful thing that happen to
>>>>> distributed DB world :).   The idea is to store image files (about
>>>>> 400Kb on average into HBASE).
>>>>
>>>>
>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>> the size of images folks can upload to your service?
>>>>
>>>>
>>>> The setup will include the following
>>>>> configuration:
>>>>>
>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>> 2TB disks each.
>>>>> 3 to 5 Zookeepers
>>>>> 2 Masters (in a datacenter each)
>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>
>>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>> src yourself).
>>>
>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>> For reading, its a nginx proxy that does Content-type modification
>>> from image/jpeg to octet-stream, and vice versa,
>>> it then hits Haproxy again, which hits balanced REST.
>>> Why REST, it was the simplest thing to run, given that its supports
>>> HTTP, potentially we could rewrite something for thrift, as long as we
>>> can use http still to send and receive data (anyone wrote anything
>>> like that say in python, C or java?)
>>>
>>>>
>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>> edits snapshots also)
>>>>>
>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>> fork-insert them into stargate via http (curl).
>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>> to about 400KB , trying to match average access block to limit HDFS
>>>>> trips.
>>>>
>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>
>>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>>> function of memstore HEAP, something like 40%?  Or are you talking
>>> about HDFS block size?
>>>
>>>>  So far the read performance was more than adequate, and of
>>>>> course write performance is nowhere near capacity.
>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>> The end goal is to have a storage system that creates data safety,
>>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>>> servers will continue to serve images from their own file system (we
>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>>> server is repaired (for example having its disk replaced), after the
>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>> the missing remaining off Hbase.
>>>>> All in all should be very interesting project, and I am hoping not to
>>>>> run into any snags, however, should that happens, I am pleased to know
>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>> HBASE :).
>>>>>
>>>>
>>>> We're definetly interested in how your project progresses.  If you are
>>>> ever up in the city, you should drop by for a chat.
>>>
>>> Cool.  I'd like that.
>>>
>>>> St.Ack
>>>>
>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>> P.P.S I updated the wiki on stargate REST:
>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>
>>> Cool, I assume if we move to that it won't kill existing meta tables,
>>> and data?  e.g. cross compatible?
>>> Is 0.89 ready for production environment?
>>>
>>> -Jack
>>>
>>
>

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Sounds good; the only reason I ask is because of this:

There are currently two active branches of HBase:

    * 0.20 - the current stable release series, being maintained with
patches for bug fixes only. This release series does not support HDFS
durability - edits may be lost in the case of node failure.
    * 0.89 - a development release series with active feature and
stability development, not currently recommended for production use.
This release does support HDFS durability - cases in which edits are
lost are considered serious bugs.
>>>>>>

Are we talking about data loss in the case of a datanode going down while
being written to, or a RegionServer going down?

-jack


On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ry...@gmail.com> wrote:
> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>
> As for safety, you have no choice but to run 0.89.  If you run a 0.20
> release you will lose data.  you must be on 0.89 and
> CDH3/append-branch to achieve data durability, and there really is no
> argument around it.  If you are doing your tests with 0.20.6 now, I'd
> stop and rebase those tests onto the latest DR announced on the list.
>
> -ryan
>
> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
>> Hi Stack, see inline:
>>
>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>> Hey Jack:
>>>
>>> Thanks for writing.
>>>
>>> See below for some comments.
>>>
>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>>
>>>> Image-Shack gets close to two million image uploads per day, which are
>>>> usually stored on regular servers (we have about 700), as regular
>>>> files, and each server has its own host name, such as (img55).   I've
>>>> been researching on how to improve our backend design in terms of data
>>>> safety and stumped onto the Hbase project.
>>>>
>>>
>>> Any other requirements other than data safety? (latency, etc).
>>
>> Latency is the second requirement.  We have some services that are
>> very short tail, and can produce 95% cache hit rate, so I assume this
>> would really put cache into good use.  Some other services however,
>> have about 25% cache hit ratio, in which case the latency should be
>> 'adequate', e.g. if its slightly worse than getting data off raw disk,
>> then its good enough.   Safely is supremely important, then its
>> availability, then speed.
>>
>>
>>
>>>> Now, I think hbase is he most beautiful thing that happen to
>>>> distributed DB world :).   The idea is to store image files (about
>>>> 400Kb on average into HBASE).
>>>
>>>
>>> I'd guess some images are much bigger than this.  Do you ever limit
>>> the size of images folks can upload to your service?
>>>
>>>
>>> The setup will include the following
>>>> configuration:
>>>>
>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>> 2TB disks each.
>>>> 3 to 5 Zookeepers
>>>> 2 Masters (in a datacenter each)
>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>
>>> Whats your frontend?  Why REST?  It might be more efficient if you
>>> could run with thrift given REST base64s its payload IIRC (check the
>>> src yourself).
>>
>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>> For reading, its a nginx proxy that does Content-type modification
>> from image/jpeg to octet-stream, and vice versa,
>> it then hits Haproxy again, which hits balanced REST.
>> Why REST, it was the simplest thing to run, given that its supports
>> HTTP, potentially we could rewrite something for thrift, as long as we
>> can use http still to send and receive data (anyone wrote anything
>> like that say in python, C or java?)
>>
>>>
>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>> edits snapshots also)
>>>>
>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>>> Stargate API.  Our frontend servers receive files, and I just
>>>> fork-insert them into stargate via http (curl).
>>>> The inserts are humming along nicely, without any noticeable load on
>>>> regionservers, so far inserted about 2 TB worth of images.
>>>> I have adjusted the region file size to be 512MB, and table block size
>>>> to about 400KB , trying to match average access block to limit HDFS
>>>> trips.
>>>
>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>
>> Yep, i will adjust to 1G.  I thought flush was controlled by a
>> function of memstore HEAP, something like 40%?  Or are you talking
>> about HDFS block size?
>>
>>>  So far the read performance was more than adequate, and of
>>>> course write performance is nowhere near capacity.
>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>> to insert about 170 Million images (about 100 days worth), which is
>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>> The end goal is to have a storage system that creates data safety,
>>>> e.g. system may go down but data can not be lost.   Our Front-End
>>>> servers will continue to serve images from their own file system (we
>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>> any of those down for maintenance, we will redirect all traffic to
>>>> Hbase (should be no more than few hundred Mbps), while the front end
>>>> server is repaired (for example having its disk replaced), after the
>>>> repairs, we quickly repopulate it with missing files, while serving
>>>> the missing remaining off Hbase.
>>>> All in all should be very interesting project, and I am hoping not to
>>>> run into any snags, however, should that happens, I am pleased to know
>>>> that such a great and vibrant tech group exists that supports and uses
>>>> HBASE :).
>>>>
>>>
>>> We're definetly interested in how your project progresses.  If you are
>>> ever up in the city, you should drop by for a chat.
>>
>> Cool.  I'd like that.
>>
>>> St.Ack
>>>
>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>> P.P.S I updated the wiki on stargate REST:
>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>
>> Cool, I assume if we move to that it won't kill existing meta tables,
>> and data?  e.g. cross compatible?
>> Is 0.89 ready for production environment?
>>
>> -Jack
>>
>

Re: Millions of photos into Hbase

Posted by Ryan Rawson <ry...@gmail.com>.
We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...

As for safety, you have no choice but to run 0.89.  If you run a 0.20
release you will lose data.  You must be on 0.89 and
CDH3/append-branch to achieve data durability, and there really is no
argument around it.  If you are doing your tests with 0.20.6 now, I'd
stop and rebase those tests onto the latest DR announced on the list.
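
For reference, the HDFS side of that boils down to running an append-capable
build and turning the append flag on; a rough sketch of the hdfs-site.xml
entry, assuming CDH3 or branch-0.20-append (double-check your distro's docs):

    <!-- hdfs-site.xml on an append-capable build (CDH3 / branch-0.20-append);
         without this HBase cannot sync its WAL and edits can be lost when a
         node dies -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>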

-ryan

On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <ma...@gmail.com> wrote:
> Hi Stack, see inline:
>
> On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
>> Hey Jack:
>>
>> Thanks for writing.
>>
>> See below for some comments.
>>
>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>>
>>> Image-Shack gets close to two million image uploads per day, which are
>>> usually stored on regular servers (we have about 700), as regular
>>> files, and each server has its own host name, such as (img55).   I've
>>> been researching on how to improve our backend design in terms of data
>>> safety and stumped onto the Hbase project.
>>>
>>
>> Any other requirements other than data safety? (latency, etc).
>
> Latency is the second requirement.  We have some services that are
> very short tail, and can produce 95% cache hit rate, so I assume this
> would really put cache into good use.  Some other services however,
> have about 25% cache hit ratio, in which case the latency should be
> 'adequate', e.g. if its slightly worse than getting data off raw disk,
> then its good enough.   Safely is supremely important, then its
> availability, then speed.
>
>
>
>>> Now, I think hbase is he most beautiful thing that happen to
>>> distributed DB world :).   The idea is to store image files (about
>>> 400Kb on average into HBASE).
>>
>>
>> I'd guess some images are much bigger than this.  Do you ever limit
>> the size of images folks can upload to your service?
>>
>>
>> The setup will include the following
>>> configuration:
>>>
>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>> 2TB disks each.
>>> 3 to 5 Zookeepers
>>> 2 Masters (in a datacenter each)
>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>
>> Whats your frontend?  Why REST?  It might be more efficient if you
>> could run with thrift given REST base64s its payload IIRC (check the
>> src yourself).
>
> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
> For reading, its a nginx proxy that does Content-type modification
> from image/jpeg to octet-stream, and vice versa,
> it then hits Haproxy again, which hits balanced REST.
> Why REST, it was the simplest thing to run, given that its supports
> HTTP, potentially we could rewrite something for thrift, as long as we
> can use http still to send and receive data (anyone wrote anything
> like that say in python, C or java?)
>
>>
>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>> edits snapshots also)
>>>
>>> So far I got about 13 servers running, and doing about 20 insertions /
>>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>>> Stargate API.  Our frontend servers receive files, and I just
>>> fork-insert them into stargate via http (curl).
>>> The inserts are humming along nicely, without any noticeable load on
>>> regionservers, so far inserted about 2 TB worth of images.
>>> I have adjusted the region file size to be 512MB, and table block size
>>> to about 400KB , trying to match average access block to limit HDFS
>>> trips.
>>
>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>
> Yep, i will adjust to 1G.  I thought flush was controlled by a
> function of memstore HEAP, something like 40%?  Or are you talking
> about HDFS block size?
>
>>  So far the read performance was more than adequate, and of
>>> course write performance is nowhere near capacity.
>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>> to insert about 170 Million images (about 100 days worth), which is
>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>> The end goal is to have a storage system that creates data safety,
>>> e.g. system may go down but data can not be lost.   Our Front-End
>>> servers will continue to serve images from their own file system (we
>>> are serving about 16 Gbits at peak), however should we need to bring
>>> any of those down for maintenance, we will redirect all traffic to
>>> Hbase (should be no more than few hundred Mbps), while the front end
>>> server is repaired (for example having its disk replaced), after the
>>> repairs, we quickly repopulate it with missing files, while serving
>>> the missing remaining off Hbase.
>>> All in all should be very interesting project, and I am hoping not to
>>> run into any snags, however, should that happens, I am pleased to know
>>> that such a great and vibrant tech group exists that supports and uses
>>> HBASE :).
>>>
>>
>> We're definetly interested in how your project progresses.  If you are
>> ever up in the city, you should drop by for a chat.
>
> Cool.  I'd like that.
>
>> St.Ack
>>
>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>> P.P.S I updated the wiki on stargate REST:
>> http://wiki.apache.org/hadoop/Hbase/Stargate
>
> Cool, I assume if we move to that it won't kill existing meta tables,
> and data?  e.g. cross compatible?
> Is 0.89 ready for production environment?
>
> -Jack
>

Re: Millions of photos into Hbase

Posted by Jack Levin <ma...@gmail.com>.
Hi Stack, see inline:

On Mon, Sep 20, 2010 at 2:42 PM, Stack <st...@duboce.net> wrote:
> Hey Jack:
>
> Thanks for writing.
>
> See below for some comments.
>
> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>>
>> Image-Shack gets close to two million image uploads per day, which are
>> usually stored on regular servers (we have about 700), as regular
>> files, and each server has its own host name, such as (img55).   I've
>> been researching on how to improve our backend design in terms of data
>> safety and stumped onto the Hbase project.
>>
>
> Any other requirements other than data safety? (latency, etc).

Latency is the second requirement.  We have some services that are
very short tail, and can produce a 95% cache hit rate, so I assume this
would really put the cache to good use.  Some other services, however,
have about a 25% cache hit ratio, in which case the latency should be
'adequate', e.g. if it's slightly worse than getting data off raw disk,
then it's good enough.   Safety is supremely important, then
availability, then speed.



>> Now, I think hbase is he most beautiful thing that happen to
>> distributed DB world :).   The idea is to store image files (about
>> 400Kb on average into HBASE).
>
>
> I'd guess some images are much bigger than this.  Do you ever limit
> the size of images folks can upload to your service?
>
>
> The setup will include the following
>> configuration:
>>
>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>> 2TB disks each.
>> 3 to 5 Zookeepers
>> 2 Masters (in a datacenter each)
>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>
> Whats your frontend?  Why REST?  It might be more efficient if you
> could run with thrift given REST base64s its payload IIRC (check the
> src yourself).

For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
For reading, it's an nginx proxy that does Content-Type modification
from image/jpeg to octet-stream, and vice versa; it then hits Haproxy
again, which hits the balanced REST instances.
Why REST?  It was the simplest thing to run, given that it supports
HTTP.  Potentially we could rewrite something for Thrift, as long as we
can still use HTTP to send and receive data (has anyone written anything
like that, say in Python, C or Java?)
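
As a rough sketch of what that insert path amounts to (Python instead of
curl; the table name 'img', family 'image' and the rest.example.com:8080
endpoint are made-up placeholders, not our real schema):

    # Hedged sketch of a binary insert through Stargate REST (Python 2 era
    # httplib).  Table 'img' with column family 'image', and the
    # HAProxy-fronted REST endpoint rest.example.com:8080, are illustrative.
    import httplib

    def put_image(row_key, data):
        conn = httplib.HTTPConnection("rest.example.com", 8080)
        # PUT /<table>/<row>/<family:qualifier> with a raw body stores the
        # bytes as the cell value; application/octet-stream avoids the
        # base64 overhead of the XML/JSON representations.
        conn.request("PUT", "/img/%s/image:data" % row_key, data,
                     {"Content-Type": "application/octet-stream"})
        resp = conn.getresponse()
        resp.read()
        conn.close()
        return resp.status  # expect 200 on success

    if __name__ == "__main__":
        with open("photo.jpg", "rb") as f:
            status = put_image("photo12345", f.read())

Nothing more to it than one HTTP PUT per image; Haproxy does the fan-out
across the REST instances.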

>
>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>> 2 Namenode servers (one backup, highly available, will do fsimage and
>> edits snapshots also)
>>
>> So far I got about 13 servers running, and doing about 20 insertions /
>> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
>> Stargate API.  Our frontend servers receive files, and I just
>> fork-insert them into stargate via http (curl).
>> The inserts are humming along nicely, without any noticeable load on
>> regionservers, so far inserted about 2 TB worth of images.
>> I have adjusted the region file size to be 512MB, and table block size
>> to about 400KB , trying to match average access block to limit HDFS
>> trips.
>
> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
> probably want to up your flush size from 64MB to 128MB or maybe 192MB.

Yep, I will adjust to 1G.  I thought flush was controlled as a
function of the memstore's share of heap, something like 40%?  Or are you
talking about the HDFS block size?
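
For what it's worth, my understanding is that those are two different knobs;
a hedged sketch of the hbase-site.xml entries involved (property names as in
the 0.20-era docs -- worth double-checking against whatever release we land on):

    <!-- Max store file size before a region splits (the "region file size"):
         1GB instead of 512MB, per the suggestion above -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>
    </property>

    <!-- Per-region memstore flush threshold (default is 64MB); this is the
         64MB -> 128/192MB number Stack is referring to -->
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>
    </property>

    <!-- The ~40% figure is a different setting: the global memstore cap as a
         fraction of the regionserver heap -->
    <property>
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>
    </property>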

>  So far the read performance was more than adequate, and of
>> course write performance is nowhere near capacity.
>> So right now, all newly uploaded images go to HBASE.  But we do plan
>> to insert about 170 Million images (about 100 days worth), which is
>> only about 64 TB, or 10% of planned cluster size of 600TB.
>> The end goal is to have a storage system that creates data safety,
>> e.g. system may go down but data can not be lost.   Our Front-End
>> servers will continue to serve images from their own file system (we
>> are serving about 16 Gbits at peak), however should we need to bring
>> any of those down for maintenance, we will redirect all traffic to
>> Hbase (should be no more than few hundred Mbps), while the front end
>> server is repaired (for example having its disk replaced), after the
>> repairs, we quickly repopulate it with missing files, while serving
>> the missing remaining off Hbase.
>> All in all should be very interesting project, and I am hoping not to
>> run into any snags, however, should that happens, I am pleased to know
>> that such a great and vibrant tech group exists that supports and uses
>> HBASE :).
>>
>
> We're definetly interested in how your project progresses.  If you are
> ever up in the city, you should drop by for a chat.

Cool.  I'd like that.

> St.Ack
>
> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
> P.P.S I updated the wiki on stargate REST:
> http://wiki.apache.org/hadoop/Hbase/Stargate

Cool, I assume if we move to that it won't kill existing meta tables
and data?  E.g. is it cross compatible?
Is 0.89 ready for a production environment?

-Jack

Re: Millions of photos into Hbase

Posted by Andrew Purtell <ap...@apache.org>.
> From: Stack <st...@duboce.net>
> Whats your frontend?  Why REST?  It might be more efficient if you
> could run with thrift given REST base64s its payload IIRC (check the
> src yourself).

Stargate (and rest in trunk) supports binary puts via protobufs or application/octet-stream. 
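
For example, a minimal sketch of a raw read that way, so the payload is never
base64'd on the wire (the table 'img', column 'image:data' and host:port
below are invented for illustration):

    # Hedged sketch: fetch a cell back as raw bytes from Stargate/REST by
    # asking for application/octet-stream rather than the XML/JSON (base64)
    # representations.  Names and endpoint are placeholders (Python 2 httplib).
    import httplib

    def get_image(row_key):
        conn = httplib.HTTPConnection("rest.example.com", 8080)
        conn.request("GET", "/img/%s/image:data" % row_key,
                     headers={"Accept": "application/octet-stream"})
        resp = conn.getresponse()
        body = resp.read()
        conn.close()
        if resp.status == 200:
            return body   # raw image bytes (most recent version of the cell)
        return None       # e.g. 404 when the row/cell does not exist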

Best regards,

    - Andy

Why is this email five sentences or less?
http://five.sentenc.es/



      


Re: Millions of photos into Hbase

Posted by Stack <st...@duboce.net>.
Hey Jack:

Thanks for writing.

See below for some comments.

On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <ma...@gmail.com> wrote:
>
> Image-Shack gets close to two million image uploads per day, which are
> usually stored on regular servers (we have about 700), as regular
> files, and each server has its own host name, such as (img55).   I've
> been researching on how to improve our backend design in terms of data
> safety and stumped onto the Hbase project.
>

Any other requirements other than data safety? (latency, etc).

> Now, I think hbase is he most beautiful thing that happen to
> distributed DB world :).   The idea is to store image files (about
> 400Kb on average into HBASE).


I'd guess some images are much bigger than this.  Do you ever limit
the size of images folks can upload to your service?


The setup will include the following
> configuration:
>
> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
> 2TB disks each.
> 3 to 5 Zookeepers
> 2 Masters (in a datacenter each)
> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)

What's your frontend?  Why REST?  It might be more efficient if you
could run with thrift given REST base64s its payload IIRC (check the
src yourself).

> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
> 2 Namenode servers (one backup, highly available, will do fsimage and
> edits snapshots also)
>
> So far I got about 13 servers running, and doing about 20 insertions /
> second (file size ranging from few KB to 2-3MB, ave. 400KB). via
> Stargate API.  Our frontend servers receive files, and I just
> fork-insert them into stargate via http (curl).
> The inserts are humming along nicely, without any noticeable load on
> regionservers, so far inserted about 2 TB worth of images.
> I have adjusted the region file size to be 512MB, and table block size
> to about 400KB , trying to match average access block to limit HDFS
> trips.

As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
probably want to up your flush size from 64MB to 128MB or maybe 192MB.

 So far the read performance was more than adequate, and of
> course write performance is nowhere near capacity.
> So right now, all newly uploaded images go to HBASE.  But we do plan
> to insert about 170 Million images (about 100 days worth), which is
> only about 64 TB, or 10% of planned cluster size of 600TB.
> The end goal is to have a storage system that creates data safety,
> e.g. system may go down but data can not be lost.   Our Front-End
> servers will continue to serve images from their own file system (we
> are serving about 16 Gbits at peak), however should we need to bring
> any of those down for maintenance, we will redirect all traffic to
> Hbase (should be no more than few hundred Mbps), while the front end
> server is repaired (for example having its disk replaced), after the
> repairs, we quickly repopulate it with missing files, while serving
> the missing remaining off Hbase.
> All in all should be very interesting project, and I am hoping not to
> run into any snags, however, should that happens, I am pleased to know
> that such a great and vibrant tech group exists that supports and uses
> HBASE :).
>

We're definitely interested in how your project progresses.  If you are
ever up in the city, you should drop by for a chat.

St.Ack

P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
P.P.S I updated the wiki on stargate REST:
http://wiki.apache.org/hadoop/Hbase/Stargate