Posted to user@hbase.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2008/11/18 03:18:11 UTC

Large webmail storage and Hbase

Hi,

I'm considering storing large-scale webmail data in HBase. I expect
it would let me handle both real-time serving and batch processing
(e.g. spam filtering, from/to graph analysis, etc.). But I'm still not
sure whether it's suitable for storing webmail data, since it would
have to support a stable, online, real-time webmail service.

Has anyone tried something similar (a real-time application), or does
anyone know about the Gmail architecture?
Any advice is welcome. Thanks!

-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org

Re: Large webmail storage and Hbase

Posted by "Edward J. Yoon" <ed...@apache.org>.
Thanks for your information. It's really helpful to me :)

On Thu, Nov 20, 2008 at 3:58 AM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> Edward,
>
> We're working on a user-facing web system backed by Hbase.  More
> read-oriented than a mail system, but it does also have web users writing to
> it.  We're making heavy use of memcached because HBase random read is not
> fast enough.  Haven't tried BLOCKCACHE yet, but reading a random row from
> HBase generally costs us about 150ms, which when multiplied by 10-20 records
> is expensive.  We think it's this slow because of the quantity of data we're
> transporting, but haven't fully figured it out yet -- MySQL and memcached
> can deliver the same quantity of data in 1/10th the time.  If you can model
> your data to favour reading with scanners instead of randomly, I'm sure you
> could do much better.  I know that the scanner code was recently optimized
> with a batching strategy.
>
> We're using Solr/Lucene for secondary indexes & searching.  We often display
> indexed results instead of retrieving data from the database.  We generally
> do only one HBase getRow call per user HTTP request, the rest comes from
> Solr or memcached.
>
> We haven't rolled out beyond a small alpha user group, so the system is not
> proven in the real world.  Like Stack says: try it and see what happens.
> And be prepared to switch to an ugly MySQL sharding approach if it doesn't
> work out.
>
> j
>
> On Tue, Nov 18, 2008 at 9:21 PM, Edward J. Yoon <ed...@apache.org>wrote:
>
>> Does anyone have some opinion about this?



-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org

RE: Large webmail storage and Hbase

Posted by Jonathan Gray <jl...@streamy.com>.
Edward,

We have a user-facing website backed fully by HBase.

Like Joost, we do a significant amount of random reading, and so far
out-of-the-box random read performance on HBase has not been sufficient.
We use a system very similar to memcached to solve this issue.  We also
have external indexes to deal with sorting, secondary indexing, etc.

Blockcache can help significantly depending on your usage patterns, and the
0.20 release of HBase is heavily focused on random read performance, though
it is still months away.

I would say it's certainly possible to build a webmail system on top of
HBase.  If you're running on 0.18/0.19, you'll first want to do performance
testing with blockcache, but you will probably still need a key/value cache
like memcached (I'm using Tokyo Cabinet).  Since e-mails are typically
immutable, this kind of cache will go a long way.
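
The point about immutability can be illustrated with a minimal read-through
cache sketch in Python. It is only an illustration under stated assumptions,
not the setup described in this message: the in-process dict stands in for
memcached or Tokyo Cabinet, and ReadThroughCache, slow_store_read and the
"msg-1" id are hypothetical names. Because a stored message never changes,
the cache needs no invalidation or expiry logic.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Tiny in-process stand-in for a key/value cache such as memcached."""

    def __init__(self, fetch_from_store: Callable[[str], bytes]) -> None:
        # fetch_from_store is a hypothetical function that performs the slow
        # random read against the backing store (e.g. one row lookup).
        self._fetch = fetch_from_store
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, message_id: str) -> bytes:
        if message_id in self._cache:
            self.hits += 1
            return self._cache[message_id]
        self.misses += 1
        body = self._fetch(message_id)
        # Messages are assumed immutable, so the entry never needs
        # invalidating once it has been stored.
        self._cache[message_id] = body
        return body


# Fake backing store, used only for this illustration.
store_reads = []


def slow_store_read(message_id: str) -> bytes:
    store_reads.append(message_id)
    return f"body of {message_id}".encode()


cache = ReadThroughCache(slow_store_read)
first = cache.get("msg-1")   # miss: goes to the backing store
second = cache.get("msg-1")  # hit: served from the cache
```

Only the first read of a given message reaches the backing store; a real
deployment would also have to bound the cache size, which this sketch omits.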

JG



Re: Large webmail storage and Hbase

Posted by Joost Ouwerkerk <jo...@openplaces.org>.
Edward,

We're working on a user-facing web system backed by HBase.  More
read-oriented than a mail system, but it does also have web users writing to
it.  We're making heavy use of memcached because HBase random read is not
fast enough.  Haven't tried BLOCKCACHE yet, but reading a random row from
HBase generally costs us about 150ms, which when multiplied by 10-20 records
is expensive.  We think it's this slow because of the quantity of data we're
transporting, but haven't fully figured it out yet -- MySQL and memcached
can deliver the same quantity of data in 1/10th the time.  If you can model
your data to favour reading with scanners instead of randomly, I'm sure you
could do much better.  I know that the scanner code was recently optimized
with a batching strategy.
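
One way to model data so that it favours scanners is sketched below in
Python. HBase keeps rows sorted by row key, so keys that share a prefix sit
next to each other and can be read with a single scan. The key layout here
(user id, then a reversed timestamp, then a message id, joined with "|") is
purely a hypothetical example for a mailbox, not something we use, and it
assumes user ids never contain "|". The sketch only builds and sorts keys; it
does not talk to HBase.

```python
MAX_TS = 2**63 - 1  # largest signed 64-bit value, used to reverse timestamps


def mailbox_row_key(user_id: str, received_ms: int, message_id: str) -> bytes:
    """Build a hypothetical row key: user, newest-first timestamp, message id."""
    # Subtracting from MAX_TS makes newer messages sort before older ones.
    reversed_ts = MAX_TS - received_ms
    # Zero-pad to 19 digits so byte-wise order matches numeric order.
    return f"{user_id}|{reversed_ts:019d}|{message_id}".encode()


def scan_prefix(user_id: str) -> bytes:
    """Prefix shared by every row of one user's mailbox."""
    return f"{user_id}|".encode()


# Simulate the sorted key space of a table with a few rows.
keys = sorted([
    mailbox_row_key("bob", 1000, "m1"),
    mailbox_row_key("alice", 1000, "m2"),
    mailbox_row_key("alice", 2000, "m3"),
    mailbox_row_key("alice2", 1500, "m4"),
])

# All of alice's rows are adjacent in the sorted order, newest first,
# so one scan starting at the prefix can return them in sequence.
alice_rows = [k for k in keys if k.startswith(scan_prefix("alice"))]
alice_ids = [k.decode().split("|")[2] for k in alice_rows]
```

With a layout like this, showing a user's most recent messages becomes one
scan over adjacent rows instead of 10-20 separate random reads.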

We're using Solr/Lucene for secondary indexes & searching.  We often display
indexed results instead of retrieving data from the database.  We generally
do only one HBase getRow call per user HTTP request; the rest comes from
Solr or memcached.

We haven't rolled out beyond a small alpha user group, so the system is not
proven in the real world.  Like Stack says: try it and see what happens.
And be prepared to switch to an ugly MySQL sharding approach if it doesn't
work out.

j

On Tue, Nov 18, 2008 at 9:21 PM, Edward J. Yoon <ed...@apache.org>wrote:

> Does anyone have some opinion about this?

Re: Large webmail storage and Hbase

Posted by "Edward J. Yoon" <ed...@apache.org>.
Thanks, St.Ack. I'll check the speed of migration.

On Thu, Nov 20, 2008 at 2:41 AM, stack <st...@duboce.net> wrote:
> Edward J. Yoon wrote:
>>
>> Does anyone have some opinion about this?
>>
>
> Try it.  We'll help you out.
> St.Ack
>



-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org

Re: Large webmail storage and Hbase

Posted by stack <st...@duboce.net>.
Edward J. Yoon wrote:
> Does anyone have some opinion about this?
>   
Try it.  We'll help you out.
St.Ack

Re: Large webmail storage and Hbase

Posted by "Edward J. Yoon" <ed...@apache.org>.
Does anyone have some opinion about this?




-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org