Posted to user@hbase.apache.org by Bradford Stephens <br...@gmail.com> on 2009/03/01 01:41:41 UTC
Re: HBase and Web-Scale BI

Yeah, it seems on the edge of feasibility for me. I'd much rather see a
queue + MapReduceful model.

On Fri, Feb 27, 2009 at 8:28 AM, Andrew Purtell <ap...@apache.org> wrote:

> I have done something like this for a different domain but with
> similar scale and user demands. The analysts by necessity needed
> to specify their queries in advance, and we periodically ran
> mapreduce jobs to materialize new results into a cache (another
> HBase table) as fresh data arrived. Serving answers out of the
> cache was then, of course, very fast. Because we were precomputing
> answers, the analysts needed to apply some forethought and
> discipline, query capacity had to be rationed, and workflows had to
> tolerate slightly out-of-date information. Despite all of
> these necessary "drawbacks", the system was quite successful.
>
> Without precomputing the answer to such queries I don't see how
> one can present an assembly of such information sourced from TB
> (or PB) of data in less than 10 seconds. The essential strategy
> here is shifting computation in time and trading cheap disk for
> probably impossible CPU and index I/O demands for would-be
> real-time queries.
>
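A minimal sketch of the precompute-then-serve strategy described above, applied to the "top keywords by author and date range" query from earlier in the thread. Everything here is an assumption for illustration: the class and method names are hypothetical, an in-memory TreeMap stands in for the cache table, `materialize` stands in for the periodic mapreduce job, and the cache key layout (author, a "|" separator, then an ISO day) assumes author names never contain "|". It is not HBase client code.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class KeywordRollup {
    // Sorted stand-in for the cache table.
    // Key: author + "|" + ISO day (yyyy-MM-dd); value: keyword -> count for that day.
    private final TreeMap<String, Map<String, Integer>> cache = new TreeMap<>();

    // Batch step (stand-in for a periodic mapreduce job): fold one
    // document's keywords into the per-author, per-day rollup.
    void materialize(String author, String isoDay, List<String> keywords) {
        Map<String, Integer> counts =
                cache.computeIfAbsent(author + "|" + isoDay, k -> new HashMap<>());
        for (String keyword : keywords) {
            counts.merge(keyword, 1, Integer::sum);
        }
    }

    // Serving step: merge the precomputed daily rollups for each author
    // over an inclusive day range, then keep the most frequent keywords.
    List<String> topKeywords(Collection<String> authors, String fromDay, String toDay, int limit) {
        Map<String, Integer> totals = new HashMap<>();
        for (String author : authors) {
            // ISO days sort lexicographically, so a key range is a date range.
            cache.subMap(author + "|" + fromDay, true, author + "|" + toDay, true)
                 .values()
                 .forEach(day -> day.forEach((kw, n) -> totals.merge(kw, n, Integer::sum)));
        }
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The serving step touches one small range of rollup rows per author instead of the raw documents, which is the "shifting computation in time" trade. The cost is the one already named: the query shape (here, author and day) has to be chosen in advance.
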
> Maybe someone else can speak up if they think I am being too
> pessimistic here.
>
> Hope this helps,
>
>   - Andy
>
> > From: Bradford Stephens <br...@gmail.com>
> > Subject: Re: HBase and Web-Scale BI
> > To: hbase-user@hadoop.apache.org
> > Date: Thursday, February 26, 2009, 4:05 PM
> > Sure, here we go! I'm not at all opposed to indexing
> > tables, etc. I just
> > want this thing to be fast and non-kludgy.
> >
> > Basically, we're getting social media (like Blogs),
> > normalizing the data
> > into fields, and then doing BI on that.
> >
> > Our data is pretty simple ... here's an example:
> >
> > Document:
> >
> > BodyText (string)
> > BodyText Keywords (Lucene indexed)
> > URL (indexed, key with collection time?)
> > ParentDocumentID
> > Post Date (datetime)
> > Author Name (indexed, string)
> > Post Topic (string)
> > BodyLinks (list of URLs, possibly indexed?)
> >
> >
> > An example query our user would build in the web interface
> > might be, "What
> > are the top 15 keywords for all documents from Feb 1st -
> > April 10th where
> > the author is one of these five people".   We would
> > need to aggregate this
> > data and have it presented in no more than 10 seconds.
> >
> > We're expecting dozens of TB of data, perhaps more...
> >
> >
> > On Thu, Feb 26, 2009 at 1:52 PM, Ryan Rawson
> > <ry...@gmail.com> wrote:
> >
> > > I may have misspoken somewhat - hbase is actually quite good at random
> > > reads.  But the catch is, it can only randomly read via the row id.  It's
> > > more or less akin to having a DB table with only an indexed primary key,
> > > and no secondary indexes.
> > >
> > > So, yes, random reads and "index scans"
> > work, and work well.  You just have
> > > to handle the index creation and maintenance yourself.
> > >
> > > -ryan
> > >
> > > On Thu, Feb 26, 2009 at 12:06 PM, Jonathan Gray
> > <jl...@streamy.com> wrote:
> > >
> > > > Bradford,
> > > >
> > > > Many of us probably have some input but it's
> > really difficult to help
> > > > without having more detail.
> > > >
> > > > Can you be more specific about the layout of the
> > data and the queries
> > > you'd
> > > > want to run?
> > > >
> > > > HBase is efficient at scanning (as with hdfs),
> > but also efficient at
> > > > randomly accessing by row key.  If you need to
> > fetch based on column
> > > names
> > > > or values, then hbase will not be efficient
> > without some form of
> > > secondary
> > > > indexing (additional tables in hbase or something
> > external like lucene).
> > > >
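A minimal sketch of the "additional tables" form of secondary indexing mentioned above, assuming an author lookup over documents. The names are hypothetical and two in-memory TreeMaps stand in for sorted tables; this is not the HBase client API. The index key is the author, a 0x00 separator, then the document id, so all entries for one author are adjacent and can be read with one range scan. It assumes author names contain no 0x00 character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class AuthorIndexSketch {
    private static final char SEP = '\u0000';

    // Primary "table": docId -> author.
    private final TreeMap<String, String> documents = new TreeMap<>();
    // Index "table": author + SEP + docId -> docId.
    private final TreeMap<String, String> authorIndex = new TreeMap<>();

    // Writes the document and keeps the index in step, including
    // removing the stale index entry when a document's author changes.
    void put(String docId, String author) {
        String previous = documents.put(docId, author);
        if (previous != null && !previous.equals(author)) {
            authorIndex.remove(previous + SEP + docId);
        }
        authorIndex.put(author + SEP + docId, docId);
    }

    // Range scan over the index: every key that starts with author + SEP.
    List<String> docIdsByAuthor(String author) {
        String start = author + SEP;          // inclusive
        String stop = author + '\u0001';      // exclusive, first key past the prefix
        return new ArrayList<>(authorIndex.subMap(start, stop).values());
    }
}
```

Index maintenance is entirely the application's job in this sketch: every write to the primary map has to update the index map too, and nothing makes the two updates atomic.
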
> > > > JG
> > > >
> > > > > -----Original Message-----
> > > > > From: Bradford Stephens
> > [mailto:bradfordstephens@gmail.com]
> > > > > Sent: Thursday, February 26, 2009 10:37 AM
> > > > > To: hbase-user@hadoop.apache.org
> > > > > Subject: Re: HBase and Web-Scale BI
> > > > >
> > > > > Yes, it seems that the fundamental
> > 'differentness' of HDFS/MapReduce is
> > > > > that
> > > > > they're not very well suited to random
> > access -- I was hoping HBase had
> > > > > found a way 'around' that, but of
> > course that 'differentness' is a
> > > > > fundamental strength of the HDFS way of
> > doing things.
> > > > >
> > > > > Where things have gotten murky is that our
> > data is very simple -- we
> > > > > just
> > > > > have a lot of it. And we don't need to
> > do a *lot* of random access to
> > > > > our
> > > > > data -- it really doesn't feel like an
> > RDBMS situation.
> > > > >
> > > > > Perhaps if we made an index out of a hash of
> > each of our data values,
> > > > > and
> > > > > did some 'normalization',  that
> > could be the key. Or maybe the metadata
> > > > > is
> > > > > not going to be as large as I thought...
> > hrm.
> > > > >
> > > > > I appreciate the input, and hope more people
> > will chime in :)
> > > > >
> > > > > On Wed, Feb 25, 2009 at 10:18 PM, Ryan
> > Rawson <ry...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hey,
> > > > > >
> > > > > > > You have to be clear about what hbase does and does not do.  HBase is
> > > > > > > just not a relational database - its "weakness" is its strength.
> > > > > >
> > > > > > In general, you can only access rows in
> > key order.  Keys are stored
> > > > > > lexicographically sorted however.
> > There aren't declarative secondary
> > > > > > indexes (minus the lucene thing, but
> > that isn't an index).  You have
> > > > > to put
> > > > > > all these pieces together to build a
> > system.
> > > > > >
> > > > > > But, you get scalability, and
> > reasonable performance, and in 0.20 you
> > > > > will
> > > > > > get really good performance (fast
> > enough to serve websites
> > > > > hopefully).
> > > > > >
> > > > > > In general you need to make sure your
> > row-key sorts data in the order
> > > > > you
> > > > > > want to query by.  You can do something
> > like this:
> > > > > >
> > > > > > > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> > > > > > >
> > > > > > > to store events in reverse chronological order per user.
> > > > > >
> > > > > > If you want another access method, you
> > need to use a map-reduce and
> > > > > build a
> > > > > > secondary index.
> > > > > >
> > > > > > > I don't know if this exactly answers your question, but hopefully
> > > > > > > it should give you more of an idea of what hbase does and does not do.
> > > > > >
> > > > > > -ryan
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Feb 25, 2009 at 9:02 PM,
> > Bradford Stephens <
> > > > > > bradfordstephens@gmail.com> wrote:
> > > > > >
> > > > > > > Greetings,
> > > > > > >
> > > > > > > I'm in charge of the data
> > analysis and collection platform at my
> > > > > company,
> > > > > > > and we're basing a large part
> > of our core analysis platform on
> > > > > Hadoop,
> > > > > > > Nutch, and Lucene -- it's a
> > delight to use. However, we're going to
> > > > > be
> > > > > > > wanting some on-demand
> > "web-scale" business intelligence, and I'm
> > > > > > wondering
> > > > > > > if HBase is the right solution --
> > my research hasn't given me any
> > > > > > > conclusions.
> > > > > > >
> > > > > > > Our data set is pretty simple -- a
> > bunch of XML documents which
> > > > > have been
> > > > > > > parsed from HTML pages, and some
> > associated data (Author Name, Post
> > > > > Date,
> > > > > > > Influence, etc). What we would
> > like to be able to do is have our
> > > > > end
> > > > > > users
> > > > > > > do real-time (< 10 seconds)
> > OLAP-type analysis on this, and have it
> > > > > > > presented on a webpage. For
> > example, queries like ("All authors for
> > > > > the
> > > > > > > past
> > > > > > > two weeks who have used these
> > keywords in the post bodies and what
> > > > > their
> > > > > > > influence score is"). I
> > imagine we'll have several terabytes of
> > > > > data to
> > > > > > go
> > > > > > > through, and we won't be able
> > to do much pre-population of results.
> > > > > > >
> > > > > > > Is HBase low-latency enough that
> > we can scale-out to solve these
> > > > > sorts of
> > > > > > > problems?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Bradford
> > > > > > >
> > > > > >
> > > >
> > > >
> > >
>
>
>
>