You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Eric Czech <ec...@gmail.com> on 2012/07/12 07:26:13 UTC

Index building process design

Hi everyone,

I have a general design question (apologies in advanced if this has
been asked before).

I'd like to build indexes off of a raw data store and I'm trying to
think of the best way to control processing so some part of my cluster
can still serve reads and writes without being affected heavily by the
index building process.

I get the sense that the typical process for this involves something
like the following:

1.  Dedicate one cluster for index building (let's call it the INDEX
cluster) and one for serving application reads on the indexes as well
as writes/reads on the raw data set (let's call it the MAIN cluster).
2.  Have the raw data set replicated from the MAIN cluster to the INDEX cluster.
3.  On the INDEX cluster, use the replicated raw data to constantly
rebuild indexes and copy the new versions to the MAIN cluster,
overwriting the old versions if necessary.

While conceptually simple, I can't help but wonder if it doesn't make
more sense to simply switch application reads / writes from one
cluster to another based on which one is NOT currently building
indexes (but still have the raw data set replicate master-master
between them).

To be more clear, I'm proposing doing this:

1.  Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
raw data set replicated master-master between them.
2.  if CLUSTER_1 is currently rebuilding indexes, redirect all
application traffic to CLUSTER_2 including reads from the indexes as
well as writes to the raw data set (and vise-versa).

I know I'm not addressing a lot of details here but I'm just curious
if anyone has ever implemented something along these lines.

The main advantage to what I'm proposing would be not having to copy
potentially massive indexes across the network but at the cost of
having to deal with having clients not always read from the same
cluster (seems doable though).

Any advice would be much appreciated!

Thanks

Re: Index building process design

Posted by Michael Segel <mi...@hotmail.com>.

Ok, I'll take a stab at the shorter one. :-)

You can create a base data table which contains your raw data. 
Depending on your index... like an inverted table, you can run a map/reduce job that builds up a second table.  And a third, a fourth... depending on how many inverted indexes you want. 

When you want to find a data set based on a known value in the index, you can scan the index table, and the result set will contain a list of keys for the data in the base table. 

Now you can then just fetch those rows from HBase.
If you are using multiple indexes, you just take the intersection of the result set(s) and now you have the end data set to fetch.

Not sure why you would want a second cluster. Could you expand on your use case?

On Jul 23, 2012, at 3:06 PM, Eric Czech wrote:

> Hmm, maybe that was too long -- I'll keep this one shorter I swear:
> 
> Would it make sense to build indexes with two Hadoop/Hbase clusters by
> simply pointing client traffic at the cluster that is currently NOT
> building indexes via M/R jobs?  Basically, has anyone ever tried switching
> back and forth between clusters instead of building indexes on one cluster
> and copying them to another?
> 
> 
> On Thu, Jul 12, 2012 at 1:26 AM, Eric Czech <ec...@gmail.com> wrote:
> 
>> Hi everyone,
>> 
>> I have a general design question (apologies in advanced if this has
>> been asked before).
>> 
>> I'd like to build indexes off of a raw data store and I'm trying to
>> think of the best way to control processing so some part of my cluster
>> can still serve reads and writes without being affected heavily by the
>> index building process.
>> 
>> I get the sense that the typical process for this involves something
>> like the following:
>> 
>> 1.  Dedicate one cluster for index building (let's call it the INDEX
>> cluster) and one for serving application reads on the indexes as well
>> as writes/reads on the raw data set (let's call it the MAIN cluster).
>> 2.  Have the raw data set replicated from the MAIN cluster to the INDEX
>> cluster.
>> 3.  On the INDEX cluster, use the replicated raw data to constantly
>> rebuild indexes and copy the new versions to the MAIN cluster,
>> overwriting the old versions if necessary.
>> 
>> While conceptually simple, I can't help but wonder if it doesn't make
>> more sense to simply switch application reads / writes from one
>> cluster to another based on which one is NOT currently building
>> indexes (but still have the raw data set replicate master-master
>> between them).
>> 
>> To be more clear, I'm proposing doing this:
>> 
>> 1.  Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
>> raw data set replicated master-master between them.
>> 2.  if CLUSTER_1 is currently rebuilding indexes, redirect all
>> application traffic to CLUSTER_2 including reads from the indexes as
>> well as writes to the raw data set (and vise-versa).
>> 
>> I know I'm not addressing a lot of details here but I'm just curious
>> if anyone has ever implemented something along these lines.
>> 
>> The main advantage to what I'm proposing would be not having to copy
>> potentially massive indexes across the network but at the cost of
>> having to deal with having clients not always read from the same
>> cluster (seems doable though).
>> 
>> Any advice would be much appreciated!
>> 
>> Thanks
>>

Re: Index building process design

Posted by Eric Czech <er...@nextbigsound.com>.

Hmm, maybe that was too long -- I'll keep this one shorter I swear:

Would it make sense to build indexes with two Hadoop/Hbase clusters by
simply pointing client traffic at the cluster that is currently NOT
building indexes via M/R jobs?  Basically, has anyone ever tried switching
back and forth between clusters instead of building indexes on one cluster
and copying them to another?


On Thu, Jul 12, 2012 at 1:26 AM, Eric Czech <ec...@gmail.com> wrote:

> Hi everyone,
>
> I have a general design question (apologies in advanced if this has
> been asked before).
>
> I'd like to build indexes off of a raw data store and I'm trying to
> think of the best way to control processing so some part of my cluster
> can still serve reads and writes without being affected heavily by the
> index building process.
>
> I get the sense that the typical process for this involves something
> like the following:
>
> 1.  Dedicate one cluster for index building (let's call it the INDEX
> cluster) and one for serving application reads on the indexes as well
> as writes/reads on the raw data set (let's call it the MAIN cluster).
> 2.  Have the raw data set replicated from the MAIN cluster to the INDEX
> cluster.
> 3.  On the INDEX cluster, use the replicated raw data to constantly
> rebuild indexes and copy the new versions to the MAIN cluster,
> overwriting the old versions if necessary.
>
> While conceptually simple, I can't help but wonder if it doesn't make
> more sense to simply switch application reads / writes from one
> cluster to another based on which one is NOT currently building
> indexes (but still have the raw data set replicate master-master
> between them).
>
> To be more clear, I'm proposing doing this:
>
> 1.  Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> raw data set replicated master-master between them.
> 2.  if CLUSTER_1 is currently rebuilding indexes, redirect all
> application traffic to CLUSTER_2 including reads from the indexes as
> well as writes to the raw data set (and vise-versa).
>
> I know I'm not addressing a lot of details here but I'm just curious
> if anyone has ever implemented something along these lines.
>
> The main advantage to what I'm proposing would be not having to copy
> potentially massive indexes across the network but at the cost of
> having to deal with having clients not always read from the same
> cluster (seems doable though).
>
> Any advice would be much appreciated!
>
> Thanks
>

Re: Index building process design

Posted by Eric Czech <er...@nextbigsound.com>.

Thank you both for the response!

Michael, I'll elaborate on the use case in response to Amandeep's questions
but I'm pretty clear on what you mean with regards to using inverted
indexes built from a base table.

Amandeep, I think I can answer all of your questions with a better
explanation of what I'm trying to do.

Essentially, I need to serve up low latency, random reads on a
*nearly*immutable data set while still allowing for offline analytics
and very high
write throughput.  It's not exactly a groundbreaking use case, I know, but
the biggest problem with my existing design (that doesn't use HBase) is
that its current, synchronous indexing process is very difficult to
refactor or manipulate.  It might seem odd that I would need to refactor
indexes frequently but the reality is that a big part of the challenge I'm
facing is standardizing data from many different sources and while those
standardization processes are generally pretty accurate, there are still
some critical errors that occur (and they need to be fixed quickly).

For example, we record transactions from Spotify and iTunes and if we get a
record from Spotify saying that a certain song was streamed from "Soho,
NYC" and then we get a record from iTunes saying that the same song was
downloaded in "Manhattan", we need to index both records using the fact
that they were associated with the state of NY.  If, however, we later find
that "Manhattan" actually meant Manhattan, Kansas (a real city I swear)
then the index needs to be updated to reflect that new understanding.  This
type of problem occurs very frequently and running targeted, one-off
mapreduce jobs to fix it is quickly becoming infeasible.

To answer your question then Amandeep, I'm asking mostly about how to build
asynchronous indexes and serve reads on them all with HBase (no
Lucene/Solr/Elastic Search).  I'd like to rebuild these consistently to
include changes like the one I mentioned and I'd also like to support high
write throughput to the raw data set and serve low latency, random reads on
the indexes from the same cluster.  This is why I thought an alternating
cluster might make sense to support Hadoop and a random read/write workload
at the same time.

I've also thought about adding a much smaller, synchronous data store with
a configuration heavily optimized for random reads and writes to help
bridge the time gap between when new data would be added to the system and
when it would be used within a full index rebuild.  I'm less concerned with
the particulars here but I mention it because I'm trying to emulate, with
HBase, the design philosophy here:
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html.  I think it
suits our use case well and I'm just trying to decide to what extent HBase
could be utilized within a system of the like.

Thank you again for the responses and I'm looking forward to hearing how
others are solving problems like this.

On Mon, Jul 23, 2012 at 9:15 PM, Amandeep Khurana <am...@gmail.com> wrote:

> You have an interesting question. I haven't come across anyone trying this
> approach yet. Replication is a relatively new feature so that might be one
> reason. But I think you are over engineering the solution here. I'll try to
> break it down to the best of my abilities.
>
> 1. First question to answer is - where are the indices going to be stored?
> Are they going to be other HBase tables (something like secondary indices)?
> Or are they going to be in an independent indexing system like
> Lucene/Solr/Elastic Search/Blurr? The number of fields you want to index on
> will determine that answer IMO.
>
> 2. Does the index building happen synchronously as the writes come in or
> is it an asynchronous process? The answer to this question will depend on
> the expected throughput and the cluster size. Also, the answer to this
> question will determine your indexing process.
>
> a) If you are going to build indices synchronously, you have two options.
> First being - get your client to write to HBase as well as create the
> index. Second option - use coprocessors. Coprocs are also a new feature and
> I'm not certain yet that they'll solve your problem. Try them out though,
> they are pretty neat.
>
> b) If you are going to build indices asynchronously, you'll likely be
> using MR for it. MR can run on the same cluster but if you pound your
> cluster with full table scans, your latencies for real time serving will
> get affected. Now, if you have relatively low throughput for data ingests,
> you might be able to get away with running fewer tasks in your MR job. So,
> data serving plus index creation can happen on the same cluster. If your
> throughput is high and you need MR jobs to run full force, separate the
> clusters out. Have a cluster to run the MR jobs and a separate cluster to
> serve the data. My approach in the second case would be to dump new data
> into HDFS directly as flat files, run MR over them to create the index and
> also put them into HBase for serving.
>
> Hope that gives you some more ideas to think about.
>
> -ak
>
>
> On Wednesday, July 11, 2012 at 10:26 PM, Eric Czech wrote:
>
> > Hi everyone,
> >
> > I have a general design question (apologies in advanced if this has
> > been asked before).
> >
> > I'd like to build indexes off of a raw data store and I'm trying to
> > think of the best way to control processing so some part of my cluster
> > can still serve reads and writes without being affected heavily by the
> > index building process.
> >
> > I get the sense that the typical process for this involves something
> > like the following:
> >
> > 1. Dedicate one cluster for index building (let's call it the INDEX
> > cluster) and one for serving application reads on the indexes as well
> > as writes/reads on the raw data set (let's call it the MAIN cluster).
> > 2. Have the raw data set replicated from the MAIN cluster to the INDEX
> cluster.
> > 3. On the INDEX cluster, use the replicated raw data to constantly
> > rebuild indexes and copy the new versions to the MAIN cluster,
> > overwriting the old versions if necessary.
> >
> > While conceptually simple, I can't help but wonder if it doesn't make
> > more sense to simply switch application reads / writes from one
> > cluster to another based on which one is NOT currently building
> > indexes (but still have the raw data set replicate master-master
> > between them).
> >
> > To be more clear, I'm proposing doing this:
> >
> > 1. Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> > raw data set replicated master-master between them.
> > 2. if CLUSTER_1 is currently rebuilding indexes, redirect all
> > application traffic to CLUSTER_2 including reads from the indexes as
> > well as writes to the raw data set (and vise-versa).
> >
> > I know I'm not addressing a lot of details here but I'm just curious
> > if anyone has ever implemented something along these lines.
> >
> > The main advantage to what I'm proposing would be not having to copy
> > potentially massive indexes across the network but at the cost of
> > having to deal with having clients not always read from the same
> > cluster (seems doable though).
> >
> > Any advice would be much appreciated!
> >
> > Thanks
>
>

Re: Index building process design

Posted by Amandeep Khurana <am...@gmail.com>.

You have an interesting question. I haven't come across anyone trying this approach yet. Replication is a relatively new feature so that might be one reason. But I think you are over engineering the solution here. I'll try to break it down to the best of my abilities.

1. First question to answer is - where are the indices going to be stored? Are they going to be other HBase tables (something like secondary indices)? Or are they going to be in an independent indexing system like Lucene/Solr/Elastic Search/Blurr? The number of fields you want to index on will determine that answer IMO.

2. Does the index building happen synchronously as the writes come in or is it an asynchronous process? The answer to this question will depend on the expected throughput and the cluster size. Also, the answer to this question will determine your indexing process.

a) If you are going to build indices synchronously, you have two options. First being - get your client to write to HBase as well as create the index. Second option - use coprocessors. Coprocs are also a new feature and I'm not certain yet that they'll solve your problem. Try them out though, they are pretty neat.

b) If you are going to build indices asynchronously, you'll likely be using MR for it. MR can run on the same cluster but if you pound your cluster with full table scans, your latencies for real time serving will get affected. Now, if you have relatively low throughput for data ingests, you might be able to get away with running fewer tasks in your MR job. So, data serving plus index creation can happen on the same cluster. If your throughput is high and you need MR jobs to run full force, separate the clusters out. Have a cluster to run the MR jobs and a separate cluster to serve the data. My approach in the second case would be to dump new data into HDFS directly as flat files, run MR over them to create the index and also put them into HBase for serving.

Hope that gives you some more ideas to think about.

-ak 

On Wednesday, July 11, 2012 at 10:26 PM, Eric Czech wrote:

> Hi everyone,
> 
> I have a general design question (apologies in advanced if this has
> been asked before).
> 
> I'd like to build indexes off of a raw data store and I'm trying to
> think of the best way to control processing so some part of my cluster
> can still serve reads and writes without being affected heavily by the
> index building process.
> 
> I get the sense that the typical process for this involves something
> like the following:
> 
> 1. Dedicate one cluster for index building (let's call it the INDEX
> cluster) and one for serving application reads on the indexes as well
> as writes/reads on the raw data set (let's call it the MAIN cluster).
> 2. Have the raw data set replicated from the MAIN cluster to the INDEX cluster.
> 3. On the INDEX cluster, use the replicated raw data to constantly
> rebuild indexes and copy the new versions to the MAIN cluster,
> overwriting the old versions if necessary.
> 
> While conceptually simple, I can't help but wonder if it doesn't make
> more sense to simply switch application reads / writes from one
> cluster to another based on which one is NOT currently building
> indexes (but still have the raw data set replicate master-master
> between them).
> 
> To be more clear, I'm proposing doing this:
> 
> 1. Have two clusters, call them CLUSTER_1 and CLUSTER_2, and have the
> raw data set replicated master-master between them.
> 2. if CLUSTER_1 is currently rebuilding indexes, redirect all
> application traffic to CLUSTER_2 including reads from the indexes as
> well as writes to the raw data set (and vise-versa).
> 
> I know I'm not addressing a lot of details here but I'm just curious
> if anyone has ever implemented something along these lines.
> 
> The main advantage to what I'm proposing would be not having to copy
> potentially massive indexes across the network but at the cost of
> having to deal with having clients not always read from the same
> cluster (seems doable though).
> 
> Any advice would be much appreciated!
> 
> Thanks