Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/04/11 19:42:58 UTC

[Hadoop Wiki] Trivial Update of "Hbase/SecondaryIndexing" by Eugene Koontz

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/SecondaryIndexing" page has been changed by Eugene Koontz.
The comment on this change is: fix item lists.
http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing?action=diff&rev1=4&rev2=5

--------------------------------------------------

  = HBase Secondary Indexing =
- 
  This is a design document around different approaches to secondary indexing in HBase.
  
  == Eventually Consistent Secondary Indexes using Coprocessors ==
- 
  The basic idea is to use an additional (secondary) table for each index on the main (primary) table.  A coprocessor binding to a family would be used to define a given secondary index on that family (or specific column(s) within it).  The WAL would be used to ensure durability, and a shared queue makes the secondary update async from the caller's POV.  Normal HBase timestamps would be used for any conflict resolution and to make operations idempotent.
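One common layout for such a secondary table is a composite row key: the indexed value first (so scans over a value range are contiguous), then the primary row key appended for uniqueness. A minimal JDK-only sketch of that key construction follows; the separator byte and helper name here are hypothetical, not part of any HBase API, and a real scheme would need to escape or fixed-width-encode the value so keys stay unambiguous.

```java
import java.nio.charset.StandardCharsets;

public class IndexKey {
    // Hypothetical separator; a real scheme must escape it (or use
    // fixed-width encoding) so composite keys sort unambiguously.
    private static final byte SEP = 0x00;

    // Secondary-table row key: indexed value first (so scans on the
    // value are contiguous), then the primary row key for uniqueness.
    static byte[] indexRowKey(byte[] value, byte[] primaryRow) {
        byte[] key = new byte[value.length + 1 + primaryRow.length];
        System.arraycopy(value, 0, key, 0, value.length);
        key[value.length] = SEP;
        System.arraycopy(primaryRow, 0, key, value.length + 1, primaryRow.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] k = indexRowKey("smith".getBytes(StandardCharsets.UTF_8),
                               "user123".getBytes(StandardCharsets.UTF_8));
        // Print with the separator made visible.
        System.out.println(new String(k, StandardCharsets.UTF_8).replace('\0', '|'));
    }
}
```

A lookup by indexed value then becomes a prefix scan on the secondary table, yielding the primary row keys to fetch.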
  
  When a Put comes in to the primary table, the following would happen (assuming a single index update to a single secondary table for now):
@@ -22, +20 @@

  
  6. Return to client
  
- 
  The shared queue would be a thread or threadpool that picks up these secondary table edit jobs and applies them using a normal Put operation to the secondary table.
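The queue-and-worker shape described above can be sketched with plain JDK concurrency primitives. This is a simulation only: the `ConcurrentHashMap` stands in for the secondary HBase table, and applying an edit stands in for issuing a normal Put via the client API.

```java
import java.util.Map;
import java.util.concurrent.*;

public class SecondaryUpdater {
    // Simulated secondary table; a real implementation would issue an
    // HBase Put here instead of a map insert.
    static final Map<String, String> secondaryTable = new ConcurrentHashMap<>();
    // Shared queue of pending secondary edits: [indexRowKey, primaryRowKey].
    static final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Worker thread: drains secondary-edit jobs and applies them.
        // Applying by key is idempotent, so replay after failover is safe.
        pool.submit(() -> {
            try {
                while (true) {
                    String[] edit = queue.take();
                    secondaryTable.put(edit[0], edit[1]);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Primary-write path: enqueue the index edit and return immediately,
        // keeping the secondary update async from the caller's POV.
        queue.put(new String[]{"smith\u0000user123", "user123"});
        queue.put(new String[]{"jones\u0000user456", "user456"});

        // Wait for the async worker to drain the queue, then report.
        while (secondaryTable.size() < 2) Thread.sleep(10);
        System.out.println(secondaryTable.size());
        pool.shutdownNow();
    }
}
```

Because the worker applies edits keyed by row, re-enqueueing the same edit during WAL replay simply overwrites the same cell, which is what makes the timestamp-based idempotence mentioned above work.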
  
  On failover of primary table, primary edits would be replayed normally, and secondary edits would be applied to the secondary table/server as is done with the shared queue.
@@ -37, +34 @@

  
  Or we could tie secondary edits to each memstore, so that a memstore can only be flushed once all of its secondary edits have been applied.  This would tie in with the existing semantics around log eviction, but it has other implications and won't really help prevent excessive over-replay.
  
- 
  Other open questions:
  
- * Creation of secondary tables (auto-bootstrapped?  part of coprocessor init?  manual?)
+  * Creation of secondary tables (auto-bootstrapped?  part of coprocessor init?  manual?) 
- * Read API
+  * Read API
  
  Future work:
  
- * Declaration of indexes via API or shell syntax rather than programmatically with a coprocessor-per-index
+  * Declaration of indexes via API or shell syntax rather than programmatically with a coprocessor-per-index
- * Creation of indexes on existing tables (build of indexes based on current data and kept up to date)
+  * Creation of indexes on existing tables (build of indexes based on current data and kept up to date)
- * Option to apply secondary updates in a synchronous fashion (if you want to take the performance hit and have stronger consistency of the index)
+  * Option to apply secondary updates in a synchronous fashion (if you want to take the performance hit and have stronger consistency of the index)
- * Storing of primary table data in secondary table to provide single-lookup denormalized join
+  * Storing of primary table data in secondary table to provide single-lookup denormalized join
  
  == Secondary Indexes using Optimistic Concurrency Control ==
  
@@ -56, +52 @@

  
  Currently this lives here:  https://github.com/hbase-trx/hbase-transactional-tableindexed
  
- 
  == In-memory Secondary Indexes for Indexed Scans ==
- 
  This was implemented once but I'm not sure where it lives anymore.