Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/04/11 19:02:42 UTC

Hbase/PoweredBy reverted to revision 66 on Hadoop Wiki

Dear wiki user,

You have subscribed to a wiki page "Hadoop Wiki" for change notification.

The page Hbase/PoweredBy has been reverted to revision 66 by OtisGospodnetic.
The comment on this change is: This is not PoweredBy material.
http://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=67&rev2=68

--------------------------------------------------

  
  [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; it uses HBase to store more than a billion URLs and Outlinks (!AnchorText + LinkedURL). It was initially designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping' scenario) moved to SOLR + MySQL (InnoDB) (tens of thousands of queries per second), and now to HBase. HBase is significantly faster due to: no need for huge transaction logs, a column-oriented design that exactly matches the 'lazy' business logic, data compression, and !MapReduce support. The number of mutable 'indexes' (a term from RDBMS) is significantly reduced because each 'row::column' structure is physically sorted by 'row'. The MySQL InnoDB engine is the best DB choice for highly concurrent updates. However, the need to flush a block of data to the hard drive even when only a few bytes have changed is an obvious bottleneck. HBase greatly helps: the 'delete-insert', 'mutable primary key', and 'natural primary key' patterns, not so popular in modern DBMSs, become a big advantage with HBase.
  
- [[http://treasuryofideas.com/|Treasury of Ideas]] consults on HBase, Hadoop, and NoSQL technologies.
- 
  [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud-scale storage for a variety of applications. We have been developing with HBase since version 0.1 and running it in production since version 0.20.0.
  
  [[http://www.twitter.com|Twitter]] runs HBase across its entire Hadoop cluster. HBase provides a distributed, read/write backup of all MySQL tables in Twitter's production backend, allowing engineers to run MapReduce jobs over the data while maintaining the ability to apply periodic row updates (something that is more difficult to do with vanilla HDFS). A number of applications, including people search, rely on HBase internally for data generation. Additionally, the operations team uses HBase as a time-series database for cluster-wide monitoring and performance data.