Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2008/11/10 20:15:23 UTC

[Hadoop Wiki] Update of "Hbase/PoweredBy" by jgray

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by jgray:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

------------------------------------------------------------------------------
  [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All the markup that powers the wiki is stored in HBase. It's been in use for a few months now. !MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's in-house editors produce a lot of revisions per day, which was not working well in an RDBMS. An HBase-based solution for this was built and tested, and the data was migrated out of MySQL and into HBase. Right now it's at something like 6 million items in HBase. The upload tool runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10 minutes to run - and does not slow down production at all. 
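  
  One way to model wiki revisions on top of HBase (purely illustrative, not Mahalo's actual schema) is to keep one row per page and let HBase's built-in cell versioning hold the revision history. In the sketch below the table name "wiki_pages", family "content", and qualifier "markup" are assumptions, and the client API shown comes from later HBase releases than the one current when this entry was written.
  
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    
    public class WikiRevisionSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("wiki_pages"))) {
          // Row key = page title; each put of the same cell creates a new
          // version, provided the "content" family was created with a large
          // enough VERSIONS setting to retain history.
          Put put = new Put(Bytes.toBytes("Some_Page_Title"));
          put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("markup"),
                        Bytes.toBytes("== newest revision of the wiki markup =="));
          table.put(put);
    
          // Fetch the most recent revisions by asking for multiple cell versions.
          Get get = new Get(Bytes.toBytes("Some_Page_Title"));
          get.setMaxVersions(10);
          Result revisions = table.get(get);
          System.out.println(revisions);
        }
      }
    }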
  
  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store raw documents.  We have a ~70-node Hadoop cluster running DFS, MapReduce, and HBase.  In our Wikipedia HBase table, we have one row for each Wikipedia page (~2.5M pages and climbing).  We use this as input to our indexing jobs, which are run in Hadoop MapReduce.  Uploading the entire Wikipedia dump to our cluster takes a couple of hours.  Scanning the table inside MapReduce is very fast -- the latency is in the noise compared to everything else we do.
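  
  As a rough illustration of that setup (not Powerset's actual code), the sketch below runs a map-only MapReduce job whose input is an HBase table, with one map() call per row. The table name "wikipedia" is an assumption, and the org.apache.hadoop.hbase.mapreduce package shown here comes from later HBase releases than the one current at the time of this entry.
  
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
    
    public class WikipediaIndexSketch {
    
      // Receives one HBase row (one Wikipedia page) per map() call.
      static class PageMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
            throws java.io.IOException, InterruptedException {
          // Real indexing work would parse the page content here; this sketch
          // just emits the row key and the number of cells in the row.
          String key = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
          ctx.write(new Text(key), new IntWritable(row.size()));
        }
      }
    
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "wikipedia-index-sketch");
        job.setJarByClass(WikipediaIndexSketch.class);
    
        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in batches for scan throughput
        scan.setCacheBlocks(false);  // don't pollute the region server block cache
    
        TableMapReduceUtil.initTableMapperJob(
            "wikipedia", scan, PageMapper.class, Text.class, IntWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }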
+ 
+ [http://www.streamy.com/ Streamy] is a recently launched real-time social news site.  We use HBase for all of our data storage, query, and analysis needs, replacing an existing SQL-based system.  This includes hundreds of millions of documents, sparse matrices, logs, and everything else once done in the relational system.  We perform significant in-memory caching of query results, similar to a traditional Memcached/SQL setup, and use other external components for joining and sorting.  We also run thousands of daily MapReduce jobs using HBase tables for log analysis, attention data processing, and feed crawling.  HBase has helped us scale and distribute in ways we otherwise could not, and the community has provided consistent and invaluable assistance.
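  
  A minimal sketch of the read-through caching pattern described above, with a plain in-process map standing in for Memcached; the class, table, and column names are hypothetical and not Streamy's actual code.
  
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    
    public class CachedDocumentStore {
      private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
      private final Connection connection;
    
      public CachedDocumentStore(Connection connection) {
        this.connection = connection;
      }
    
      // Serve from the cache when possible, otherwise fall through to HBase
      // and remember the answer for the next caller.
      public byte[] getDocumentBody(String docId) throws java.io.IOException {
        byte[] cached = cache.get(docId);
        if (cached != null) {
          return cached;
        }
        try (Table table = connection.getTable(TableName.valueOf("documents"))) {
          Result result = table.get(new Get(Bytes.toBytes(docId)));
          byte[] body = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"));
          if (body != null) {
            cache.put(docId, body);
          }
          return body;
        }
      }
    }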
  
  [http://www.subrecord.org SubRecord Project] is an open source project that uses HBase as a repository of records (persisted map-like data) for the aspects it provides, such as logging, tracing, and metrics. HBase and a Lucene index together constitute the storage for this platform.