Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/08/12 16:00:02 UTC

[Hadoop Wiki] Update of "PoweredBy" by Netzer

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "PoweredBy" page has been changed by Netzer.
http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev1=214&rev2=215

--------------------------------------------------

    * We use Hadoop and HBase in several areas from social services to structured data storage and processing for internal use.
    * We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes in both production and development. We plan to deploy on an 80-node cluster.
    * We constantly write data to HBase and run MapReduce jobs to process it, then store the results back into HBase or export them to external systems.
+ 
    * Our production cluster has been running since Oct 2008.
  
   * [[http://www.ablegrape.com/|Able Grape]] - Vertical search engine for trustworthy wine information
@@ -35, +36 @@

  
   * [[http://aws.amazon.com/|Amazon Web Services]]
    * We provide [[http://aws.amazon.com/elasticmapreduce|Amazon Elastic MapReduce]]. It's a web service that provides a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
+ 
    * Our customers can instantly provision as much or as little capacity as they like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
  
   * [[http://aol.com/|AOL]]
@@ -43, +45 @@

  
   * [[http://atbrox.com/|Atbrox]]
    * We use Hadoop for information extraction & search, and for data analysis consulting
+ 
    * Cluster: we primarily use Amazon's Elastic MapReduce
  
   * [[http://www.babacar.org/|BabaCar]]
@@ -63, +66 @@

  
   * [[http://www.benipaltechnologies.com|Benipal Technologies]] - Outsourcing, Consulting, Innovation
    * 35 Node Cluster (Core2Quad Q9400 Processor, 4-8 GB RAM, 500 GB HDD)
+ 
    * Largest data node with 2 × Xeon E5420 processors, 64 GB RAM, 3.5 TB HDD
    * Total Cluster capacity of around 20 TB on a gigabit network with failover and redundancy
    * Hadoop is used for internal data crunching, application development, testing and getting around I/O limitations
@@ -70, +74 @@

   * [[http://bixolabs.com/|Bixo Labs]] - Elastic web mining
    * The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
    * We're doing a 200M page/5TB crawl as part of the [[http://bixolabs.com/datasets/public-terabyte-dataset-project/|public terabyte dataset project]].
+ 
    * This runs as a 20 machine [[http://aws.amazon.com/elasticmapreduce/|Elastic MapReduce]] cluster.
  
   * [[http://www.brainpad.co.jp|BrainPad]] - Data mining and analysis
@@ -78, +83 @@

  
   * [[http://www.cascading.org/|Cascading]] - Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing workflows on a Hadoop cluster.
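    * As a rough illustration only (not taken from this wiki page), a word-count flow written against the Cascading 1.x API might be sketched as below; the class name, paths and field names are assumptions for the example:
    {{{
// Minimal word-count sketch assuming the Cascading 1.x API; illustrative only.
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    // Source and sink taps on HDFS; args[0] / args[1] are assumed input/output paths.
    Tap source = new Hfs(new TextLine(new Fields("offset", "line")), args[0]);
    Tap sink   = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

    // Split each line into words, group identical words, and count them.
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count());

    // Plan the workflow into one or more MapReduce jobs and run it.
    Flow flow = new FlowConnector(new Properties()).connect(source, sink, pipe);
    flow.complete();
  }
}
    }}}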
  
+  * [[http://www.skycheck.com/de/|Cheap flights with Skycheck]] - we use a small Hadoop cluster to view and organize our vast database each night.
+ 
   * [[http://www.cloudera.com|Cloudera, Inc]] - Cloudera provides commercial support and professional training for Hadoop.
    * We provide [[http://www.cloudera.com/hadoop|Cloudera's Distribution for Hadoop]]. Stable packages for Red Hat and Ubuntu (RPMs/debs), EC2 images, and web-based configuration.
+ 
    * Check out our [[http://www.cloudera.com/blog|Hadoop and Big Data Blog]]
+ 
    * Get [[http://oreilly.com/catalog/9780596521998/index.html|"Hadoop: The Definitive Guide"]] (Tom White/O'Reilly)
  
   * [[http://www.contextweb.com/|Contextweb]] - ADSDAQ Ad Exchange
@@ -96, +105 @@

  
   * [[http://datagraph.org/|Datagraph]]
    * We use Hadoop for batch-processing large [[http://www.w3.org/RDF/|RDF]] datasets, in particular for indexing RDF data.
+ 
    * We also use Hadoop for executing long-running offline [[http://en.wikipedia.org/wiki/SPARQL|SPARQL]] queries for clients.
+ 
    * We use Amazon S3 and Cassandra to store input RDF datasets and output files.
    * We've developed [[http://rdfgrid.rubyforge.org/|RDFgrid]], a Ruby framework for map/reduce-based processing of RDF data.
+ 
    * We primarily use Ruby, [[http://rdf.rubyforge.org/|RDF.rb]] and RDFgrid to process RDF data with Hadoop Streaming.
+ 
    * We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).
  
   * [[http://www.datameer.com|Datameer]]
@@ -122, +135 @@

   * [[http://www.ebay.com|EBay]]
    * 532-node cluster (8 × 532 cores, 5.3 PB).
    * Heavy usage of Java MapReduce, Pig, Hive, HBase
+ 
    * Using it for Search optimization and Research.
  
   * [[http://www.enormo.com/|Enormo]]
@@ -136, +150 @@

  
   * [[http://www.systems.ethz.ch/education/courses/hs08/map-reduce/|ETH Zurich Systems Group]]
    * We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use-cases from biological data analysis.
+ 
    * Cluster hardware: 16 x (Quad-core Intel Xeon, 8GB RAM, 1.5 TB Hard-Disk)
  
   * [[http://www.eyealike.com/|Eyealike]] - Visual Media Search Platform
@@ -149, +164 @@

     * A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
     * A 300-machine cluster with 2400 cores and about 3 PB raw storage.
     * Each (commodity) node has 8 cores and 12 TB of storage.
+ 
    * We are heavy users of both streaming and the Java APIs. Using these features, we have built a higher-level data warehousing framework called Hive (see http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
  
   * [[http://www.foxaudiencenetwork.com|FOX Audience Network]]
@@ -161, +177 @@

    * 5 machine cluster (8 cores/machine, 5TB/machine storage)
    * Existing 19 virtual machine cluster (2 cores/machine, 30TB storage)
    * Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) using [[http://github.com/trafficbroker/mandy|our Ruby library]], or see the [[http://oobaloo.co.uk/articles/2010/1/12/mapreduce-with-hadoop-and-ruby.html|canonical WordCount example]] (a Java sketch of that job follows this entry).
+ 
    * Daily batch ETL with a slightly modified [[http://github.com/pingles/clojure-hadoop|clojure-hadoop]]
+ 
    * Log analysis
    * Data mining
    * Machine learning
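    * For reference, the canonical WordCount job mentioned above is usually written along the following lines against the Hadoop 0.20 Java MapReduce API (illustrative sketch only; class and variable names are assumptions, not code from this site):
    {{{
// Mapper/reducer pair in the style of the canonical Hadoop WordCount example.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every whitespace-separated token in the input line.
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the per-word counts emitted by the mappers.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
    }}}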
@@ -178, +196 @@

    * 30 machine cluster (4 cores, 1TB~2TB/machine storage)
    * storage for blog data and web documents
    * used for data indexing by MapReduce
+ 
    * link analysis and machine learning by MapReduce
  
   * [[http://gumgum.com|GumGum]]
    * 20+ node cluster (Amazon EC2 c1.medium)
    * Nightly MapReduce jobs on [[http://aws.amazon.com/elasticmapreduce/|Amazon Elastic MapReduce]] process data stored in S3
+ 
    * MapReduce jobs written in [[http://groovy.codehaus.org/|Groovy]] use Hadoop Java APIs
+ 
    * Image and advertising analytics
  
   * [[http://www.hadoop.co.kr/|Hadoop Korean User Group]], a Korean Local Community Team Page.
    * 50 node cluster in the Korea University network environment.
     * Pentium 4 PC, HDFS 4TB Storage
+ 
    * Used for development projects
     * Retrieving and Analyzing Biomedical Knowledge
     * Latent Semantic Analysis, Collaborative Filtering
@@ -210, +232 @@

  
   * [[http://www.ibm.com|IBM]]
    * [[http://www-03.ibm.com/press/us/en/pressrelease/22613.wss|Blue Cloud Computing Clusters]]
+ 
    * [[http://www-03.ibm.com/press/us/en/pressrelease/22414.wss|University Initiative to Address Internet-Scale Computing Challenges]]
  
   * [[http://www.iccs.informatics.ed.ac.uk/|ICCS]]
@@ -229, +252 @@

  
   * [[http://infochimps.org|Infochimps]]
    * 30 node AWS EC2 cluster (varying instance size, currently EBS-backed) managed by Chef & Poolparty running Hadoop 0.20.2+228, Pig 0.5.0+30, Azkaban 0.04, [[http://github.com/infochimps/wukong|Wukong]]
+ 
    * Used for ETL & data analysis on terascale datasets, especially social network data (on [[http://api.infochimps.com|api.infochimps.com]])
  
   * [[http://www.iterend.com/|Iterend]]
@@ -293, +317 @@

    * We use Hadoop to develop MapReduce algorithms:
     * Information retrieval and analytics
     * Machine generated content - documents, text, audio, & video
+ 
     * Natural Language Processing
+ 
    * Project portfolio includes:
     * Natural Language Processing
     * Mobile Social Network Hacking
     * Web Crawlers/Page Scraping
     * Text to Speech
     * Machine generated Audio & Video with remuxing
+ 
     * Automatic PDF creation & IR
+ 
    * 2 node cluster (Windows Vista/CYGWIN, & CentOS) for developing MapReduce programs.
  
   * [[http://www.mylife.com/|MyLife]]
@@ -318, +346 @@

  
   * [[http://www.netseer.com|NetSeer]] -
    * Up to 1000 instances on [[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon EC2]]
+ 
    * Data storage in [[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon S3]]
+ 
    * 50 node cluster in a colocation facility
    * Used for crawling, processing, serving and log analysis
  
   * [[http://nytimes.com|The New York Times]]
    * [[http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/|Large scale image conversions]]
+ 
    * Used EC2 to run Hadoop on a large virtual cluster
  
   * [[http://www.ning.com|Ning]]
    * We use Hadoop to store and process our log files
    * We rely on Apache Pig for reporting, analytics, Cascading for machine learning, and on a proprietary JavaScript API for ad-hoc queries
+ 
    * We use commodity hardware, with 8 cores and 16 GB of RAM per machine
  
   * [[http://lucene.apache.org/nutch|Nutch]] - flexible web search engine software
@@ -350, +382 @@

  
   * [[http://www.powerset.com|Powerset / Microsoft]] - Natural Language Search
    * up to 400 instances on [[http://www.amazon.com/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA|Amazon EC2]]
+ 
    * data storage in [[http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2/002-1156069-5604805?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA|Amazon S3]]
+ 
    * Microsoft is now contributing to HBase, a Hadoop subproject ( [[http://port25.technet.com/archive/2008/10/14/microsoft-s-powerset-team-resumes-hbase-contributions.aspx|announcement]]).
  
   * [[http://pressflip.com|Pressflip]] - Personalized Persistent Search
@@ -397, +431 @@

  
   * [[http://www.slcsecurity.com/|SLC Security Services LLC]]
    * 18 node cluster (each node has: 4 dual core CPUs, 1TB storage, 4GB RAM, RedHat OS)
+ 
    * We use Hadoop for our high speed data mining applications
  
   * [[http://www.socialmedia.com/|Socialmedia.com]]
@@ -405, +440 @@

  
   * [[http://www.spadac.com/|Spadac.com]]
    * We are developing the MrGeo (Map/Reduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.
+ 
    * We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.
+ 
    * MrGeo is soon to be open sourced as well.
  
   * [[http://stampedehost.com/|Stampede Data Solutions (Stampedehost.com)]]
@@ -433, +470 @@

   * [[http://www.twitter.com|Twitter]]
    * We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.
    * We use both Scala and Java to access Hadoop's MapReduce APIs
+ 
    * We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
    * We employ committers on Pig, Avro, Hive, and Cassandra, and contribute much of our internal Hadoop work to open source (see [[http://github.com/kevinweil/hadoop-lzo|hadoop-lzo]])
+ 
    * For more on our use of hadoop, see the following presentations: [[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009|Hadoop and Pig at Twitter]] and [[http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter|Protocol Buffers and Hadoop at Twitter]]
  
   * [[http://tynt.com|Tynt]]
@@ -459, +498 @@

  
   * [[http://www.vksolutions.com/|VK Solutions]]
    * We use a small Hadoop cluster in the scope of our general research activities at [[http://www.vklabs.com|VK Labs]] to get faster data access from web applications.
+ 
    * We also use Hadoop for filtering and indexing listings, for log analysis, and for recommendation data.
  
   * [[http://www.worldlingo.com/|WorldLingo]]
@@ -471, +511 @@

  
   * [[http://www.yahoo.com/|Yahoo!]]
    * More than 100,000 CPUs in >36,000 computers running Hadoop
+ 
    * Our biggest cluster: 4000 nodes (2×4-CPU boxes with 4×1TB disks & 16GB RAM)
     * Used to support research for Ad Systems and Web Search
     * Also used to do scaling tests to support development of Hadoop on larger clusters
+ 
    * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about how we use Hadoop.
+ 
    * >60% of Hadoop Jobs within Yahoo are Pig jobs.
  
   * [[http://www.zvents.com/|Zvents]]