Posted to commits@cassandra.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/17 00:57:19 UTC

[Cassandra Wiki] Trivial Update of "HadoopSupport" by jeremyhanna

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=9&rev2=10

--------------------------------------------------

  == Overview ==
- Cassandra version 0.6 and later enable certain Hadoop functionality against Cassandra's data store.  Specifically, support has been added for MapReduce and Pig.
+ Cassandra version 0.6 and later enable certain Hadoop functionality against Cassandra's data store.  Specifically, support has been added for !MapReduce and Pig.
  
  == MapReduce ==
- While writing output to Cassandra has always been possible by implementing certain interfaces from the Hadoop library, version 0.6 of Cassandra added support for retrieving data from Cassandra.  Cassandra 0.6 adds implementations of InputSplit, InputFormat, and RecordReader so that Hadoop MapReduce jobs can retrieve data from Cassandra.  For an example of how this works, see the contrib/word_count example in 0.6 or later.  Cassandra rows or row  fragments (that is, pairs of key + `SortedMap`  of columns) are input to Map tasks for  processing by your job, as specified by a `SlicePredicate`  that describes which columns to fetch from each row.
+ While writing output to Cassandra has always been possible by implementing certain interfaces from the Hadoop library, version 0.6 of Cassandra added support for retrieving data from Cassandra.  Cassandra 0.6 adds implementations of !InputSplit, !InputFormat, and !RecordReader so that Hadoop !MapReduce jobs can retrieve data from Cassandra.  For an example of how this works, see the contrib/word_count example in 0.6 or later.  Cassandra rows or row  fragments (that is, pairs of key + `SortedMap`  of columns) are input to Map tasks for  processing by your job, as specified by a `SlicePredicate`  that describes which columns to fetch from each row.
  
  Here's how this looks in the word_count example, which selects just one  configurable columnName from each row:
  
@@ -13, +13 @@

              SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
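To make the row-fragment shape concrete, here is a minimal sketch in plain Java of the kind of input a Map task receives: a row key plus a `SortedMap` of the columns selected by the `SlicePredicate`, from which the job reads its one configured column. This is an illustration only — it uses `String` and `TreeMap` as stand-ins for the real 0.6 API's `byte[]` keys and Thrift `IColumn` values, and `RowFragmentSketch` is a hypothetical name, not part of Cassandra or Hadoop.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: the real 0.6 API hands the mapper a byte[] row key and
// a SortedMap of Thrift IColumn objects; String stands in for both here.
public class RowFragmentSketch {
    public static void main(String[] args) {
        // The single column name the job was configured to fetch,
        // mirroring the SlicePredicate set up in the word_count example.
        String columnName = "text";

        // One input record to the Map task: a row key plus the columns
        // the SlicePredicate selected from that row.
        String rowKey = "doc1";
        SortedMap<String, String> columns = new TreeMap<>();
        columns.put(columnName, "the quick brown fox");

        // A mapper body would then read just that column's value and
        // tokenize it, emit counts, etc.
        String value = columns.get(columnName);
        System.out.println(rowKey + " -> " + value);
    }
}
```

Because the `SlicePredicate` named only one column, each row fragment's map holds at most that column; rows lacking it would simply yield no value.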
- Cassandra's splits are location-aware (this is the nature of the Hadoop InputSplit design).  Cassandra  gives the Hadoop JobTracker a list of locations with each split of data.  That way, the JobTracker can try to preserve data locality when  assigning tasks to TaskTrackers.  Therefore, when using Hadoop alongside  Cassandra, it is best to have a TaskTracker running on the same node as  the Cassandra nodes, if data locality while processing is desired and to  minimize copying data between Cassandra and Hadoop nodes.
+ Cassandra's splits are location-aware (this is the nature of the Hadoop InputSplit design).  Cassandra  gives the Hadoop !JobTracker a list of locations with each split of data.  That way, the !JobTracker can try to preserve data locality when  assigning tasks to !TaskTrackers.  Therefore, when using Hadoop alongside  Cassandra, it is best to have a !TaskTracker running on the same node as  the Cassandra nodes, if data locality while processing is desired and to  minimize copying data between Cassandra and Hadoop nodes.
  
  As of 0.7, Cassandra will include a basic mechanism for outputting data back to Cassandra.  See [[https://issues.apache.org/jira/browse/CASSANDRA-1101|CASSANDRA-1101]] for details.