Posted to commits@cassandra.apache.org by Apache Wiki <wi...@apache.org> on 2011/09/26 19:43:23 UTC

[Cassandra Wiki] Update of "HadoopSupport" by jeremyhanna

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=39&rev2=40

Comment:
Adding more troubleshooting information and a caveat to OSS Brisk in the main description

  == Overview ==
  Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store.  Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
  
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) 
+ [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code will no longer be maintained by DataStax; future development of Brisk is now part of a pay-for offering.
  
  [[#Top|Top]]
  
@@ -92, +92 @@

  * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data.  This is specified either in your Hadoop configuration or via `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
  * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`).  The rpc timeout is not for client timeouts but for timeouts between nodes; it can be increased to reduce the chance of timing out.
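
  For example, the batch size can be lowered in the job's Hadoop configuration (a sketch; the value of 1024 is illustrative, not a recommendation - tune it to your data):
  {{{
  <property>
    <name>cassandra.range.batch.size</name>
    <value>1024</value>
  </property>
  }}}
  The same can be done programmatically with `ConfigHelper.setRangeBatchSize(conf, 1024)` before submitting the job.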
  
+ If you still see timeout exceptions, with resultant failed jobs and/or blacklisted tasktrackers, there are settings that give Cassandra more latitude before the jobs fail.  Example usage (in either the job configuration or the tasktracker's mapred-site.xml):
+ {{{
+ <property>
+   <name>mapred.max.tracker.failures</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.map.max.attempts</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.reduce.max.attempts</name>
+   <value>20</value>
+ </property>
+ }}}
+ These settings normally default to 4 each, but some find that too conservative.  If you set them too low, occasional timeout exceptions can blacklist tasktrackers and fail jobs.  If you set them too high, jobs that would otherwise fail quickly instead take a long time to fail, sacrificing efficiency.  Keep in mind that raising these values can simply mask an underlying problem.  It may be that you always want these settings to be higher when operating against Cassandra; however, if you run into these exceptions frequently, there may be a problem with your Cassandra or Hadoop configuration.
+ 
  If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at.  The two relevant properties are:
   * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
   * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
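
  For example, to read and write at quorum instead, set the following in the job configuration (a sketch; any valid !ConsistencyLevel name can be used as the value):
  {{{
  <property>
    <name>cassandra.consistencylevel.read</name>
    <value>QUORUM</value>
  </property>
  <property>
    <name>cassandra.consistencylevel.write</name>
    <value>QUORUM</value>
  </property>
  }}}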