Posted to commits@cassandra.apache.org by "Jeff Hodges (JIRA)" <ji...@apache.org> on 2009/08/17 11:18:14 UTC

[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ] 

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:16 AM:
----------------------------------------------------------------

This patch adds the ability to read Cassandra databases from within
Hadoop. This is the "stupid" version of said support
(cf. the "not-stupid" discussion at
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch only works when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

.h2 Building

Building the patched Cassandra requires `hadoop-0.20.0-core.jar`,
which is distributed with Hadoop 0.20.0, to be on Cassandra's
`$CLASSPATH`. The easiest way to do that is to copy the file into
Cassandra's `lib` directory.

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. To run
`WordCount.java` with Hadoop, first edit Hadoop's
`conf/hadoop-env.sh` file and add all of the jars in Cassandra's
`lib` directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

.h2 Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
  /path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar \
  org.apache.cassandra.examples.WordCount -inputspace Twitter \
  -inputfamily Tweets -outputdir outie \
  -confdir /path/to/cassandra/src/examples/conf/wordcount/
{/code}

.h2 Changes External to `cassandra.hadoop`

This patch makes two changes in the Cassandra project that are outside
of the new `hadoop` package.

 # Makes `StorageProxy.getKeyRange()` public.
 
 # `RowSerializer` is now a public class and lives outside of `Row`.
   This was done so I didn't have to rewrite the serialization code
   when writing the `RowWritable` class.

.h2 Issues

This patch does have some issues. Specifically:

 # Has no tests.

 # Cannot split up the key ranges any finer than the entire key range
   that exists on each individual node. This means we cannot delegate
   to more Map tasks than there are Cassandra nodes. As we move to
   billions of keys per node, this becomes even more of an
   issue. (cf. CASSANDRA-242)

 # Cassandra currently must be booted by this Hadoop-facing code in
   order to work. This is a side effect of needing certain internal
   calls in odd places, and it puts the onus on this project to keep
   everything working internally. There is currently no way to hook
   into an external Cassandra process.

 # Has only been tested, and only works (due to the above boot code
   issues), on one Cassandra node with one Hadoop Map task.

 # Cannot take key ranges that cross over multiple nodes. This is
   a problem with how we (can't) divvy up the keys rather than any
   other problem (such as the ones described in CASSANDRA-348).

 # The current API for selecting what keys to grab cannot take
   anything more than the table/keyspace to search in and the name of
   a top-level super column.

 # Because of the lack of true "multiget" support, reads from the
   database incur a round-trip cost for each key desired.

 # The API for grabbing data from a `RowWritable` is not well fleshed out.

 # `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

 # `RowWritable` does not implement `WritableComparable`, which would
   allow its use as a key and not just a value in a MapReduce job. 

 # `RowWritable` uses `RowSerializer`, which encodes way too much
   information about the column families and columns through
   `ColumnFamilySerializer`.

 # Has a (likely inescapable) dependency on the Hadoop 0.20
   core jar.

 # Really, really has no tests.
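
To make the key-range splitting issue concrete, here is a minimal sketch of what finer-grained splitting could look like. It is not part of the patch. `KeyRangeSplitter` and `KeyRange` are hypothetical names, and the sketch assumes a node's keys are already available as a list sorted in partitioner order, which is exactly what the current code cannot provide cheaply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the patch: cuts one node's sorted keys
// into contiguous sub-ranges so that each could feed its own Map task.
final class KeyRangeSplitter {

    /** Inclusive [startKey, endKey] range of row keys. */
    static final class KeyRange {
        final String startKey;
        final String endKey;

        KeyRange(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + ", " + endKey + "]";
        }
    }

    /**
     * Splits sortedKeys into at most maxSplits contiguous ranges of
     * near-equal size. Assumes the keys are already in partitioner order.
     */
    static List<KeyRange> split(List<String> sortedKeys, int maxSplits) {
        if (maxSplits < 1) {
            throw new IllegalArgumentException("maxSplits must be >= 1");
        }
        List<KeyRange> ranges = new ArrayList<KeyRange>();
        int n = sortedKeys.size();
        if (n == 0) {
            return ranges;
        }
        // Never create more splits than there are keys.
        int splits = Math.min(maxSplits, n);
        for (int i = 0; i < splits; i++) {
            // Integer arithmetic spreads the remainder across the splits.
            int from = (int) ((long) i * n / splits);
            int to = (int) ((long) (i + 1) * n / splits) - 1;
            ranges.add(new KeyRange(sortedKeys.get(from), sortedKeys.get(to)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(split(keys, 2)); // prints [[a, b], [c, e]]
    }
}
```

Each resulting range would correspond to one input split, so the number of Map tasks would no longer be capped by the number of Cassandra nodes.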

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.
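
As one illustration of the `WritableComparable` item above, the following is a rough sketch of the shape such a class takes. `ComparableRowKey` is a hypothetical stand-in, not code from the patch. It uses only JDK types so that it compiles without Hadoop; in a real job the class would instead declare `implements org.apache.hadoop.io.WritableComparable<ComparableRowKey>`, whose `write` and `readFields` methods have the signatures shown. Ordering by the plain string value of the key is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a key-capable RowWritable. Plain JDK types are
// used here so that the sketch compiles without Hadoop on the classpath.
final class ComparableRowKey implements Comparable<ComparableRowKey> {
    private String key = "";

    // Writables are instantiated reflectively, so keep a no-arg constructor.
    ComparableRowKey() {
    }

    ComparableRowKey(String key) {
        this.key = key;
    }

    String getKey() {
        return key;
    }

    // Same signature as Writable.write(DataOutput).
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
    }

    // Same signature as Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
    }

    // Orders rows by key (assumed string order for this sketch).
    @Override
    public int compareTo(ComparableRowKey other) {
        return key.compareTo(other.key);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ComparableRowKey
                && key.equals(((ComparableRowKey) o).key);
    }

    // Hash-based partitioning of map output relies on hashCode().
    @Override
    public int hashCode() {
        return key.hashCode();
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a key through its serialized form.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new ComparableRowKey("tweet-1").write(new DataOutputStream(bytes));

        ComparableRowKey copy = new ComparableRowKey();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getKey()); // prints tweet-1
    }
}
```

A key type needs consistent `compareTo`, `equals`, and `hashCode` because MapReduce sorts and partitions map output by key, which is why a value-only `RowWritable` can skip them.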


  
> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
