You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@cassandra.apache.org by "Jeff Hodges (JIRA)" <ji...@apache.org> on 2009/08/17 11:18:14 UTC

[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

[ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ]

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:16 AM:
----------------------------------------------------------------

This patch adds the ability for Cassandra databases to be read from in
a Hadoop setting. This is the "stupid" version of said support
(c.f. "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch is only working when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

.h2 Building

Building the patched Cassandra requires including
`hadoop-0.20.0-core.jar` that is distributed with Hadoop 0.20.0
(obviously) in Cassandra's `$CLASSPATH`. (Which is easiest to do by
simply copying the file to Cassandra's `lib` directory.)

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. You can
run `WordCount.java` with Hadoop by first editing the
`conf/hadoop-env.sh` file and adding all of the jars in Cassandra's
lib directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

.h2 Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
/path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar\
org.apache.cassandra.examples.WordCount -inputspace Twitter \
-inputfamily Tweets -outputdir outie \
-confdir /path/to/cassandra/src/examples/conf/wordcount/
{/code}

.h2 Changes External to `cassandra.hadoop`

This patch makes two changes in the Cassandra project that are outside
of the new `hadoop` package.

# Makes `StorageProxy.getKeyRange()` public.

# `RowSerializer` is now a public class and outside of Row. This was
done so I didn't have to rewrite the serialization code for
writing the `RowWritable` class.

.h2 Issues

This patch does have some issues. Specifically:

# Has no tests.

# Cannot split up the key ranges beyond what the entire key range
that exists on each individual node. This means we cannot delegate
to more Map tasks than there are Cassandra nodes. As we move to
billions of keys per node, this is even more of an
issue. (c.f. CASSANDRA-242)

# Cassandra currently must be booted by this Hadoop-facing code in
order to work as a side effect of needing certain internal calls in
odd places and the onus put upon this project to keep everything
working internally. There is currently no way to hook into an
external Cassandra process.

# Only has been tested and only works (due to the above boot code
issues) on one Cassandra node, with one Hadoop Map task.

# Cannot take key ranges that cross over multiple nodes. This is
a problem with how we (can't) divvy up the keys instead of any other
problem (such as the ones described in CASSANDRA-348).

# The current API for selecting what keys to grab cannot take
anything more than the table/keyspace to search in and the name of
a top-level super column.

# Because of the lack of true "multiget" support, the reads from the
database have a round trip cost for each key desired.

# The API is not well-fleshed out for grabbing data from a RowWritable.

# `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

# `RowWritable` does not implement `WritableComparable`, which would
allow its use as a key and not just a value in a MapReduce job.

# `RowWritable` uses `RowSerializer` which encodes way too much
information about the column families and columns through
`ColumnFamilySerializer`.

# Has a (likely inescapable) dependency on the hadoop 0.20
core jar.

# Really, really has no tests.

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.

was (Author: jmhodges):
This patch adds the ability for Cassandra databases to be read from in
a Hadoop setting. This is the "stupid" version of said support
(c.f. "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).