Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/08/05 00:42:14 UTC

[jira] Created: (CASSANDRA-342) hadoop integration

hadoop integration
------------------

                 Key: CASSANDRA-342
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Jonathan Ellis


Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799514#action_12799514 ] 

Todd Lipcon commented on CASSANDRA-342:
---------------------------------------

I'd be more worried about lack of memory sandboxing - they could easily OOM (which may end up killing a Cassandra thread rather than the Task), or they could create tons of objects and end up taking over the GC.

I guess it could be *an* option, but it does seem awfully scary.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Vijay (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830805#action_12830805 ] 

Vijay commented on CASSANDRA-342:
---------------------------------

Probably it is only me... I am still not comfortable predicting the exact time when the server state is in sync with the cluster (gossip); it is somewhere around 10 - 30 seconds... There must be a way to predict, or at least double-check, before returning the instance?
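
One way to double-check rather than sleep for a fixed interval is to poll a readiness condition with a timeout. The sketch below is only an illustration: `ReadyWaitSketch`, `awaitReady`, and the readiness predicate are hypothetical names, and the thread does not say what condition the client could actually test.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of any attached patch: polls a caller-supplied
// readiness check until it passes or a timeout expires.
final class ReadyWaitSketch {
    /** Returns true once isReady reports true, false on timeout or interrupt. */
    static boolean awaitReady(BooleanSupplier isReady, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!isReady.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass whatever check it trusts (for example, "the ring view is non-empty", which is itself an assumption) and refuse to hand out the instance when `awaitReady` returns false.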

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.

[ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ]

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:34 AM:
----------------------------------------------------------------

This patch adds the ability to read from Cassandra databases in
a Hadoop setting. This is the "stupid" version of said support
(cf. the "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch only works when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

h2. Building

Building the patched Cassandra requires `hadoop-0.20.0-core.jar`,
which is distributed with Hadoop 0.20.0, on Cassandra's `$CLASSPATH`.
(The easiest way to do that is to copy the file into Cassandra's
`lib` directory.)

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. You can
run `WordCount.java` with Hadoop by first editing the
`conf/hadoop-env.sh` file and adding all of the jars in Cassandra's
lib directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

h2. Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
/path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar \
org.apache.cassandra.examples.WordCount -inputspace Twitter \
-inputfamily Tweets -outputdir outie \
-confdir /path/to/cassandra/src/examples/conf/wordcount/
{code}

h2. Changes External to `cassandra.hadoop`

This patch makes three changes in the Cassandra project that are outside
of the new `hadoop` package.

# Makes `StorageProxy.getKeyRange()` public.

# `RowSerializer` is now a public class and outside of Row. This was
done so I didn't have to rewrite the serialization code for
writing the `RowWritable` class.

# Adds the `examples` ant task for building the jar file of cassandra
example code.

h2. Issues

This patch does have some issues. Specifically:

# Has no tests.

# Cannot split the key ranges any finer than the entire key range
that exists on each individual node. This means we cannot delegate
to more Map tasks than there are Cassandra nodes. As we move to
billions of keys per node, this is even more of an
issue. (cf. CASSANDRA-242)

# Cassandra currently must be booted by this Hadoop-facing code in
order to work, as a side effect of needing certain internal calls in
odd places and of the onus put upon this project to keep everything
working internally. There is currently no way to hook into an
external Cassandra process.

# Has only been tested, and only works (due to the above boot code
issues), on one Cassandra node with one Hadoop Map task.

# Cannot take key ranges that cross over multiple nodes. This is
a problem with how we (can't) divvy up the keys rather than any other
problem (such as the ones described in CASSANDRA-348).

# The current API for selecting what keys to grab cannot take
anything more than the table/keyspace to search in and the name of
a top-level super column.

# Because of the lack of true "multiget" support, the reads from the
database have a round trip cost for each key desired.

# The API is not well fleshed out for grabbing data from a RowWritable.

# Only provides read capability, no writing.

# `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

# `RowWritable` does not implement `WritableComparable`, which would
allow its use as a key and not just a value in a MapReduce job.

# `RowWritable` uses `RowSerializer` which encodes way too much
information about the column families and columns through
`ColumnFamilySerializer`.

# Has a (likely inescapable) dependency on the hadoop 0.20
core jar.

# Really, really has no tests.

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.
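
To make the key-range issue concrete: if each node's range could be cut into sub-ranges, more Map tasks than nodes could be scheduled. The following is a rough sketch only, assuming tokens are plain non-negative integers and that the range does not wrap around the ring. `SubSplitSketch` and `boundaries` are hypothetical names and are not taken from the patch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch: divides one node's token range
// into roughly equal sub-ranges so that several map tasks could share a node.
// Assumes numeric tokens and a range that does not wrap.
final class SubSplitSketch {
    /**
     * Returns parts + 1 non-decreasing boundaries from start to end.
     * Consecutive boundaries delimit one sub-range each; if the range is
     * narrower than parts, some sub-ranges are empty.
     */
    static List<BigInteger> boundaries(BigInteger start, BigInteger end, int parts) {
        if (parts < 1 || start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("need parts >= 1 and start < end");
        }
        BigInteger width = end.subtract(start);
        List<BigInteger> result = new ArrayList<>();
        for (int i = 0; i <= parts; i++) {
            // start + width * i / parts; the last boundary is exactly end.
            BigInteger offset = width.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(parts));
            result.add(start.add(offset));
        }
        return result;
    }
}
```

Equal-width token ranges only give equal-sized splits when keys are spread evenly over the token space, so this says nothing about how an order-preserving layout should be divided.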

was (Author: jmhodges):
This patch adds the ability for Cassandra databases to be read from in
a Hadoop setting. This is the "stupid" version of said support
(c.f. "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch is only working when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

.h2 Building

.h2 Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

.h2 Changes External to `cassandra.hadoop`

This patch makes two changes in the Cassandra project that are outside
of the new `hadoop` package.

# Makes `StorageProxy.getKeyRange()` public.

# `RowSerializer` is now a public class and outside of Row. This was
done so I didn't have to rewrite the serialization code for
writing the `RowWritable` class.

.h2 Issues

This patch does have some issues. Specifically:

# Has no tests.

# Only has been tested and only works (due to the above boot code
issues) on one Cassandra node, with one Hadoop Map task.

# Cannot take key ranges that cross over multiple nodes. This is
a problem with how we (can't) divvy up the keys instead of any other
problem (such as the ones described in CASSANDRA-348).

# The current API for selecting what keys to grab cannot take
anything more than the table/keyspace to search in and the name of
a top-level super column.

# Because of the lack of true "multiget" support, the reads from the
database have a round trip cost for each key desired.

# The API is not well-fleshed out for grabbing data from a RowWritable.

# Only provides read capability, no writing.

# `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

# `RowWritable` does not implement `WritableComparable`, which would
allow its use as a key and not just a value in a MapReduce job.

# `RowWritable` uses `RowSerializer` which encodes way too much
information about the column families and columns through
`ColumnFamilySerializer`.

# Has a (likely inescapable) dependency on the hadoop 0.20
core jar.

# Really, really has no tests.

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.

> hadoop integration
> ------------------
>
> Key: CASSANDRA-342
> URL: https://issues.apache.org/jira/browse/CASSANDRA-342
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-342:
-------------------------------

    Attachment: 0006-prevent-multiple-client-initializations.txt

This patch runs initClient in the JVM that is executing the map task, and adds a check to prevent multiple initializations by the same VM.

Even with this patch, though, the fat client can't connect from multiple JVMs on the same machine: we have the fatClient VM using an address of 127.0.0.2, and multiple JVMs with the same address will fail to join the gossip. The fatClient needs to be refactored to not need to listen on a port (no gossip).
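
For readers who have not opened the attachment, a once-per-JVM guard can be as small as the sketch below. This is an illustration under assumptions, not the code in 0006: `InitOnceSketch` and `initOnce` are hypothetical names, and the `Runnable` stands in for whatever the real initialization call is.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard, not the code from the attached patch: runs an
// initialization step at most once per JVM, however many tasks call it.
final class InitOnceSketch {
    private static final AtomicBoolean initialized = new AtomicBoolean(false);

    /** Runs init only for the first caller; returns true if this call ran it. */
    static boolean initOnce(Runnable init) {
        // compareAndSet flips false -> true for exactly one caller.
        if (initialized.compareAndSet(false, true)) {
            init.run();
            return true;
        }
        return false;
    }
}
```

Note that a caller that loses the race returns immediately, possibly before the winner's initialization has finished. If tasks must block until initialization completes, a synchronized method or a latch would be needed instead.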

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830188#action_12830188 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

replying to myself: 

> if you really need to run more than one hadoop jvm per machine

it looks like the minimum task jvms you can restrict hadoop to is two (one map, one reduce), and i don't think there is a sane way to give them different classpaths.  so we need to add some kind of kludge for this, short term.  (long term, i think running the tasktracker and jobs inside the cassandra jvm still makes the most sense for us.)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: 0002-CASSANDRA-342.-Working-hadoop-support.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.

[ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ]

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:16 AM:
----------------------------------------------------------------