Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/08/05 00:42:14 UTC

[jira] Created: (CASSANDRA-342) hadoop integration

hadoop integration
------------------

                 Key: CASSANDRA-342
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Jonathan Ellis


Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799514#action_12799514 ] 

Todd Lipcon commented on CASSANDRA-342:
---------------------------------------

I'd be more worried about lack of memory sandboxing - they could easily OOM (which may end up killing a Cassandra thread rather than the Task), or they could create tons of objects and end up taking over the GC.

I guess it could be *an* option, but it does seem awfully scary.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Vijay (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830805#action_12830805 ] 

Vijay commented on CASSANDRA-342:
---------------------------------

Probably it is only me... I am still not comfortable predicting the exact time when the server state is in sync with the cluster (gossip); it is somewhere around 10 - 30 seconds... There must be a way to predict or at least double-check before returning the instance?
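
One way to double-check instead of sleeping for a fixed interval is to poll a readiness condition until a deadline. The following is only a sketch: `ReadinessGate` and `awaitReady` are hypothetical names, and the readiness check itself is left as a caller-supplied predicate because the thread does not say what "in sync" should test.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; none of these names come from Cassandra or Hadoop.
final class ReadinessGate {
    private ReadinessGate() {}

    /**
     * Polls `ready` every `pollMillis` until it returns true or
     * `timeoutMillis` has elapsed. Returns true only if the condition
     * was observed to hold; returns false on timeout or interruption.
     */
    static boolean awaitReady(BooleanSupplier ready, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Never sleep past the deadline, and sleep at least 1 ms.
            long sleepMillis = Math.min(pollMillis, Math.max(1L, remainingNanos / 1_000_000L));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A caller would pass whatever check it trusts (for example, "the ring view is non-empty") and refuse to return the instance if `awaitReady` comes back false.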

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ] 

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:34 AM:
----------------------------------------------------------------

This patch adds the ability for Cassandra databases to be read from in
a Hadoop setting. This is the "stupid" version of said support
(c.f. "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch only works when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

.h2 Building

Building the patched Cassandra requires including
`hadoop-0.20.0-core.jar` that is distributed with Hadoop 0.20.0
(obviously) in Cassandra's `$CLASSPATH`. (Which is easiest to do by
simply copying the file to Cassandra's `lib` directory.)

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. You can
run `WordCount.java` with Hadoop by first editing the
`conf/hadoop-env.sh` file and adding all of the jars in Cassandra's
lib directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

.h2 Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
  /path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar \
  org.apache.cassandra.examples.WordCount -inputspace Twitter \
  -inputfamily Tweets -outputdir outie \
  -confdir /path/to/cassandra/src/examples/conf/wordcount/
{/code}

.h2 Changes External to `cassandra.hadoop`

This patch makes three changes in the Cassandra project that are outside
of the new `hadoop` package.

 # Makes `StorageProxy.getKeyRange()` public.
 
 # `RowSerializer` is now a public class and outside of Row. This was
    done so I didn't have to rewrite the serialization code for
    writing the `RowWritable` class.

 # Adds the `examples` ant task for building the jar file of cassandra
   example code.

.h2 Issues

This patch does have some issues. Specifically:

 # Has no tests.

 # Cannot split the key ranges any finer than the entire key range
   that exists on each individual node. This means we cannot delegate
   to more Map tasks than there are Cassandra nodes. As we move to
   billions of keys per node, this becomes even more of an
   issue. (cf. CASSANDRA-242)

 # Cassandra currently must be booted by this Hadoop-facing code in
   order to work. This is a side effect of needing certain internal
   calls in odd places, and of the onus put upon this project to keep
   everything working internally. There is currently no way to hook
   into an external Cassandra process.

 # Has only been tested, and only works (due to the above boot code
   issues), on one Cassandra node with one Hadoop Map task.

 # Cannot take key ranges that cross over multiple nodes. This is
   a problem with how we (can't) divvy up the keys rather than any other
   problem (such as the ones described in CASSANDRA-348).

 # The current API for selecting what keys to grab cannot take
   anything more than the table/keyspace to search in and the name of
   a top-level super column.

 # Because of the lack of true "multiget" support, the reads from the
   database have a round trip cost for each key desired.

 # The API for grabbing data from a RowWritable is not well fleshed out.

 # Only provides read capability, no writing.

 # `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

 # `RowWritable` does not implement `WritableComparable`, which would
   allow its use as a key and not just a value in a MapReduce job. 

 # `RowWritable` uses `RowSerializer` which encodes way too much
   information about the column families and columns through
   `ColumnFamilySerializer`.

 # Has a (likely inescapable) dependency on the hadoop 0.20
   core jar.

 # Really, really has no tests.

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.
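
To illustrate the `WritableComparable` item above: a MapReduce key has to be both serializable and orderable. The sketch below is an assumption-laden stand-in, not the patch's `RowWritable`. `WritableComparableLike` is a local interface that mirrors the `write(DataOutput)` / `readFields(DataInput)` pair of Hadoop's `Writable` plus `Comparable`, declared here only so the example compiles without the Hadoop jar. `RowKeySketch` is hypothetical, and ordering by the key string is an assumed choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for Hadoop's WritableComparable (same method shapes).
interface WritableComparableLike<T> extends Comparable<T> {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical key type: a row identified only by its key string.
final class RowKeySketch implements WritableComparableLike<RowKeySketch> {
    private String key = "";

    RowKeySketch() {}                         // no-arg constructor used when deserializing
    RowKeySketch(String key) { this.key = key; }

    String key() { return key; }

    @Override public void write(DataOutput out) throws IOException { out.writeUTF(key); }
    @Override public void readFields(DataInput in) throws IOException { key = in.readUTF(); }

    // Ordering by key string (an assumption) is what makes the type usable as a sort key.
    @Override public int compareTo(RowKeySketch other) { return key.compareTo(other.key); }

    @Override public boolean equals(Object o) {
        return o instanceof RowKeySketch && key.equals(((RowKeySketch) o).key);
    }
    @Override public int hashCode() { return key.hashCode(); }

    // Serializes and then deserializes a key through an in-memory byte buffer.
    static RowKeySketch roundTrip(RowKeySketch in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            in.write(new DataOutputStream(bytes));
            RowKeySketch out = new RowKeySketch();
            out.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping `equals`, `hashCode`, and `compareTo` consistent matters here, since keys that compare as equal are expected to be grouped together.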


> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-342:
-------------------------------

    Attachment: 0006-prevent-multiple-client-initializations.txt

This patch runs initClient in the JVM that is executing the map task, and adds a check to prevent multiple initializations by the same VM.

Even with this patch, though, the fat client can't connect from multiple JVMs on the same machine: we have the fatClient VM using an address of 127.0.0.2, and multiple JVMs with the same address will fail to join the gossip. The fatClient needs to be refactored to not need to listen on a port (no gossip).
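
The "check to prevent multiple initializations by the same VM" can be pictured as a once-per-JVM guard. This is a minimal sketch and not the code in the attached patch: `ClientInitGuard` and `initOnce` are hypothetical names, and the `Runnable` stands in for the `initClient` call mentioned above.

```java
// Hypothetical guard; the patch's actual check is not shown in this thread.
final class ClientInitGuard {
    private static boolean initialized = false;

    private ClientInitGuard() {}

    /**
     * Runs `init` at most once per JVM. Returns true only for the call
     * that actually ran it. The method is synchronized, so a second
     * caller blocks until the first initialization has finished rather
     * than racing ahead with a half-initialized client.
     */
    static synchronized boolean initOnce(Runnable init) {
        if (initialized) {
            return false; // some earlier task in this JVM already did the work
        }
        init.run();         // if this throws, `initialized` stays false and a retry is possible
        initialized = true;
        return true;
    }
}
```

A guard like this only helps within one JVM. It does nothing for the separate problem described above, where several JVMs on one machine try to join gossip with the same address.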

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830188#action_12830188 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

replying to myself: 

> if you really need to run more than one hadoop jvm per machine

it looks like the minimum task jvms you can restrict hadoop to is two (one map, one reduce), and i don't think there is a sane way to give them different classpaths.  so we need to add some kind of kludge for this, short term.  (long term, i think running the tasktracker and jobs inside the cassandra jvm still makes the most sense for us.)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: 0002-CASSANDRA-342.-Working-hadoop-support.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ] 

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:16 AM:
----------------------------------------------------------------

This patch adds the ability for Cassandra databases to be read from in
a Hadoop setting. This is the "stupid" version of said support
(c.f. "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch only works when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

.h2 Building

Building the patched Cassandra requires including
`hadoop-0.20.0-core.jar` that is distributed with Hadoop 0.20.0
(obviously) in Cassandra's `$CLASSPATH`. (Which is easiest to do by
simply copying the file to Cassandra's `lib` directory.)

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. You can
run `WordCount.java` with Hadoop by first editing the
`conf/hadoop-env.sh` file and adding all of the jars in Cassandra's
lib directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

.h2 Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
  /path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar \
  org.apache.cassandra.examples.WordCount -inputspace Twitter \
  -inputfamily Tweets -outputdir outie \
  -confdir /path/to/cassandra/src/examples/conf/wordcount/
{/code}

.h2 Changes External to `cassandra.hadoop`

This patch makes two changes in the Cassandra project that are outside
of the new `hadoop` package.

 # Makes `StorageProxy.getKeyRange()` public.
 
 # `RowSerializer` is now a public class and outside of Row. This was
    done so I didn't have to rewrite the serialization code for
    writing the `RowWritable` class.

.h2 Issues

This patch does have some issues. Specifically:

 # Has no tests.

 # Cannot split the key ranges any finer than the entire key range
   that exists on each individual node. This means we cannot delegate
   to more Map tasks than there are Cassandra nodes. As we move to
   billions of keys per node, this becomes even more of an
   issue. (cf. CASSANDRA-242)

 # Cassandra currently must be booted by this Hadoop-facing code in
   order to work. This is a side effect of needing certain internal
   calls in odd places, and of the onus put upon this project to keep
   everything working internally. There is currently no way to hook
   into an external Cassandra process.

 # Has only been tested, and only works (due to the above boot code
   issues), on one Cassandra node with one Hadoop Map task.

 # Cannot take key ranges that cross over multiple nodes. This is
   a problem with how we (can't) divvy up the keys rather than any other
   problem (such as the ones described in CASSANDRA-348).

 # The current API for selecting what keys to grab cannot take
   anything more than the table/keyspace to search in and the name of
   a top-level super column.

 # Because of the lack of true "multiget" support, the reads from the
   database have a round trip cost for each key desired.

 # The API for grabbing data from a RowWritable is not well fleshed out.

 # `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

 # `RowWritable` does not implement `WritableComparable`, which would
   allow its use as a key and not just a value in a MapReduce job. 

 # `RowWritable` uses `RowSerializer` which encodes way too much
   information about the column families and columns through
   `ColumnFamilySerializer`.

 # Has a (likely inescapable) dependency on the hadoop 0.20
   core jar.

 # Really, really has no tests.

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.


> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Assigned: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-342:
----------------------------------------

    Assignee: Jonathan Ellis  (was: Jeff Hodges)

Here's my first stab at hadoop support.  I took Jeff's patches as a starting point, but the many changes we've made to Cassandra's internals since then mean the results are pretty different.
 - BootUp is no longer required; instead we use the Fat Client api
 - Switched to ColumnFamily as the unit for InputFormat, rather than KeySpace
 - Use get_range_slice instead of get_key_range
 - Use Tokens instead of Strings for range splitting
 - Add build.xml and bin/ scripts for WordCount demo

The combination of all this means we get RandomPartitioner support for free.  We also get InputSplit location information for free.

My patch 01 and 02 correspond to Jeff's 02 and 03 (no changes to Cassandra internals have been required so far).  Then my 03 is just more changes to the WordCount example (I should probably squash that...)

Still todo: breaking a node's range into multiple InputSplits (this will require minor changes to Cassandra)
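
As a rough picture of that todo, one node's token range could be cut into equal-width sub-ranges, each becoming its own InputSplit. The sketch below rests on assumptions: `TokenRangeSplitter` is a hypothetical name, tokens are treated as integers on a ring of size 2^127 (as with RandomPartitioner), and ranges are read as (start, end]. It divides by token width only, so it says nothing about how many keys land in each piece.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one node's token range (start, end] into n sub-ranges.
final class TokenRangeSplitter {
    // Assumed ring size for RandomPartitioner-style tokens.
    static final BigInteger RING = BigInteger.ONE.shiftLeft(127);

    private TokenRangeSplitter() {}

    /** Returns n consecutive {left, right} pairs covering (start, end], wrapping around the ring if needed. */
    static List<BigInteger[]> split(BigInteger start, BigInteger end, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Width measured clockwise around the ring; start == end is taken to mean the whole ring.
        BigInteger width = end.subtract(start).mod(RING);
        if (width.signum() == 0) {
            width = RING;
        }
        if (width.compareTo(BigInteger.valueOf(n)) < 0) {
            throw new IllegalArgumentException("range is too narrow for " + n + " splits");
        }
        List<BigInteger[]> splits = new ArrayList<>();
        BigInteger left = start;
        for (int i = 1; i <= n; i++) {
            // Right edge of the i-th piece; the last piece ends exactly at `end`.
            BigInteger right = (i == n)
                    ? end
                    : start.add(width.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n))).mod(RING);
            splits.add(new BigInteger[] {left, right});
            left = right;
        }
        return splits;
    }
}
```

Because each piece starts where the previous one ended, the sub-ranges cover the original range with no gaps or overlap under the (start, end] reading.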

Also: as I have said before, I don't really know Hadoop, so quite possibly I did something stupid here.  (For instance, Jeff's InputFormat used Writable subclasses for both key and value; mine uses just String and ColumnFamily since that is more natural, and the IF contract does not require Writable-ness.  Is this Bad Hadoop Form?)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745247#action_12745247 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

awesome!

now could you squash a little? :)

bear with me -- what we want to see is separate functionality in different patches (e.g. 02, 10) but not the evolution within those (04 should be squashed onto 01, i think).

(what I do with large patchsets is, i keep the original "raw" patches, then branch that for rebasing stuff around.  sometimes I end up with 7 or 8 branches before it's done. :)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, 0016-CASSANDRA-342.-When-rewriting-history-be-sure-to-rew.patch, 0017-CASSANDRA-342.-Fixed-version-of-updated-example-conf.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0017-CASSANDRA-342.-Fixed-version-of-updated-example-conf.patch
                0016-CASSANDRA-342.-When-rewriting-history-be-sure-to-rew.patch

The unsquashed version of the hadoop commits.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, 0016-CASSANDRA-342.-When-rewriting-history-be-sure-to-rew.patch, 0017-CASSANDRA-342.-Fixed-version-of-updated-example-conf.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: 0001-the-stupid-version-of-hadoop-support.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment: 0003-v4-add-WordCountSetup-multiple-tests.txt
                0002-v4-add-wordcount-hadoop-example.txt
                0001-v4-add-basic-hadoop-support-one-split-per-node.txt

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830144#action_12830144 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

> The fatClient needs to be refactored to not need to listen on a port (no gossip) 

You need to listen on a port to run MessagingService, whether or not you include gossip (remember, gossip is now TCP on the same port as everything else).  And without MessagingService you can't get responses to reads or writes, so this is a non-starter.

If you really need to run more than one Hadoop JVM per machine, you can do it by putting a storage-conf.xml on each Hadoop JVM's classpath.  IMO this is better than bundling it into the job jar anyway.
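
For anyone unsure what "on the classpath" buys you, here is a minimal sketch of how a JVM can look a file up by name on its own classpath using the standard ClassLoader.getResource call. ClasspathConfigLocator is a hypothetical helper written for this illustration; whether the Cassandra client code resolves storage-conf.xml this way is an assumption, not something this ticket confirms.

```java
import java.net.URL;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Cassandra or Hadoop.
final class ClasspathConfigLocator {
    private ClasspathConfigLocator() {}

    /** Returns the URL of the named resource on this JVM's classpath, if any. */
    static Optional<URL> locate(String resourceName) {
        // Prefer the context class loader, which is what a task JVM would
        // normally have set up; fall back to the loader of this class.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathConfigLocator.class.getClassLoader();
        }
        URL url = (loader != null)
                ? loader.getResource(resourceName)
                : ClassLoader.getSystemResource(resourceName);
        return Optional.ofNullable(url);
    }

    public static void main(String[] args) {
        Optional<URL> conf = locate("storage-conf.xml");
        System.out.println(conf.map(URL::toString)
                .orElse("storage-conf.xml not found on classpath"));
    }
}
```

Since each JVM has its own classpath, two Hadoop JVMs on one machine could each pick up a different copy of the file, which is presumably how they would end up with different ports.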

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746042#action_12746042 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

We already have a proposal from Jun over in #197 to expose the ring as a string property and (presumably) load it into a StorageService, which would more or less take care of things, although the separation could be cleaner.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753822#action_12753822 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

So, should we commit this as a useful first step?

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0015-CASSANDRA-342.-new-conf-file-format.patch
                0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch
                0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745974#action_12745974 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

thanks, all of that makes sense now.

I'm +1 on applying what Jeff has here as a first step and refining in another ticket.  Anyone else have feedback?

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829214#action_12829214 ] 

Jonathan Ellis edited comment on CASSANDRA-342 at 2/3/10 7:35 PM:
------------------------------------------------------------------

No. That would complicate the StorageProxy model unnecessarily.

If 10s or 30s to use the StorageProxy api is too long then you should probably start working on getting Hadoop to support jobs in existing [i.e., cassandra] jvms. :)

      was (Author: jbellis):
    No.
  
> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834866#action_12834866 ] 

Hudson commented on CASSANDRA-342:
----------------------------------

Integrated in Cassandra #357 (See [http://hudson.zones.apache.org/hudson/job/Cassandra/357/])
    

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0002-v6-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0003-v6-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0004-v6-sub-splits.txt, 0005-v5-jar-packaging.txt, 0005-v6-use-conf-for-inputs-and-relative-temp.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745809#action_12745809 ] 

Jeff Hodges commented on CASSANDRA-342:
---------------------------------------


Okay, before we talk about the boot code, let me address some of the
confusion about Hadoop.

In Hadoop, there are things called Jobs. A Job is the combination of a
Map operation, a Reduce operation, and the InputFormat configuration
you specify, and it is run across a bunch of machines.

A Task is an individual Map or Reduce operation run on one of those
machines (so every Job has many Tasks). For every new Task needed, a
new JVM is booted up.[1]

This is actually okay, distributed-systems-wise, because it keeps all
the Tasks from interfering with one another.

It does, however, make our jobs harder. There is no way for a Task
(and thus this Hadoop code in these patches) to access the runtime of
a Cassandra node already on the machine because they will be in
separate JVMs!

HBase, as I mentioned above, solves this problem by first starting up
HBase on those remote machines, and then having each Task create an
HTable object from the InputSplit handed to it. This HTable object
connects to the local HBase process. (Of course, this same thing
happens in the JVM that creates the InputSplits.)

So, here's my deal. There is no way for the system as currently
designed to work efficiently in a distributed setting. This is because
we have to boot a brand new Cassandra process on machines that might
already have one running (and need it, if hardware is limited). The
boot up time for Cassandra alone is a big time sink. And consider how
these nodes would interoperate with the "stable", non-Hadoop nodes
that would start sending them data. Ugh.

We can avoid all of this boot time drama if we can come up with a
good way of remotely accessing all of the internal information we need
from the Cassandra node already running. I have not been able to come
up with an alternative solution.

Comments?

[1] There is something called "Task reuse" that can be configured into
a Hadoop deployment. However, the "reuse" only means that a Task JVM
can be used for more than one Task of the same Job. So, it's basically
just another complicating factor in our boot loading code (one of the
reasons there is BootUp.boot() and BootUp.bootUnsafe()) but doesn't
help us with our problem.
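
To make the reuse complication concrete, here is a rough sketch of a boot guard that runs the expensive start-up at most once per JVM, however many Tasks share that JVM. OnceOnlyBoot is a hypothetical stand-in written for this comment, not the BootUp class from the patches, and the split I show between a guarded boot() and an unguarded bootUnsafe() is only my guess at one reasonable meaning for those names.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for illustration; the real BootUp class in the
// patches may be structured quite differently.
final class OnceOnlyBoot {
    private static boolean booted = false;
    // Counts how many times the expensive start-up actually ran.
    static final AtomicInteger bootCount = new AtomicInteger();

    private OnceOnlyBoot() {}

    /**
     * Guarded entry point: performs start-up only on the first call in
     * this JVM. Returns true if this call did the work.
     */
    static synchronized boolean boot() {
        if (booted) {
            return false; // a previous Task in this JVM already booted
        }
        bootUnsafe();
        booted = true;
        return true;
    }

    /** Unguarded start-up; the caller must know it has not run before. */
    static void bootUnsafe() {
        // Placeholder for the expensive initialization work.
        bootCount.incrementAndGet();
    }

    public static void main(String[] args) {
        boot(); // first Task in the JVM does the work
        boot(); // a reused JVM skips it
        System.out.println("start-up ran " + bootCount.get() + " time(s)");
    }
}
```

The guard keeps a reused JVM from booting twice, but it does nothing about the cost of the first boot in every fresh JVM, which is the real problem described above.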

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745641#action_12745641 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

The only comment I have on the example is that you should probably strip the comments out of the sample XML copy.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch
                0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch
                0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832879#action_12832879 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

Thanks a lot! I'm +1 on getting this example in, but I still want to discuss 775.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0002-v6-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0003-v6-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0004-v6-sub-splits.txt, 0005-v5-jar-packaging.txt, 0005-v6-use-conf-for-inputs-and-relative-temp.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
                0002-v3-CASSANDRA-342.-Working-hadoop-support.patch
                0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch

Okay, so here's a new set of patches that we'll call v3.

ICompactSerializer2 is now used by RowSerializer (I had forgotten it
existed!). The throw-away variables have been tossed. The WordCount
storage-conf.xml has been edited down. And the call to CalloutManager
has been removed as it no longer exists in trunk.

The boot code I'm going to discuss in my next comment.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E


[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745625#action_12745625 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

DataOutputStream implements DataOutput -- can you just switch RowSerializer to ICompactSerializer2, which only requires the latter?  (The goal is to eventually get rid of ICompactSerializer, then rename ICompactSerializer2.)
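
To illustrate the point about `DataOutput`: a serializer written against the interface accepts a `DataOutputStream` unchanged. The sketch below is only an illustration. `SketchSerializer`, `KeyRow`, and `KeyRowSerializer` are hypothetical stand-ins, not the actual `ICompactSerializer2`, `Row`, or `RowSerializer` definitions from the Cassandra tree.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical serializer contract that depends only on the
// DataOutput/DataInput interfaces, not on concrete stream classes.
interface SketchSerializer<T> {
    void serialize(T value, DataOutput out) throws IOException;
    T deserialize(DataInput in) throws IOException;
}

// Hypothetical row type: a key plus an opaque payload.
final class KeyRow {
    final String key;
    final byte[] payload;

    KeyRow(String key, byte[] payload) {
        this.key = key;
        this.payload = payload;
    }
}

final class KeyRowSerializer implements SketchSerializer<KeyRow> {
    @Override
    public void serialize(KeyRow row, DataOutput out) throws IOException {
        out.writeUTF(row.key);
        out.writeInt(row.payload.length);
        out.write(row.payload);
    }

    @Override
    public KeyRow deserialize(DataInput in) throws IOException {
        String key = in.readUTF();
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return new KeyRow(key, payload);
    }

    // Round-trips a row through in-memory streams. A DataOutputStream is
    // passed where DataOutput is expected, with no adapter in between.
    static KeyRow roundTrip(KeyRow row) {
        KeyRowSerializer serializer = new KeyRowSerializer();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            serializer.serialize(row, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return serializer.deserialize(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the contract names only the interfaces, the same serializer would also work with any other `DataOutput` implementation.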

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829308#action_12829308 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

New patchset attached.  Old patches 2 and 3 were squashed together as predicted, and some fixes were made to 01.  New patches are

04
    sub splits: split node ranges into smaller groups of keys (currently hardcoded to 4096)

03
    make predicate configurable

I'd like to get whatever else falls under the heading of "the least we can possibly do and be useful" done and call this ticket good, and add enhancements in future tickets.
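
As a rough illustration of the sub-split idea in patch 04, the sketch below chunks an ordered list of keys into groups of at most 4096. It assumes the keys of one node's range are already available as an in-memory list, which is a simplification; the class and method names are hypothetical and the actual patch may compute its splits quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SubSplitSketch {
    // Mirrors the hardcoded group size mentioned for patch 04.
    static final int KEYS_PER_SPLIT = 4096;

    // Partitions an ordered key list into consecutive groups of at most
    // keysPerSplit keys. Each group is a read-only view of its slice.
    static <K> List<List<K>> subSplits(List<K> orderedKeys, int keysPerSplit) {
        if (keysPerSplit <= 0) {
            throw new IllegalArgumentException("keysPerSplit must be positive");
        }
        List<List<K>> splits = new ArrayList<>();
        for (int start = 0; start < orderedKeys.size(); start += keysPerSplit) {
            int end = Math.min(start + keysPerSplit, orderedKeys.size());
            splits.add(Collections.unmodifiableList(orderedKeys.subList(start, end)));
        }
        return splits;
    }
}
```

With this shape, a node holding 10,000 keys would yield three groups (4096, 4096, and 1808 keys), so more than one Map task could be assigned per node.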

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0015-CASSANDRA-342.-new-conf-file-format.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830168#action_12830168 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

committed part of 06 as r907005.  I'm working the rest of 05 and 06 into my next patchset.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829216#action_12829216 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

> probably we need to provide a API to sync the cluster state from one of the server?
I don't think that blocks this patch, but we can create another issue once this one is closed.

It would probably be pretty easy to add a mode to the Gossiper where it was especially promiscuous, and asked a few nodes in the cluster for their full token list to get into a stable state quicker.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch
                0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch
                0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829214#action_12829214 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

No.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0017-CASSANDRA-342.-Fixed-version-of-updated-example-conf.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744001#action_12744001 ] 

Jeff Hodges edited comment on CASSANDRA-342 at 8/17/09 2:21 AM:
----------------------------------------------------------------

This patch adds the ability to read from Cassandra databases in
a Hadoop setting. This is the "stupid" version of said support
(cf. the "not-stupid" discussion
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch only works when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

h2. Building

Building the patched Cassandra requires including
`hadoop-0.20.0-core.jar` that is distributed with Hadoop 0.20.0
(obviously) in Cassandra's `$CLASSPATH`. (Which is easiest to do by
simply copying the file to Cassandra's `lib` directory.)

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. You can
run `WordCount.java` with Hadoop by first editing the
`conf/hadoop-env.sh` file and adding all of the jars in Cassandra's
lib directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

h2. Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
  /path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar\
  org.apache.cassandra.examples.WordCount -inputspace Twitter \
  -inputfamily Tweets -outputdir outie \
  -confdir /path/to/cassandra/src/examples/conf/wordcount/
{code}
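
For readers unfamiliar with the example, the counting that a word-count job performs over the `text` column values can be sketched in plain Java, without Hadoop. This is not the `WordCount.java` from the patch: the class and method names are hypothetical, and splitting on whitespace is an assumption about how words are tokenized.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Counts whitespace-separated words across all supplied text values.
    // A TreeMap keeps the result sorted by word.
    static Map<String, Integer> countWords(List<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            for (String word : text.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank value splits into one empty token
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In a MapReduce job the inner loop corresponds roughly to the map step (emit each word with a count of one) and the `merge` call to the reduce step (sum the counts per word).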

h2. Changes External to `cassandra.hadoop`

This patch makes two changes in the Cassandra project that are outside
of the new `hadoop` package.

 # Makes `StorageProxy.getKeyRange()` public.
 
 # `RowSerializer` is now a public class and outside of Row. This was
    done so I didn't have to rewrite the serialization code for
    writing the `RowWritable` class.

h2. Issues

This patch does have some issues. Specifically:

 # Has no tests.

 # Cannot split the key ranges any finer than the entire key range
   that exists on each individual node. This means we cannot delegate
   to more Map tasks than there are Cassandra nodes. As we move to
   billions of keys per node, this is even more of an
   issue. (cf. CASSANDRA-242)

 # Cassandra currently must be booted by this Hadoop-facing code in
   order to work as a side effect of needing certain internal calls in
   odd places and the onus put upon this project to keep everything
   working internally. There is currently no way to hook into an
   external Cassandra process.

 # Has only been tested, and only works (due to the above boot code
   issues), on one Cassandra node with one Hadoop Map task.

 # Cannot take key ranges that cross over multiple nodes. This is
   a problem with how we (can't) divvy up the keys rather than any other
   problem (such as the ones described in CASSANDRA-348).

 # The current API for selecting what keys to grab cannot take
   anything more than the table/keyspace to search in and the name of
   a top-level super column.

 # Because of the lack of true "multiget" support, the reads from the
   database have a round trip cost for each key desired.

 # The API for grabbing data from a `RowWritable` is not well fleshed out.

 # Only provides read capability, no writing.

 # `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

 # `RowWritable` does not implement `WritableComparable`, which would
   allow its use as a key and not just a value in a MapReduce job. 

 # `RowWritable` uses `RowSerializer` which encodes way too much
   information about the column families and columns through
   `ColumnFamilySerializer`.

 # Has a (likely inescapable) dependency on the hadoop 0.20
   core jar.

 # Really, really has no tests.

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.
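
To make the `WritableComparable` item in the list above concrete: a type used as a MapReduce key needs both a serialized form and a total ordering. The sketch below shows that combination for a plain string key. `SketchWritable` is a local stand-in declared here so the example compiles without Hadoop on the classpath (it is meant to resemble the `write`/`readFields` shape of Hadoop's `Writable`), and `RowKeySketch` is hypothetical, not the patch's `RowWritable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for a Writable-style contract (an assumption for this sketch).
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A key type that is both serializable and ordered, which is what a
// comparable key requires.
final class RowKeySketch implements SketchWritable, Comparable<RowKeySketch> {
    private String key;

    RowKeySketch() { this(""); }              // no-arg constructor for readFields()
    RowKeySketch(String key) { this.key = key; }

    String key() { return key; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeUTF(key); }

    @Override
    public void readFields(DataInput in) throws IOException { key = in.readUTF(); }

    @Override
    public int compareTo(RowKeySketch other) { return key.compareTo(other.key); }

    @Override
    public boolean equals(Object o) {
        return o instanceof RowKeySketch && key.equals(((RowKeySketch) o).key);
    }

    @Override
    public int hashCode() { return key.hashCode(); }

    // Copies a key by writing it out and reading it back in.
    static RowKeySketch copyViaSerialization(RowKeySketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            RowKeySketch copy = new RowKeySketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping `compareTo` consistent with `equals` and `hashCode` matters here, since a framework that sorts and groups by key relies on all three agreeing.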


> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch
                0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch
                0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Vijay (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829205#action_12829205 ] 

Vijay commented on CASSANDRA-342:
---------------------------------

Starting the fat client itself takes a lot of time (around 30 seconds), and the time to sync the cluster state is unknown too. We probably need to provide an API to sync the cluster state from one of the servers?

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829752#action_12829752 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

it looks like your 05 squashes everything together?

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829466#action_12829466 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

 * The storage-conf.xml file for contrib/word_count could probably be a diff that is applied during the build process
 * Why does getBootstrapToken use getRandomToken still, as opposed to midpoint?

About to play around with this on a cluster... I'll let you know how it goes.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment:     (was: v2-squashed-commits-for-hadoop-stupid.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770775#action_12770775 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

For posterity, a link to Jeff's longer explanation of where he was going here: http://markmail.org/message/5qou35zzdv7uzup6

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753822#action_12753822 ] 

Jonathan Ellis edited comment on CASSANDRA-342 at 10/28/09 4:10 AM:
--------------------------------------------------------------------

So, should we commit this as a useful first step?

(Edit: I think the answer is yes, but the question has become academic as the latest patches no longer apply.)

      was (Author: jbellis):
    So, should we commit this as a useful first step?
  
> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745349#action_12745349 ] 

Jeff Hodges commented on CASSANDRA-342:
---------------------------------------

So, my biggest problem with this patch right now is the boot up code and the way it combines with the local-only query code. It forces us into booting a brand new cassandra instance that assumes the data is already there and ready for the taking but only when a MapReduce task is being done. This is all sorts of bad news. 

There does not seem to be a way of getting to the internals of Cassandra we need (reading from and writing to the disk and memtable, figuring out what keys are on what nodes, etc.) without also having to boot all of the various Cassandra services. 

I'm looking for input on how we can get around that. 

FYI, the HBase way is to have HBase running on the machine already and throw up a connection to it from another process that is created with the information from the InputSplit (on the map task machines) and from the config files (on the initial machine that creates the InputSplits).

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSNADRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746066#action_12746066 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

Good call... the RingCache mentioned on CASSANDRA-197 is exactly what this ticket needs.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Assigned: (CASSANDRA-342) hadoop integration

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Goffinet reassigned CASSANDRA-342:
----------------------------------------

    Assignee: Jeff Hodges

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0003-CASSNADRA-342.-Adding-the-WordCount-example.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745809#action_12745809 ] 

Jeff Hodges edited comment on CASSANDRA-342 at 8/20/09 10:55 PM:
-----------------------------------------------------------------


Okay, before we talk about the boot code, let me address some of the confusion about Hadoop.

In Hadoop, a Job is the combination of a Map operation, a Reduce operation, and the InputFormat configuration you specify, all of which is then run across a bunch of machines.

A Task is an individual Map or Reduce operation run on one of those machines (so every Job has many Tasks). For every new Task needed, a new JVM is booted up.[1]

This is actually okay, distributed-systems-wise, because it keeps all the Tasks from interfering with one another.

It does, however, make our jobs harder. There is no way for a Task (and thus this Hadoop code in these patches) to access the runtime of a Cassandra node already on the machine because they will be in separate JVMs!

HBase, as I mentioned above, solves this problem by first starting up HBase on those remote machines, and then having each Task create an HTable object from the InputSplit handed to it. This HTable object connects to the local HBase process. (Of course, this same thing happens in the JVM that creates the InputSplits.)
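
The pattern described above, where the split is the only thing that crosses the JVM boundary and each Task uses it to connect to a process that is already running, can be sketched in plain Java. `RangeSplit` and its fields are hypothetical, and Hadoop's real `InputSplit` and `Writable` types are deliberately left out so the sketch stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical split: just enough information for a Task in another JVM
// to know which range to read and which already-running node to contact.
final class RangeSplit {
    final String startToken;
    final String endToken;
    final String host;

    RangeSplit(String startToken, String endToken, String host) {
        this.startToken = startToken;
        this.endToken = endToken;
        this.host = host;
    }

    // Serialize on the machine that creates the splits.
    void write(DataOutput out) throws IOException {
        out.writeUTF(startToken);
        out.writeUTF(endToken);
        out.writeUTF(host);
    }

    // Deserialize inside the Task's JVM, in the same field order.
    static RangeSplit read(DataInput in) throws IOException {
        return new RangeSplit(in.readUTF(), in.readUTF(), in.readUTF());
    }

    // Simulates the trip between JVMs with an in-memory byte buffer.
    static RangeSplit roundTrip(RangeSplit split) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            split.write(new DataOutputStream(bytes));
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Once the Task has the deserialized split, it would open a client connection to `host` instead of booting its own storage process.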

So, here's my deal. There is no way for this currently designed system to work efficiently in a distributed system. This is because we have to boot a brand new Cassandra process on machines that might already have (and need, if hardware is limited) one running. The boot up time for Cassandra alone is a big time sink. And consider how these nodes would interoperate with the "stable", non-Hadoop nodes that would start sending them data. Ugh.

We can avoid all of this boot time drama if we can come up with a good way of remotely accessing all of the internal information we need from the Cassandra node already running. I have not been able to come up with an alternative solution.

Comments?

[1] There is something called "Task reuse" that can be configured into a Hadoop deployment. However, the "reuse" only means that a Task can be used more than once for the same Job. So, it's basically just another complicating factor in our boot loading code (one of the reasons there is BootUp.boot() and BootUp.bootUnsafe()) but doesn't help us with our problem.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Issue Comment Edited: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829214#action_12829214 ] 

Jonathan Ellis edited comment on CASSANDRA-342 at 2/3/10 7:36 PM:
------------------------------------------------------------------

No. That would complicate the StorageProxy model unnecessarily.

If 10s or 30s to use the StorageProxy api is too long then you should probably start working on getting Hadoop to support jobs in existing [i.e., cassandra] jvms. :)

Alternatively you could write your own query parallelization code that uses the same hooks we are building Hadoop support on.  There's nothing magic about Hadoop per se.
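
As a rough illustration of that last point, hand-rolled query parallelization can be as small as the sketch below: run one query per range on a thread pool and combine the results. `ParallelRangeQuery`, `sumOverRanges`, and the idea of a per-range query returning a count are all assumptions made for the example. None of this is Cassandra or Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

final class ParallelRangeQuery {
    // Runs `query` once per range on a fixed-size pool and sums the results.
    static <R> long sumOverRanges(List<R> ranges, ToLongFunction<R> query, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (R range : ranges) {
                Callable<Long> task = () -> query.applyAsLong(range);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // blocks until that range's query finishes
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real setup the per-range function would issue a range query against the cluster. Here it is left as a parameter so the fan-out logic can be read on its own.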

      was (Author: jbellis):
    No. That would complicate the StorageProxy model unnecessarily.

If 10s or 30s to use the StorageProxy api is too long then you should probably start working on getting Hadoop to support jobs in existing [i.e., cassandra] jvms. :)
  
> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746046#action_12746046 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

sorry, CASSANDRA-197

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831015#action_12831015 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

v6 attached.  This switches to using Thrift to get range splits (and assumes CASSANDRA-775), and incorporates Stu's other v5 feedback.
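
To make "one split per node" and "sub-splits" concrete, here is a minimal sketch of deriving ranges from a list of ring tokens. It assumes each node owns the range ending at its own token, and it cuts sub-splits by equal token width. The attached patches may well size sub-splits differently, and `RingSplits` and its methods are hypothetical names, not the code in the patch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RingSplits {
    // One (left, right] range per node; each pair is {left, right}.
    static List<BigInteger[]> perNodeRanges(List<BigInteger> tokens) {
        List<BigInteger> sorted = new ArrayList<>(tokens);
        Collections.sort(sorted);
        List<BigInteger[]> ranges = new ArrayList<>();
        int n = sorted.size();
        for (int i = 0; i < n; i++) {
            // The lowest token's range wraps around from the highest token.
            BigInteger left = sorted.get((i + n - 1) % n);
            ranges.add(new BigInteger[] { left, sorted.get(i) });
        }
        return ranges;
    }

    // Cuts a non-wrapping range (left, right] into `parts` contiguous sub-ranges.
    static List<BigInteger[]> subSplits(BigInteger left, BigInteger right, int parts) {
        BigInteger width = right.subtract(left);
        BigInteger count = BigInteger.valueOf(parts);
        List<BigInteger[]> out = new ArrayList<>();
        BigInteger start = left;
        for (int i = 1; i <= parts; i++) {
            // The last sub-range ends exactly at `right` so nothing is lost to rounding.
            BigInteger end = (i == parts)
                    ? right
                    : left.add(width.multiply(BigInteger.valueOf(i)).divide(count));
            out.add(new BigInteger[] { start, end });
            start = end;
        }
        return out;
    }
}
```

For tokens 10, 20 and 30, `perNodeRanges` yields (30, 10], (10, 20] and (20, 30], and the first of these is the range that wraps around the ring.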

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0002-v6-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0003-v6-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0004-v6-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747554#action_12747554 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

I take back what I said about RingCache: you'll still need to be able to send a RangeCommand, which requires a MessagingService. I suggest we take the approach of embedding a MessagingService like CASSANDRA-337 does.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-342:
-------------------------------

    Attachment: 0005-v5-jar-packaging.txt

 * The files in contrib/word_count/bin don't have the execute bit
 * The caller needs to include their Hadoop configuration on the classpath
 * The WordCount job and all dependencies need to be wrapped in a Jar for submission
 * storage-conf.xml needs to be packaged in the Jar, and configurable via the classpath

After applying the attached patch, I managed to get this example running on the cluster, but gossip isn't kicking in correctly yet, because the InputFormat isn't initializing the fat client.
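
For the last bullet, a minimal sketch of looking up a configuration file on the classpath, so that the copy packaged in the job Jar is found on whichever machine the Task runs. `ClasspathConfig` and `find` are hypothetical names, and this is not necessarily how the attached patch locates storage-conf.xml.

```java
import java.net.URL;
import java.util.Optional;

final class ClasspathConfig {
    // Returns the URL of the named resource if it is visible on the classpath.
    static Optional<URL> find(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Some threads have no context class loader; fall back to the system one.
            loader = ClassLoader.getSystemClassLoader();
        }
        return Optional.ofNullable(loader.getResource(resourceName));
    }
}
```

A caller would use something like `ClasspathConfig.find("storage-conf.xml")` and fail with a clear message when the result is empty.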

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799513#action_12799513 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

Not a whole lot worse than allowing arbitrary code into a different JVM on the same node, really.  What are they going to do, read data they shouldn't?  Remember we don't even have auth yet, it's very much a "power tools can maim" thing.

(Note that I didn't say it should be the *only* option for Hadoop but it should definitely be *an* option.)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch
                0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch
                0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Initial-commit-of-hadoop-support.-Doe.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Breaking-RowSerializer-out-as-a-publi.patch, 0003-CASSANDRA-342.-Creating-RowWritable-with-required-Wr.patch, 0004-CASSANDRA-342.-Make-the-hadoop-classes-public.patch, 0005-CASSANDRA-342.-Adding-how-to-set-a-table-and-column-.patch, 0006-CASSANDRA-342.-Start-up-cassandra-s-data-connections.patch, 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch, 0008-CASSANDRA-342.-Handle-empty-key-ranges-correctly.patch, 0009-CASSANDRA-342.-Adding-the-WordCount-hadoop-example.patch, 0010-CASSANDRA-342.-add-public-T-originalToken-to-Token-f.patch, 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch, 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch, 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch, 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch, 0015-CASSANDRA-342.-new-conf-file-format.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0007-CASSANDRA-342.-Breaking-out-the-boot-up-code-separat.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-342:
-------------------------------

    Attachment: 0005-v6-use-conf-for-inputs-and-relative-temp.txt

With this patch, I was able to run the job on a Hadoop cluster.

 * Changes to a relative tmp directory, since most users won't have access to the Hadoop /tmp.
 * Uses the Configuration object to pass in the columnName, since the mutable static var doesn't survive distribution.
 * Ensures that an un-jar'd version of storage-conf.xml is first on the classpath for the benefit of the local JVM.
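
The second point can be illustrated without Hadoop itself. The following is a minimal sketch that uses java.util.Properties as a stand-in for Hadoop's Configuration object; the class name ConfigRoundTrip and the key cassandra.input.column.name are hypothetical and not taken from the patch. The driver serializes its settings and the task rebuilds them from the shipped bytes, which is the only state that reaches a remote JVM. A static field assigned in the driver is not part of those bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ConfigRoundTrip {
    // Hypothetical key name; the actual patch may use a different one.
    static final String COLUMN_NAME_KEY = "cassandra.input.column.name";

    // Driver side: store the setting in the job configuration and serialize it.
    static byte[] driverSide(String columnName) {
        Properties conf = new Properties();
        conf.setProperty(COLUMN_NAME_KEY, columnName);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            conf.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Task side: rebuild the configuration from the shipped bytes only.
    static String taskSide(byte[] shipped) {
        Properties conf = new Properties();
        try {
            conf.load(new ByteArrayInputStream(shipped));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf.getProperty(COLUMN_NAME_KEY);
    }

    public static void main(String[] args) {
        // The task sees the column name because it travelled with the config.
        System.out.println(taskSide(driverSide("text")));
    }
}
```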

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0002-v6-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0003-v6-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0004-v6-sub-splits.txt, 0005-v5-jar-packaging.txt, 0005-v6-use-conf-for-inputs-and-relative-temp.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744002#action_12744002 ] 

Jeff Hodges commented on CASSANDRA-342:
---------------------------------------

I could also provide an unsquashed version of the commits that led to this patch, if needed. I didn't do it because the task seemed a bit arduous for the 14 commits involved and git-jira-attacher does not seem to work at the moment.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745635#action_12745635 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

we can clean the bootup stuff a lot.  let's take the bootupunsafe method that shares code w/ cassandra daemon, make it a public static method there, move the LogUtil.init call into StorageService.init, and remove CassandraServer.init.  that will be useful for other people looking to embed cassandra too.

style note: inline throw-away variables like String ll.

+        keyspace = DatabaseDescriptor.getTable(keyspace);
+        if (keyspace == null)

inline that too rather than re-binding keyspace to a different meaning (although of the same type)
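
A minimal sketch of the difference, assuming a lookup that returns a String as the quoted diff implies. The class KeyspaceLookup and its getTable helper are hypothetical stand-ins, not the real DatabaseDescriptor API:

```java
import java.util.Map;

final class KeyspaceLookup {
    // Hypothetical stand-in for a lookup such as DatabaseDescriptor.getTable(String).
    static String getTable(Map<String, String> tables, String name) {
        return tables.get(name);
    }

    // Before: 'keyspace' starts as the requested name and is then re-bound
    // to the lookup result, so one variable carries two meanings.
    static void checkRebinding(Map<String, String> tables, String keyspace) {
        keyspace = getTable(tables, keyspace);
        if (keyspace == null)
            throw new IllegalArgumentException("unknown keyspace");
    }

    // After: the lookup is inlined, so 'keyspace' always means the requested
    // name and can still be reported in the error message.
    static void checkInlined(Map<String, String> tables, String keyspace) {
        if (getTable(tables, keyspace) == null)
            throw new IllegalArgumentException("unknown keyspace: " + keyspace);
    }
}
```

One side benefit of the inlined form is that the original name is still available when the lookup fails.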

there are a lot of FIXME but it looks like a good start :)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746026#action_12746026 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

Since the Hadoop InputFormat can't possibly be in the same process as the Cassandra server on the local machine, perhaps the cleanest interface between them would be a 'private' client library (not using Thrift) that allows for the calls you need? The only call I see to an internal API is StorageService.getRangeToEndPointMap, so we could add a Verb that queries a remote/local running node for the same information.

It doesn't look like it would be too difficult to extract just the net.MessagingService portion of the Cassandra server, and then use that as the private client interface to send messages to other nodes.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0011-CASSANDRA-342.-Hadoop-integration-with-one-cassandra.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: v2-squashed-commits-for-hadoop-stupid.patch

Corrected version of the earlier squashed patch. An unsquashed version will follow as soon as git-jira-attacher.py works for me.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0003-CASSNADRA-342.-Adding-the-WordCount-example.patch
                0002-CASSANDRA-342.-Working-hadoop-support.patch
                0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch

0001) Provides a few changes needed to support the new Hadoop bridge. Specifically, StorageProxy.getKeyRange() becomes public, RowSerializer becomes accessible to the hadoop package classes, and Token#originalToken() is added.

0002) This adds the actual Hadoop bridge code. It includes subclasses of InputFormat, RecordReader, and InputSplit. It also provides a nice RowWritable class to be used as the value passed to a map. This is also the commit that adds the really nasty boot up code in the class BootUp.

0003) This just adds a simple WordCount example of the Cassandra/Hadoop bridge.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSNADRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment: 0001-the-stupid-version-of-hadoop-support.patch

This patch adds the ability to read Cassandra databases from within
Hadoop. This is the "stupid" version of said support
(cf. the "not-stupid" discussion at
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
).

This patch only works when run in the non-distributed `./bin/hadoop
jar` environment of Hadoop.

h2. Building

Building the patched Cassandra requires the
`hadoop-0.20.0-core.jar` distributed with Hadoop 0.20.0 to be on
Cassandra's `$CLASSPATH`. The easiest way to do that is to copy the
file into Cassandra's `lib` directory.

An example of the adapter's use can be found in
`src/examples/org/apache/cassandra/examples/WordCount.java`. You can
run `WordCount.java` with Hadoop by first editing the
`conf/hadoop-env.sh` file and adding all of the jars in Cassandra's
lib directory to `$HADOOP_CLASSPATH`. Building the examples is then
just a matter of running `ant examples`.

h2. Running

Running the example is straightforward after that. Assuming you've
added some tweets to the Tweets column family with a column called
`text` filled with, you know, text:

{code}
./bin/hadoop jar \
  /path/to/cassandra/build/apache-cassandra-incubating-examples-0.4.0-dev.jar\
  org.apache.cassandra.examples.WordCount -inputspace Twitter \
  -inputfamily Tweets -outputdir outie \
  -confdir /path/to/cassandra/src/examples/conf/wordcount/
{code}

h2. Changes External to `cassandra.hadoop`

This patch makes two changes in the Cassandra project that are outside
of the new `hadoop` package.

 # Makes `StorageProxy.getKeyRange()` public.
 
 # `RowSerializer` is now a public class and outside of Row. This was
    done so I didn't have to rewrite the serialization code for
    writing the `RowWritable` class.

h2. Issues

This patch does have some issues. Specifically:

 # Has no tests.

 # Cannot split the key ranges any finer than the entire key range
   that exists on each individual node. This means we cannot delegate
   to more Map tasks than there are Cassandra nodes. As we move to
   billions of keys per node, this becomes even more of an
   issue. (cf. CASSANDRA-242)

 # Cassandra must currently be booted by this Hadoop-facing code in
   order to work. This is a side effect of needing certain internal
   calls in odd places and of the onus put upon this project to keep
   everything working internally. There is currently no way to hook
   into an external Cassandra process.

 # Has only been tested, and only works (due to the above boot code
   issues), on one Cassandra node with one Hadoop Map task.

 # Cannot take key ranges that cross over multiple nodes. This is
   a problem with how we (can't) divvy up the keys rather than any other
   problem (such as the ones described in CASSANDRA-348).

 # The current API for selecting what keys to grab cannot take
   anything more than the table/keyspace to search in and the name of
   a top-level super column.

 # The API for grabbing data from a RowWritable is not well fleshed out.

 # `KeyspaceRecordReader#getProgress()` is nothing more than a stub.

 # `RowWritable` does not implement `WritableComparable`, which would
   allow its use as a key and not just a value in a MapReduce job. 

 # `RowWritable` uses `RowSerializer` which encodes way too much
   information about the column families and columns through
   `ColumnFamilySerializer`.

 # Has a (likely inescapable) dependency on the hadoop 0.20
   core jar.

 # Really, really has no tests.
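
The one-split-per-node limitation in the second issue could in principle be relaxed by dividing each node's token range into smaller sub-ranges. The following is a minimal sketch under the assumption that tokens can be treated as integers in a non-wrapping range (start < end). SubSplitter and its split method are hypothetical names, not part of this patch, and real partitioners may need a different notion of midpoint:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class SubSplitter {
    // Divide the range (start, end] into at most 'parts' contiguous sub-ranges.
    // Each element of the result is a two-entry array {subStart, subEnd}.
    static List<BigInteger[]> split(BigInteger start, BigInteger end, int parts) {
        if (parts < 1 || start.compareTo(end) >= 0)
            throw new IllegalArgumentException("need parts >= 1 and start < end");
        BigInteger width = end.subtract(start);
        BigInteger n = BigInteger.valueOf(parts);
        List<BigInteger[]> result = new ArrayList<>();
        BigInteger prev = start;
        for (int i = 1; i <= parts; i++) {
            // Boundary i is start + width * i / parts; the last boundary equals end.
            BigInteger next = start.add(width.multiply(BigInteger.valueOf(i)).divide(n));
            if (next.compareTo(prev) > 0) { // skip sub-ranges that would be empty
                result.add(new BigInteger[] { prev, next });
                prev = next;
            }
        }
        return result;
    }
}
```

Each sub-range could then back its own Map task, so the number of tasks is no longer capped by the number of nodes.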

I could go into more detail about some of these issues, but this is
already too long and the discussion adds even more text.


> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment: 0004-v6-sub-splits.txt
                0003-v6-make-predicate-configurable.txt
                0002-v6-add-wordcount-hadoop-example.txt
                0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0002-v6-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0003-v6-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0004-v6-sub-splits.txt, 0005-v5-jar-packaging.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-342:
-------------------------------

    Attachment:     (was: 0005-v5-jar-packaging.txt)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745639#action_12745639 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

i'm a little confused by the bootup business though -- if this code is running on a cassandra node, can't it just use the already-running server?

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839322#action_12839322 ] 

Stu Hood commented on CASSANDRA-342:
------------------------------------

We should add a CHANGES entry for this.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0001-v6-add-basic-hadoop-support-using-Thrift-one-split-per.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0002-v6-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0003-v6-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0004-v6-sub-splits.txt, 0005-v5-jar-packaging.txt, 0005-v6-use-conf-for-inputs-and-relative-temp.txt, 0006-prevent-multiple-client-initializations.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0012-CASSANDRA-342.-Adding-the-required-confdir-flag-to-t.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0016-CASSANDRA-342.-When-rewriting-history-be-sure-to-rew.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747712#action_12747712 ] 

Jun Rao commented on CASSANDRA-342:
-----------------------------------

Why can't the recordReader get the rows in a range through thrift? This can be done by first calling get_key_range, followed by a bunch of gets, one for each row.
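
A minimal sketch of that approach, assuming a key-range call that returns keys in order with an inclusive start key. The RowClient interface and RangeRowIterator class are hypothetical; a real implementation would wrap the Thrift client, whose actual signatures differ from this simplification:

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical minimal client; a real one would wrap the Thrift calls.
interface RowClient {
    // Up to 'max' keys in [startKey, endKey], in key order (get_key_range-like).
    List<String> getKeyRange(String startKey, String endKey, int max);
    // The row stored under 'key' (one get per row).
    String get(String key);
}

final class RangeRowIterator implements Iterator<String> {
    private final RowClient client;
    private final String endKey;
    private final int batchSize;
    private final ArrayDeque<String> pending = new ArrayDeque<>();
    private String nextStart;
    private String lastSeen;      // last key handed out by an earlier batch
    private boolean exhausted;

    RangeRowIterator(RowClient client, String startKey, String endKey, int batchSize) {
        if (batchSize < 2)
            throw new IllegalArgumentException("batchSize must be >= 2 to make progress");
        this.client = client;
        this.nextStart = startKey;
        this.endKey = endKey;
        this.batchSize = batchSize;
    }

    // Fetch the next batch of keys when the current one has been consumed.
    private void fill() {
        while (pending.isEmpty() && !exhausted) {
            List<String> keys = client.getKeyRange(nextStart, endKey, batchSize);
            if (keys.size() < batchSize)
                exhausted = true; // a short batch means the range is finished
            // The start key is inclusive, so later batches repeat the last key seen.
            int from = (!keys.isEmpty() && keys.get(0).equals(lastSeen)) ? 1 : 0;
            for (int i = from; i < keys.size(); i++)
                pending.add(keys.get(i));
            if (!keys.isEmpty()) {
                lastSeen = keys.get(keys.size() - 1);
                nextStart = lastSeen;
            }
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        fill();
        if (pending.isEmpty())
            throw new NoSuchElementException();
        return client.get(pending.poll()); // one get per row, as suggested above
    }
}
```

The trade-off of this design is one round trip per row, which keeps the reader simple but may be slow for large ranges.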

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0013-CASSANDRA-342.-Removing-the-unneeded-DataInput-and-D.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745687#action_12745687 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

btw, i think this was a good patch breakdown.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-the-stupid-version-of-hadoop-support.patch, 0002-CASSANDRA-342.-Working-hadoop-support.patch, 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799472#action_12799472 ] 

Todd Lipcon commented on CASSANDRA-342:
---------------------------------------

You really want to allow arbitrary user code into your cassandra JVM? That seems like a recipe for disaster.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Vijay (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806447#action_12806447 ] 

Vijay commented on CASSANDRA-342:
---------------------------------

Hadoop integration might need the following:

1) An API to return the list of splits, given the number of splits.
Using these tokens we can spawn an equal number of MR jobs (with a setting in the MR job, chosen according to the complexity of the processing, that says how many map tasks to run per partition) and spawn those processes.
-- We have getSplit(int count), which will do this for us.

2) A "start token to stream" API.
Input will be Range(String startKey, Token start, Token finish, int limit).... return will be 
    If Startwithkey is empty we will use token1 as the starting point for the stream; otherwise we will use Startwithkey to specify the key to start with. Make sense?
-- Needs an additional method.

So each MR job will get its range of data from Cassandra and process it; it can also stream the data and doesn't need to fetch all of it at once.
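
For illustration only, the two calls described above could be sketched roughly as follows. This is not the attached patch and not Cassandra's API: the class SplitSketch, the TokenRange type, the toy token() function, and the getSplits/readRange signatures are all assumptions made up for this sketch, with an in-memory sorted map standing in for a node's rows. It shows (1) cutting a token range into a requested number of splits and (2) reading one split in pages, starting either at the split's start token or at a given start key.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SplitSketch {
    /** Hypothetical stand-in for a token range (start, finish]. */
    static final class TokenRange {
        final BigInteger start, finish;
        TokenRange(BigInteger start, BigInteger finish) { this.start = start; this.finish = finish; }
    }

    /** Toy token (assumption): the key's UTF-8 bytes read as an unsigned integer. */
    static BigInteger token(String key) {
        return new BigInteger(1, key.getBytes(StandardCharsets.UTF_8));
    }

    /** Cut (min, max] into `count` contiguous ranges of near-equal width. */
    static List<TokenRange> getSplits(BigInteger min, BigInteger max, int count) {
        List<TokenRange> splits = new ArrayList<>();
        BigInteger width = max.subtract(min);
        BigInteger prev = min;
        for (int i = 1; i <= count; i++) {
            // The last boundary is pinned to max so rounding never drops tokens.
            BigInteger next = (i == count) ? max
                : min.add(width.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(count)));
            splits.add(new TokenRange(prev, next));
            prev = next;
        }
        return splits;
    }

    /**
     * Return up to `limit` keys whose tokens fall in the range. With an empty
     * startKey the scan begins just after `start`; otherwise it begins at
     * startKey itself (inclusive), so a caller paging with the last key seen
     * should drop the first result of each later page.
     */
    static List<String> readRange(NavigableMap<BigInteger, String> rows, String startKey,
                                  BigInteger start, BigInteger finish, int limit) {
        List<String> page = new ArrayList<>();
        BigInteger from = startKey.isEmpty() ? start : token(startKey);
        if (from.compareTo(finish) > 0) {
            return page; // start key lies beyond this split
        }
        for (String key : rows.subMap(from, !startKey.isEmpty(), finish, true).values()) {
            if (page.size() == limit) break;
            page.add(key);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableMap<BigInteger, String> rows = new TreeMap<>();
        for (char c = 'a'; c <= 'h'; c++) {
            rows.put(token(String.valueOf(c)), String.valueOf(c));
        }
        // One map task per split; each task pages through its own range.
        for (TokenRange r : getSplits(BigInteger.ZERO, BigInteger.valueOf(128), 4)) {
            System.out.println("(" + r.start + ", " + r.finish + "] first page: "
                + readRange(rows, "", r.start, r.finish, 3));
        }
    }
}
```

In this sketch each map task would own one TokenRange and call readRange repeatedly, passing the last key it saw as startKey, which is one way to stream a split without loading it all into memory.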

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-342:
-------------------------------

    Attachment: 0005-v5-jar-packaging.txt

Oops... it was late, and I was tired.

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.6
>
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt, 0005-v5-jar-packaging.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-342:
-------------------------------------

    Attachment: 0004-v5-sub-splits.txt
                0003-v5-make-predicate-configurable.txt
                0002-v5-add-wordcount-hadoop-example.txt
                0001-v5-add-basic-hadoop-support-one-split-per-node.txt

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0001-v4-add-basic-hadoop-support-one-split-per-node.txt, 0001-v5-add-basic-hadoop-support-one-split-per-node.txt, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0002-v4-add-wordcount-hadoop-example.txt, 0002-v5-add-wordcount-hadoop-example.txt, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 0003-v4-add-WordCountSetup-multiple-tests.txt, 0003-v5-make-predicate-configurable.txt, 0004-v5-sub-splits.txt
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744290#action_12744290 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

as a non-hadoop expert it would help a lot with review to unsquash.

what errors are you getting with g-j-a?  Us python guys can probably help with that :)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Updated: (CASSANDRA-342) hadoop integration

Posted by "Jeff Hodges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hodges updated CASSANDRA-342:
----------------------------------

    Attachment:     (was: 0014-CASSANDRA-342.-SystemTable.initMetadata-no-longer-NP.patch)

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-the-stupid-version-of-hadoop-support.patch, v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E



[jira] Commented: (CASSANDRA-342) hadoop integration

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799454#action_12799454 ] 

Jonathan Ellis commented on CASSANDRA-342:
------------------------------------------

To get around the hadoop-stuff-has-to-run-in-a-different-JVM problem: what if we had Hadoop operate on Cassandra snapshots?  For the kind of batch oriented, non-latency-sensitive work that Hadoop is a good fit for, that should be perfect: the Hadoop Task can open up ColumnFamilyStore objects on the snapshotted sstables, without having to start a full server which is nasty.

Otherwise IMO we should patch Hadoop to allow Tasks to run on an existing JVM.  I'm surprised HBase didn't do that: doing the copies of *all input* from one jvm to another is not insignificant.  (You could take that approach w/ cassandra too, using getRangeSlice from StorageProxy started with StorageService.initClient -- actually we would probably want to add initLocalClient to mean "I only plan to query the machine I am on" -- but that would be a case of working around bad design instead of fixing it.)


> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jeff Hodges
>         Attachments: 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch
>
>
> Some discussion on -dev: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E
