Posted to user@cassandra.apache.org by Mark <st...@gmail.com> on 2010/08/19 19:07:23 UTC

Cassandra w/ Hadoop

Are there any examples/tutorials on the web for reading from Cassandra 
into Hadoop and writing from Hadoop back to Cassandra?

I found the example in contrib/word_count but I really can't make sense 
of it... a tutorial/explanation would help.

Re: Cassandra w/ Hadoop

Posted by Mark <st...@gmail.com>.

  On 8/19/10 11:14 AM, Mark wrote:
> Thanks!
How does batching across all rows work? Does it just take an arbitrary 
start w/ a limit of x and then use the last key from that result as the 
next start? Does this work with RandomPartitioner?

Re: Cassandra w/ Hadoop

Posted by Mark <st...@gmail.com>.

Thanks!

Re: Cassandra w/ Hadoop

Posted by Mark <st...@gmail.com>.

  On 8/19/10 10:34 AM, Christian Decker wrote:
> If, like me, you prefer to write your jobs on the fly, try taking a 
> look at Pig. Cassandra provides a LoadFunc under contrib/pig/ in the 
> source package, which allows you to load data directly from Cassandra.
That's definitely an option and I'll probably lean towards that in the 
near future. I am just trying to get a complete understanding of the 
whole infrastructure before working with higher level features.

Also, the same problem exists... I need a nice tutorial :)

Re: Cassandra w/ Hadoop

Posted by Christian Decker <de...@gmail.com>.

If, like me, you prefer to write your jobs on the fly, try taking a look at
Pig. Cassandra provides a LoadFunc under contrib/pig/ in the source package,
which allows you to load data directly from Cassandra.
--
Christian Decker
Software Architect
http://blog.snyke.net



Re: Cassandra w/ Hadoop

Posted by Jeremy Hanna <je...@gmail.com>.

I would check out http://wiki.apache.org/cassandra/HadoopSupport for more info.  I'll try to explain a bit more here, but I don't think there's a tutorial out there yet.

For input:
- Configure the main class that starts your MapReduce job the same way word_count is configured (either with storage-conf or in your code via ConfigHelper).  It will complain specifically about anything you haven't configured; your Cassandra server and port are especially important.
- The inputs to your mapper are whatever comes from Cassandra: your key with a map of the row's values.
- You need to set your column name in the overridden setup method of your mapper.
- For the reducer, nothing really changes from a normal map/reduce job, unless you want to output to Cassandra.
- In general, Cassandra just provides an InputFormat and split classes for reading from Cassandra; you can find the guts in the org.apache.cassandra.hadoop package.

For output:
- In your reducer, you could just write to Cassandra directly via Thrift.  There is a built-in OutputFormat coming in 0.7, though it might still change before 0.7 final; it will queue up changes so that it writes large blocks all at once.
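
To make the shape of the input side a little more concrete, here is a rough sketch of the data flow described above: a mapper that picks up its column name in setup, receives a row key plus a sorted map of that row's columns, and a reducer that is an ordinary sum. It deliberately uses only plain JDK collections as stand-ins; it does not use the real Hadoop Mapper/Reducer classes, the Cassandra InputFormat, or ConfigHelper. The names WordCountSketch, TokenizerMapper, CONF_COLUMN_NAME and the "text" column are all made up for illustration.

```java
import java.util.*;

public class WordCountSketch {
    // Hypothetical configuration key that carries the column name to the mapper.
    static final String CONF_COLUMN_NAME = "columnname";

    /** Stand-in for a mapper: setup() runs once, map() runs once per row. */
    static class TokenizerMapper {
        private String columnName;

        // Mirrors "set your column name in your overridden setup method".
        void setup(Map<String, String> conf) {
            columnName = conf.get(CONF_COLUMN_NAME);
        }

        // Input is the row key plus a sorted map of column name -> value.
        // Emits (word, 1) for every token found in the configured column.
        void map(String rowKey, SortedMap<String, String> columns,
                 List<Map.Entry<String, Integer>> out) {
            String value = columns.get(columnName);
            if (value == null) {
                return; // this row does not have the column we care about
            }
            StringTokenizer itr = new StringTokenizer(value);
            while (itr.hasMoreTokens()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
            }
        }
    }

    /** Stand-in for a reducer: nothing Cassandra-specific, just sums per word. */
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> totals = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Drives the sketch: configure, set up the mapper, map every row, reduce. */
    static SortedMap<String, Integer> run(Map<String, SortedMap<String, String>> rows,
                                          String columnName) {
        Map<String, String> conf = new HashMap<String, String>();
        conf.put(CONF_COLUMN_NAME, columnName);

        TokenizerMapper mapper = new TokenizerMapper();
        mapper.setup(conf);

        List<Map.Entry<String, Integer>> emitted = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, SortedMap<String, String>> row : rows.entrySet()) {
            mapper.map(row.getKey(), row.getValue(), emitted);
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        Map<String, SortedMap<String, String>> rows =
                new LinkedHashMap<String, SortedMap<String, String>>();
        SortedMap<String, String> first = new TreeMap<String, String>();
        first.put("text", "hello cassandra");
        SortedMap<String, String> second = new TreeMap<String, String>();
        second.put("text", "hello hadoop");
        rows.put("key1", first);
        rows.put("key2", second);

        System.out.println(run(rows, "text")); // prints {cassandra=1, hadoop=1, hello=2}
    }
}
```

In a real job the loop in run() is what the framework does for you, and the rows come from Cassandra's input splits rather than an in-memory map; contrib/word_count remains the reference for the actual classes and signatures.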

