Posted to user@cassandra.apache.org by Jonathan Ellis <jb...@gmail.com> on 2010/05/17 21:46:27 UTC

Re: Hadoop over Cassandra

Moving to the user@ list.

http://wiki.apache.org/cassandra/HadoopSupport should be useful.

On Mon, May 17, 2010 at 2:41 PM, Yan Virin <ja...@gmail.com> wrote:
> Hi,
> Can someone explain how this works? As far as I know, there is no execution
> engine in Cassandra alone, so I assume Hadoop provides the MapReduce
> execution engine, using Cassandra as the distributed storage? Is data
> locality preserved? How mature is this "couple"? How does the performance
> compare to the original Hadoop over HDFS?
>
> Thanks,
>
>
> --
> Jan Virin
> http://www.linkedin.com/in/yanvirin
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: JMX metrics for monitoring

Posted by Jonathan Ellis <jb...@gmail.com>.
Here are the basics I discuss in Riptano's training classes:
http://github.com/jbellis/cassandra-munin-plugins

On Mon, May 17, 2010 at 3:02 PM, Maxim Kramarenko
<ma...@trackstudio.com> wrote:
> Hi!
>
> Which JMX metrics do you use for Cassandra monitoring? Which values can be
> used for alerts?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: JMX metrics for monitoring

Posted by Brandon Williams <dr...@gmail.com>.
On Tue, Nov 2, 2010 at 1:57 PM, hoivan <ih...@evidentsoftware.com> wrote:

>
> Just as an FYI, Evident ClearStone supports monitoring of Cassandra
> clusters in enterprise and EC2 deployments.
>

Sending just one of these emails will suffice in future.

-Brandon

Re: JMX metrics for monitoring

Posted by hoivan <ih...@evidentsoftware.com>.
Just as an FYI, Evident ClearStone supports monitoring of Cassandra clusters
in enterprise and EC2 deployments.

Evident ClearStone supports various NoSQL products (including Cassandra) and
distributed caching technologies. The product supports the aggregation of
metrics (from JMX) across all the nodes in the cluster. We also monitor JVM
stats like CPU, heap, and GC. The user interface is an Adobe Flex
application. We have also included some of the nodetool operations within the
product, so you can point and click on a node to run some of the nodetool
functions. There are many more features in the product, such as alerting and
historical reporting.

You can find more information about Evident ClearStone here:
http://www.evidentsoftware.com/products/clearstone-for-cassandra

Free downloads are available here: http://www.evidentsoftware.com/download




Re: JMX metrics for monitoring

Posted by Ran Tavory <ra...@gmail.com>.
There are many, but here's what I found useful so far:
Per CF you have:
- Recent read/write latency
- PendingTasks
- Read/Write count

Globally you have, for each of the stages
(e.g. org.apache.cassandra.concurrent:type=ROW-READ-STAGE):
- PendingTasks
- ActiveCount

... and as you go you'll find more
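As a rough illustration, here is how values like those might be turned into alerts once they have been fetched from JMX (a Python sketch; the metric names echo the attributes mentioned above, the fetch step is left out, and the thresholds are arbitrary placeholders, not recommended values):

```python
# Hypothetical sketch: turning the JMX values above into simple alerts.
# Metric names mirror the attributes mentioned in this thread; thresholds
# are placeholders, not Cassandra defaults.

def check_metrics(metrics, latency_ms_max=100.0, pending_max=50):
    """Return a list of alert strings for values past their thresholds."""
    alerts = []
    for name, value in metrics.items():
        if name.endswith("ReadLatencyMicros") or name.endswith("WriteLatencyMicros"):
            # latencies reported in microseconds; compare in milliseconds
            if value / 1000.0 > latency_ms_max:
                alerts.append("%s: %.1f ms" % (name, value / 1000.0))
        elif name.endswith("PendingTasks"):
            if value > pending_max:
                alerts.append("%s: %d pending" % (name, value))
    return alerts

# Example with made-up values:
sample = {
    "Standard1.RecentReadLatencyMicros": 250000,   # 250 ms -> alert
    "Standard1.RecentWriteLatencyMicros": 900,     # 0.9 ms -> fine
    "ROW-READ-STAGE.PendingTasks": 120,            # backlog -> alert
}
print(check_metrics(sample))
```

The point is simply that per-CF latencies and per-stage PendingTasks are the natural alerting inputs; what counts as "too high" depends entirely on your workload.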

On Tue, May 18, 2010 at 1:02 AM, Maxim Kramarenko
<ma...@trackstudio.com>wrote:

> Hi!
>
> Which JMX metrics do you use for Cassandra monitoring? Which values can be
> used for alerts?
>

JMX metrics for monitoring

Posted by Maxim Kramarenko <ma...@trackstudio.com>.
Hi!

Which JMX metrics do you use for Cassandra monitoring? Which values can
be used for alerts?

Re: Hadoop over Cassandra

Posted by Ben Browning <be...@gmail.com>.
Maxim,

Check out the getLocation() method from this file:

http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java

Basically, it loops over the list of nodes containing this split of
data; if any of them is the local node, it returns that one. Otherwise
it returns the first node that contains the data.
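In rough terms, that selection looks like this (a simplified Python sketch; the real logic lives in the Java file linked above, and the function name here is made up):

```python
import socket

def get_preferred_location(split_replicas, local_names=None):
    """Pick the replica a map task should read from: prefer the local
    node, as getLocation() does (simplified sketch, not the Java code)."""
    if local_names is None:
        # names this machine answers to
        local_names = {socket.gethostname(), "localhost", "127.0.0.1"}
    for host in split_replicas:
        if host in local_names:
            return host  # data is local: read from ourselves
    return split_replicas[0]  # no local replica: fall back to the first

print(get_preferred_location(["10.0.0.2", "10.0.0.3"],
                             local_names={"10.0.0.3"}))  # -> 10.0.0.3
```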

The code that creates the splits of data and figures out which node
each split is located on is here:

http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java


Ben

On Tue, May 18, 2010 at 3:42 AM, Maxim Grinev <ma...@grinev.net> wrote:
>
> On Tue, May 18, 2010 at 2:23 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> On Mon, May 17, 2010 at 4:12 PM, Vick Khera <vi...@khera.org> wrote:
>> > On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >> Moving to the user@ list.
>> >>
>> >> http://wiki.apache.org/cassandra/HadoopSupport should be useful.
>> >
>> > That document doesn't really answer the "is data locality preserved"
>> > question for the map phase, but my hunch is "no".
>>
>> The answer is, "yes, as long as you have hadoop on all the cassandra
>> machines." (the case where it's easy to map cassandra locality to
>> hadoop locality :)
>
> Jonathan,
> could you please clarify this. I also cannot understand how it works. Even
> if Hadoop is deployed on all the Cassandra machines, how will Hadoop be
> aware of Cassandra's data placement (partitioning and replication)?
> Maxim
>
>

Re: Hadoop over Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, May 18, 2010 at 9:40 PM, Mark Schnitzius
<ma...@cxense.com> wrote:
>> If anyone has "war stories" on the topic of Cassandra & Hadoop (or
>> even just Hadoop in general) let me know.
>
> Don't know if it counts as a war story, but I was successful recently in
> implementing something I got advice on in an earlier thread, namely feeding
> both a Cassandra table and a Hadoop sequence file into the same map/reduce
> process and updating the same Cassandra table with the results.  I used the
> approach I mentioned before, of creating an InputFormat that returns splits
> from both (and creating a RecordReader that massages the Cass data into the
> same format as the sequence file data).  I'll write something up about it
> for the wiki, when I can find some time.

That would be very interesting; I hope you can get around to it!

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Hadoop over Cassandra

Posted by Mark Schnitzius <ma...@cxense.com>.
>
> If anyone has "war stories" on the topic of Cassandra & Hadoop (or
> even just Hadoop in general) let me know.



Don't know if it counts as a war story, but I was successful recently in
implementing something I got advice on in an earlier thread, namely feeding
both a Cassandra table and a Hadoop sequence file into the same map/reduce
process and updating the same Cassandra table with the results.  I used the
approach I mentioned before, of creating an InputFormat that returns splits
from both (and creating a RecordReader that massages the Cass data into the
same format as the sequence file data).  I'll write something up about it
for the wiki, when I can find some time.
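The shape of that combined input might be sketched like this (Python standing in for the Hadoop Java API; the class name, record shapes, and normalization here are invented for illustration, not Mark's actual code):

```python
# Hedged sketch of the idea described above: one input that unions
# splits from a Cassandra source and a sequence-file source, with a
# reader that normalizes both into the same (key, value) shape.

class CombinedInput:
    def __init__(self, cassandra_source, seqfile_source):
        self.sources = [("cassandra", cassandra_source),
                        ("seqfile", seqfile_source)]

    def get_splits(self):
        # Union of splits, tagged with their origin so the reader
        # knows which normalization to apply.
        return [(kind, split)
                for kind, source in self.sources
                for split in source]

    def read(self, kind, split):
        # Massage Cassandra rows (key -> columns) into the flat
        # (key, value) records the sequence file already uses.
        if kind == "cassandra":
            key, columns = split
            return [(key, v) for v in columns.values()]
        return list(split)  # sequence-file records pass through as-is

cassandra_rows = [("row1", {"colA": 1, "colB": 2})]
seqfile_records = [[("k1", 10)]]
combined = CombinedInput(cassandra_rows, seqfile_records)
records = [r for kind, s in combined.get_splits()
           for r in combined.read(kind, s)]
print(records)
```

Both sources then feed identical records into the same map function, which is what makes the single map/reduce pass possible.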

My chief concern with it, though, is gracefully handling a map/reduce
failure.  As Cassandra isn't transactional, the table may end up partially
updated, which is a problem, at least in the domain I'm working in.  So now
I'm trying to come up with a way to effect Cassandra transactions via column
naming conventions or indexes or something like that.  I'd be curious to
hear if anyone here has ever implemented a solution for something similar
before...


Thanks
Mark

Re: Hadoop over Cassandra

Posted by Joseph Stein <cr...@gmail.com>.
If anyone is interested there is a great talk from Jonathan Ellis on
the topic of Hadoop & Cassandra (interviewed yesterday)
http://wp.me/pTu1i-40

I never knew that Pig was supported and I must say it is pretty kewl
that you can run Pig scripts against your Cassandra data.

It is a podcast so grab your headphones and enjoy.

If anyone has "war stories" on the topic of Cassandra & Hadoop (or
even just Hadoop in general) let me know.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


On Tue, May 18, 2010 at 1:51 PM, Stu Hood <st...@rackspace.com> wrote:
> The Hadoop integration (as demonstrated by contrib/word_count) is locality aware: it begins by querying Cassandra to generate locality aware splits, and when the hostnames match up between the Hadoop and Cassandra clusters, the data can be mapped locally.
>
> -----Original Message-----
> From: "Maxim Grinev" <ma...@grinev.net>
> Sent: Tuesday, May 18, 2010 2:42am
> To: user@cassandra.apache.org
> Subject: Re: Hadoop over Cassandra
>
> On Tue, May 18, 2010 at 2:23 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> On Mon, May 17, 2010 at 4:12 PM, Vick Khera <vi...@khera.org> wrote:
>> > On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>> >> Moving to the user@ list.
>> >>
>> >> http://wiki.apache.org/cassandra/HadoopSupport should be useful.
>> >
>> > That document doesn't really answer the "is data locality preserved"
>> > question for the map phase, but my hunch is "no".
>>
>> The answer is, "yes, as long as you have hadoop on all the cassandra
>> machines." (the case where it's easy to map cassandra locality to
>> hadoop locality :)
>
>
> Jonathan,
>
> could you please clarify this. I also cannot understand how it works. Even
> if Hadoop is deployed on all the Cassandra machines, how will Hadoop be
> aware of Cassandra's data placement (partitioning and replication)?
>
> Maxim
>
>
>

Re: Hadoop over Cassandra

Posted by Stu Hood <st...@rackspace.com>.
The Hadoop integration (as demonstrated by contrib/word_count) is locality aware: it begins by querying Cassandra to generate locality aware splits, and when the hostnames match up between the Hadoop and Cassandra clusters, the data can be mapped locally.
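A toy sketch of that hostname matching, in Python rather than the actual Java/Hadoop scheduler (all names below are illustrative):

```python
# Each split carries the hostnames of the Cassandra replicas owning its
# token range; a map task is "local" when the tasktracker's hostname is
# among them. Greedy illustration only, not Hadoop's real scheduler.

def is_local(split_replica_hosts, tasktracker_host):
    return tasktracker_host in split_replica_hosts

def assign(splits, tasktracker_hosts):
    """Give each split to a local tasktracker when one exists,
    otherwise to the first tracker (a remote read)."""
    plan = {}
    for split_id, replicas in splits.items():
        local = [h for h in tasktracker_hosts if is_local(replicas, h)]
        plan[split_id] = local[0] if local else tasktracker_hosts[0]
    return plan

splits = {"range-00": ["cass1", "cass2"],
          "range-01": ["cass3", "cass4"]}
print(assign(splits, ["cass2", "cass3"]))
```

This is why running the Hadoop daemons on the Cassandra nodes themselves matters: only then do the replica hostnames and tasktracker hostnames ever match.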

-----Original Message-----
From: "Maxim Grinev" <ma...@grinev.net>
Sent: Tuesday, May 18, 2010 2:42am
To: user@cassandra.apache.org
Subject: Re: Hadoop over Cassandra

On Tue, May 18, 2010 at 2:23 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Mon, May 17, 2010 at 4:12 PM, Vick Khera <vi...@khera.org> wrote:
> > On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >> Moving to the user@ list.
> >>
> >> http://wiki.apache.org/cassandra/HadoopSupport should be useful.
> >
> > That document doesn't really answer the "is data locality preserved"
> > question for the map phase, but my hunch is "no".
>
> The answer is, "yes, as long as you have hadoop on all the cassandra
> machines." (the case where it's easy to map cassandra locality to
> hadoop locality :)


Jonathan,

could you please clarify this. I also cannot understand how it works. Even
if Hadoop is deployed on all the Cassandra machines, how will Hadoop be
aware of Cassandra's data placement (partitioning and replication)?

Maxim



Re: Hadoop over Cassandra

Posted by Maxim Grinev <ma...@grinev.net>.
On Tue, May 18, 2010 at 2:23 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Mon, May 17, 2010 at 4:12 PM, Vick Khera <vi...@khera.org> wrote:
> > On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >> Moving to the user@ list.
> >>
> >> http://wiki.apache.org/cassandra/HadoopSupport should be useful.
> >
> > That document doesn't really answer the "is data locality preserved"
> > question for the map phase, but my hunch is "no".
>
> The answer is, "yes, as long as you have hadoop on all the cassandra
> machines." (the case where it's easy to map cassandra locality to
> hadoop locality :)


Jonathan,

could you please clarify this. I also cannot understand how it works. Even
if Hadoop is deployed on all the Cassandra machines, how will Hadoop be
aware of Cassandra's data placement (partitioning and replication)?

Maxim

Re: Hadoop over Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, May 17, 2010 at 4:12 PM, Vick Khera <vi...@khera.org> wrote:
> On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> Moving to the user@ list.
>>
>> http://wiki.apache.org/cassandra/HadoopSupport should be useful.
>
> That document doesn't really answer the "is data locality preserved"
> question for the map phase, but my hunch is "no".

The answer is, "yes, as long as you have hadoop on all the cassandra
machines." (the case where it's easy to map cassandra locality to
hadoop locality :)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Hadoop over Cassandra

Posted by Vick Khera <vi...@khera.org>.
On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> Moving to the user@ list.
>
> http://wiki.apache.org/cassandra/HadoopSupport should be useful.

That document doesn't really answer the "is data locality preserved"
question for the map phase, but my hunch is "no".

>
> On Mon, May 17, 2010 at 2:41 PM, Yan Virin <ja...@gmail.com> wrote:
>> Hi,
>> Can someone explain how this works? As far as I know, there is no execution
>> engine in Cassandra alone, so I assume Hadoop provides the MapReduce
>> execution engine, using Cassandra as the distributed storage? Is data
>> locality preserved? How mature is this "couple"? How does the performance
>> compare to the original Hadoop over HDFS?

The built-in execution engine is one thing that excites me about the
Riak data store -- the work is done locally to where the data is.
That and you can specify your jobs in javascript, making it that much
easier for web-oriented people :-)  The big drawback for Riak is that
building it for FreeBSD is pretty much impossible.