Posted to user@cassandra.apache.org by Or Yanay <or...@peer39.com> on 2011/03/14 16:06:05 UTC
Map-Reduce on top of cassandra
Hi All,
I am trying to write some map-reduce tasks so I can answer questions like: how many records have status X?
I am using Cassandra 0.7.0 and have 5 nodes with ~100G of data on each node.
I have written the code based on the word_count example, and the map-reduce job runs successfully BUT is extremely slow (about 2 hours for the simplest key count).
I am now looking to track down the slowness and tune my process, or explore alternative ways to achieve the same goal.
Can anyone point me to a way to tune my map-reduce job?
Does anyone have experience exploring Cassandra data with the Hadoop cluster configuration (as suggested in http://wiki.apache.org/cassandra/HadoopSupport#ClusterConfig)?
Thanks,
Orr
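For readers following the thread: the job described above (counting records per status value) has the same shape as the word_count example. The map step emits a (status, 1) pair per row and the reduce step sums the pairs per status. Below is a minimal pure-Python sketch of that shape. It is not Hadoop or Cassandra API code, and the row layout, the "status" column name, and the function names are assumptions made up for illustration.

```python
from collections import defaultdict


def map_row(row_key, columns):
    """Map step: emit (status, 1) for one row; rows with no status emit nothing."""
    status = columns.get("status")  # "status" is a hypothetical column name
    if status is not None:
        yield status, 1


def reduce_counts(status, counts):
    """Reduce step: sum all the 1s emitted for a single status value."""
    return status, sum(counts)


def run_job(rows):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    grouped = defaultdict(list)
    for row_key, columns in rows.items():
        for status, one in map_row(row_key, columns):
            grouped[status].append(one)
    return dict(reduce_counts(status, counts) for status, counts in grouped.items())


# Hypothetical sample rows: row key -> {column name: value}
sample_rows = {
    "row1": {"status": "active"},
    "row2": {"status": "closed"},
    "row3": {"status": "active"},
    "row4": {"owner": "nobody"},  # no status column, so it is not counted
}

result = run_job(sample_rows)
print(result)
```

In a real job the grouping between map and reduce is done by the framework across machines, which is where data placement starts to matter.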
RE: Map-Reduce on top of cassandra
Posted by Or Yanay <or...@peer39.com>.
Worked like a charm!
I installed Hadoop on my Cassandra nodes and ran the MR job through the Hadoop job tracker.
A simple key count improved from ~2 hours to about 25 minutes (150M keys and ~100G on each node).
Thanks Jeremy.
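To put the numbers above in perspective, here is a small back-of-the-envelope calculation in Python. It assumes the 150M keys and ~100G figures are both per node, treats "~100G" as 100 * 10^9 bytes, and takes the reported run times at face value, so the results are rough estimates only. The function and variable names are made up for this sketch.

```python
def per_node_rates(keys, data_bytes, seconds):
    """Return (keys per second, bytes per second) for one node."""
    return keys / seconds, data_bytes / seconds


KEYS_PER_NODE = 150_000_000      # assumption: 150M keys is a per-node figure
BYTES_PER_NODE = 100 * 10**9     # assumption: "~100G" read as 100 * 10^9 bytes

BEFORE_SECONDS = 2 * 60 * 60     # ~2 hours, before task trackers ran on the nodes
AFTER_SECONDS = 25 * 60          # ~25 minutes, with task trackers on the nodes

before = per_node_rates(KEYS_PER_NODE, BYTES_PER_NODE, BEFORE_SECONDS)
after = per_node_rates(KEYS_PER_NODE, BYTES_PER_NODE, AFTER_SECONDS)
speedup = BEFORE_SECONDS / AFTER_SECONDS

print(f"before: {before[0]:,.0f} keys/s, {before[1] / 1e6:.1f} MB/s per node")
print(f"after:  {after[0]:,.0f} keys/s, {after[1] / 1e6:.1f} MB/s per node")
print(f"speedup: {speedup:.1f}x")
```

Under those assumptions the job went from roughly 21,000 to 100,000 keys per second per node, a speedup of about 4.8x.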
-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1234@gmail.com]
Sent: Monday, March 14, 2011 8:42 PM
To: user@cassandra.apache.org
Subject: Re: Map-Reduce on top of cassandra
Just for the sake of updating this thread - Orr didn't yet have task trackers on the Cassandra nodes so most of the time was likely due to copying the ~100G of data to the hadoop cluster prior to processing. They're going to try after installing task trackers on the nodes.
Re: Map-Reduce on top of cassandra
Posted by Jeremy Hanna <je...@gmail.com>.
Just to update this thread: Orr didn't yet have task trackers on the Cassandra nodes, so most of the time was likely spent copying the ~100G of data to the Hadoop cluster before processing. They're going to try again after installing task trackers on the nodes.
Re: Map-Reduce on top of cassandra
Posted by Jeremy Hanna <je...@gmail.com>.
Can you go into the #cassandra channel and ask your question? See if jeromatron or driftx are around. That way there can be a back and forth about settings and things.
http://webchat.freenode.net/?channels=#cassandra