Posted to user@cassandra.apache.org by Jeremy Hanna <je...@gmail.com> on 2010/05/25 18:35:20 UTC

Anyone using hadoop/MapReduce integration currently?

I'll be doing a presentation on Cassandra's (0.6+) hadoop integration next week. Is anyone currently using MapReduce or the initial Pig integration?

(If you're unaware of such integration, see http://wiki.apache.org/cassandra/HadoopSupport)

If so, could you post to this thread on how you're using it or planning on using it (if not covered by the shroud of secrecy)?

e.g.
What is the use case?

Why are you using Cassandra versus using data stored in HDFS or HBase?

Are you using a separate Hadoop cluster to run the MR jobs on, or perhaps are you running the Job Tracker and Task Trackers on Cassandra nodes?

Is there anything holding you back from using it (if you would like to use it but currently cannot)?

Thanks!

Re: Anyone using hadoop/MapReduce integration currently?

Posted by 朱蓝天 <bs...@gmail.com>.
2010/5/26 Utku Can Topçu <ut...@topcu.gen.tr>

> Hi Jeremy,
>
>
> > Why are you using Cassandra versus using data stored in HDFS or HBase?
> - I'm thinking of using it for realtime streaming of user data. While
> streaming the requests, I'm also using Lucandra to index the data in
> realtime. Compared with HBase or native HDFS flat files, it's a better
> option because of its low write latency.


I'm interested in realtime indexing with Lucandra. But how do you intersect
the posting lists of multiple terms with Cassandra? If the lists have to be
pulled over the network, I think it is very inefficient.
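
To make the question concrete: intersecting posting lists means finding the document ids that appear under every query term. Below is a minimal sketch in plain Python, assuming each term's posting list has already been fetched to the client as a sorted list of ids. It is not how Lucandra does it; the function names and the sample data are hypothetical.

```python
def intersect_postings(a, b):
    """Intersect two sorted lists of document ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(postings):
    """Intersect several posting lists, shortest first to keep intermediates small."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect_postings(result, plist)
        if not result:
            break  # nothing can match any more
    return result


# Hypothetical posting lists, as if each had been fetched for one term.
postings_by_term = {
    "cassandra": [1, 3, 5, 8, 13],
    "hadoop": [3, 4, 5, 13, 21],
    "pig": [2, 3, 13],
}
matches = intersect_all(list(postings_by_term.values()))
```

The merge itself is cheap; the cost the question points at is that every term's full list has to cross the network before the merge can start.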

>
>
> > Is there anything holding you back from using it (if you would like to
> use it but currently cannot)?
>
> My answer to this would be:
> - The current integration only supports using the whole range of the CF
> as input for the map phase; it would be much better if the InputFormat
> supported a KeyRange.
>
> Best Regards,
> Utku

Re: Anyone using hadoop/MapReduce integration currently?

Posted by Utku Can Topçu <ut...@topcu.gen.tr>.
Hi Jeremy,

> Why are you using Cassandra versus using data stored in HDFS or HBase?
- I'm thinking of using it for realtime streaming of user data. While
streaming the requests, I'm also using Lucandra to index the data in
realtime. Compared with HBase or native HDFS flat files, it's a better
option because of its low write latency.

> Is there anything holding you back from using it (if you would like to use it but currently cannot)?

My answer to this would be:
- The current integration only supports using the whole range of the CF
as input for the map phase; it would be much better if the InputFormat
supported a KeyRange.
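
Until the InputFormat accepts a KeyRange, one possible workaround is to filter keys inside the map function. The sketch below simulates that in plain Python; `in_key_range`, `map_rows`, and the sample rows are hypothetical and are not part of the Cassandra Hadoop API. Keys are compared as plain strings here, which is an assumption of the sketch.

```python
def in_key_range(key, start_key, end_key):
    """True if start_key <= key <= end_key; an empty bound means unbounded."""
    if start_key and key < start_key:
        return False
    if end_key and key > end_key:
        return False
    return True


def map_rows(rows, start_key="", end_key=""):
    """Simulated map phase: every row of the CF is read, out-of-range rows are skipped."""
    for key, columns in rows:
        if not in_key_range(key, start_key, end_key):
            continue  # the row was still read, it is just not processed
        yield key, len(columns)


# Hypothetical rows standing in for a whole column family.
rows = [
    ("user:001", {"name": "a"}),
    ("user:105", {"name": "b", "city": "c"}),
    ("user:250", {"name": "d"}),
]
selected = list(map_rows(rows, start_key="user:100", end_key="user:200"))
```

Every row is still handed to the mapper, so this saves map work but not I/O, which is why pushing the range down into the InputFormat would be the better fix.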

Best Regards,
Utku


Re: Anyone using hadoop/MapReduce integration currently?

Posted by Jeremy Hanna <je...@gmail.com>.
>> Is there anything holding you back from using it (if you would like to use it but currently cannot)?
> 
> It would be nice if the output of the MapReduce job could go through a
> MutationOutputFormat to which we could write inserts/deletes. I recall
> there is already something on JIRA for this, although I'm not sure
> whether it was merged.


Yep - that sounds like CASSANDRA-1101 - https://issues.apache.org/jira/browse/CASSANDRA-1101

Looks like it's being considered for 0.6.x, probably 0.6.3+.

Re: Anyone using hadoop/MapReduce integration currently?

Posted by gabriele renzi <rf...@gmail.com>.
On Tue, May 25, 2010 at 6:35 PM, Jeremy Hanna
<je...@gmail.com> wrote:


> What is the use case?

We sometimes end up with messed-up data in the database, so from time to
time we run a MapReduce job to find the irregular data.


> Why are you using Cassandra versus using data stored in HDFS or HBase?

As of now our MapReduce task is only used for "fixing" Cassandra, so
the question doesn't really apply :)


> Are you using a separate Hadoop cluster to run the MR jobs on, or perhaps are you running the Job Tracker and Task Trackers on Cassandra nodes?

separate

> Is there anything holding you back from using it (if you would like to use it but currently cannot)?

It would be nice if the output of the MapReduce job could go through a
MutationOutputFormat to which we could write inserts/deletes. I recall
there is already something on JIRA for this, although I'm not sure
whether it was merged.
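
As a rough illustration of what "writing insert/delete" as job output could look like, here is a sketch in plain Python. The `Mutation` record, the `repair_row` function, the repair rules (drop empty columns, fill in a missing required column), and the sample rows are all hypothetical; none of this is a Cassandra or Hadoop type or API.

```python
from collections import namedtuple

# Hypothetical mutation record; not a Cassandra or Hadoop type.
Mutation = namedtuple("Mutation", ["op", "key", "column", "value"])


def repair_row(key, columns, required=("name",)):
    """Yield mutations that would fix one row under the assumed rules."""
    # Assumed rule 1: a column with an empty value is irregular and gets deleted.
    for column, value in sorted(columns.items()):
        if value == "":
            yield Mutation("delete", key, column, None)
    # Assumed rule 2: a missing required column gets a placeholder inserted.
    for column in required:
        if column not in columns:
            yield Mutation("insert", key, column, "unknown")


# Hypothetical rows standing in for the data the job scans.
rows = {
    "row1": {"name": "alice", "email": ""},
    "row2": {"email": "bob@example.com"},
    "row3": {"name": "carol"},
}
mutations = [m for key in sorted(rows) for m in repair_row(key, rows[key])]
```

With an output format of this kind, the job could apply the fixes itself instead of only reporting the irregular rows.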