Posted to user@cassandra.apache.org by Utku Can Topçu <ut...@topcu.gen.tr> on 2010/04/29 18:14:20 UTC

ColumnFamilyInputFormat KeyRange scans on a CF

Hi,

I've been trying to use Cassandra as a supplementary input source for Hadoop
MapReduce jobs.

By default, ColumnFamilyInputFormat scans the full column family and feeds it
to the MapReduce framework as map input.

However, I believe it should be possible to supply a key range so that only the
specified range is scanned.

Is this possible by any means?

Best Regards,

Utku

Re: ColumnFamilyInputFormat KeyRange scans on a CF

Posted by Utku Can Topçu <ut...@topcu.gen.tr>.
I meant in the first sentence "running the get_range_slices from a single
point"

Re: ColumnFamilyInputFormat KeyRange scans on a CF

Posted by Utku Can Topçu <ut...@topcu.gen.tr>.
Do you mean, running the get_range_slices from a single? Yes, that would be
reasonable for a relatively small key range. But when it comes to analyzing a
really big range in a really big data collection (like the one we
currently populate), distributing the reads across the cluster seems to be
the only reasonable solution.

As things stand, the best option might be to split the range across
ColumnFamilies (say, one CF per day) and to empty each CF after its data has
been analyzed, so that it can be reused for another day.

Can you suggest a workaround for this?
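
A minimal sketch of the per-day split mentioned above, assuming the
timestamp-prefixed key layout described elsewhere in this thread
(2009.01.01.00.00.00.RANDOM). The function name daily_key_ranges and the
KEY_FORMAT constant are hypothetical; each yielded pair could be handed to a
separate worker, or mapped to a separate CF, so that no single reader covers
the whole period.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

# Assumed key prefix layout, taken from the example key in the thread.
KEY_FORMAT = "%Y.%m.%d.%H.%M.%S"


def daily_key_ranges(start: datetime, end: datetime) -> Iterator[Tuple[str, str]]:
    """Yield (start_key, end_key) pairs, one per calendar day,
    that together cover start <= timestamp < end."""
    cursor = start
    while cursor < end:
        # Midnight at the start of the day after `cursor`.
        next_midnight = datetime(cursor.year, cursor.month, cursor.day) + timedelta(days=1)
        upper = min(next_midnight, end)
        yield cursor.strftime(KEY_FORMAT), upper.strftime(KEY_FORMAT)
        cursor = upper
```

Because the sub-ranges are contiguous and do not overlap, the results of the
separate scans can be combined without deduplication.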

Re: ColumnFamilyInputFormat KeyRange scans on a CF

Posted by Jonathan Ellis <jb...@gmail.com>.
Sounds like doing this without MapReduce, using get_range_slices, is a
reasonable way to go.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
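
A minimal sketch of what a client-side range scan without MapReduce could
look like. It does not call the real Thrift API: fetch stands in for a
range call such as get_range_slices and is assumed to return up to count
keys in key order, with both bounds inclusive. The names scan_range and
make_in_memory_fetch are hypothetical, and the in-memory stand-in exists only
so the paging loop can be run as-is.

```python
from bisect import bisect_left
from typing import Callable, Iterator, List

# Hypothetical signature: up to `count` keys k with start <= k <= end, in key order.
FetchRange = Callable[[str, str, int], List[str]]


def scan_range(fetch: FetchRange, start: str, end: str,
               page_size: int = 100) -> Iterator[str]:
    """Yield every key in [start, end], fetching at most page_size keys per call."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    first_page = True
    while True:
        page = fetch(start, end, page_size)
        # The start bound is inclusive, so each later page repeats the
        # last key of the previous page; drop that repeated key.
        fresh = page if first_page else page[1:]
        yield from fresh
        if len(page) < page_size:
            return  # a short page means the range is exhausted
        start = page[-1]
        first_page = False


def make_in_memory_fetch(keys: List[str]) -> FetchRange:
    """Stand-in for a real client call, backed by a sorted list of keys."""
    ordered = sorted(keys)

    def fetch(start: str, end: str, count: int) -> List[str]:
        lo = bisect_left(ordered, start)
        out: List[str] = []
        for key in ordered[lo:]:
            if key > end or len(out) == count:
                break
            out.append(key)
        return out

    return fetch
```

Paging like this only walks keys in a meaningful order if the cluster stores
them in key order (an order-preserving partitioner); that is an assumption of
the sketch, not something stated in the thread.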

Re: ColumnFamilyInputFormat KeyRange scans on a CF

Posted by Utku Can Topçu <ut...@topcu.gen.tr>.
I'm currently writing collected data continuously to Cassandra, using keys
that start with a timestamp followed by a unique identifier (like
2009.01.01.00.00.00.RANDOM) so that I can query by time range.

I'm thinking of running periodic MapReduce jobs that go through a
designated time period. For example, I might want to analyze only the data
between 2009.01 and 2009.02.
I did this previously with HBase; however, I thought Cassandra would be a
better choice for continuously storing data in a safe manner.

I guess this briefly explains my intended use case.

Best Regards,
Utku
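
A minimal sketch of the key scheme described in this message, assuming the
timestamp prefix is zero-padded exactly as in the example key so that string
order matches chronological order. The names make_key, range_bounds and
in_range are hypothetical, and the UUID suffix is only a stand-in for the
RANDOM part of the key.

```python
import uuid
from datetime import datetime
from typing import Optional, Tuple

# Assumed layout of the timestamp prefix, e.g. 2009.01.01.00.00.00
KEY_FORMAT = "%Y.%m.%d.%H.%M.%S"


def make_key(ts: datetime, suffix: Optional[str] = None) -> str:
    """Build a row key: zero-padded timestamp, a dot, then a unique suffix."""
    if suffix is None:
        suffix = uuid.uuid4().hex  # stand-in for the RANDOM part
    return f"{ts.strftime(KEY_FORMAT)}.{suffix}"


def range_bounds(start: datetime, end: datetime) -> Tuple[str, str]:
    """Return (start_key, end_key) bounding all keys with start <= ts < end."""
    return start.strftime(KEY_FORMAT), end.strftime(KEY_FORMAT)


def in_range(key: str, start_key: str, end_key: str) -> bool:
    """Plain string comparison is enough because the prefix is zero-padded."""
    return start_key <= key < end_key
```

For the period in the message, range_bounds(datetime(2009, 1, 1),
datetime(2009, 2, 1)) gives the two bare timestamp prefixes; every key written
during January 2009 sorts between them.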

Re: ColumnFamilyInputFormat KeyRange scans on a CF

Posted by Jonathan Ellis <jb...@gmail.com>.
It's technically possible but 0.6 does not support this, no.

What is the use case you are thinking of?

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com