Posted to user@cassandra.apache.org by Renat Gilfanov <gr...@mail.ru> on 2013/09/12 07:58:09 UTC

Re[2]: Cassandra input paging for Hadoop

 Hello,

So does it mean that the job will process only the first "cassandra.input.page.row.size" rows and ignore the rest? Or does CqlPagingRecordReader support paging through the entire result set?
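
For context, the input side of the job is wired up roughly as in the sketch below (the keyspace, table, address and page size are placeholders, not our real values):

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class InputSetup {
    // All names, addresses and sizes are placeholder values.
    static Job configureInput() throws Exception {
        Job job = new Job(new Configuration(), "example-input-job");
        Configuration conf = job.getConfiguration();

        job.setInputFormatClass(CqlPagingInputFormat.class);
        ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");

        // The setting in question: rows fetched per page, i.e. the LIMIT on each
        // SELECT the reader generates; equivalent to
        // conf.set("cassandra.input.page.row.size", "1000").
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");
        return job;
    }
}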


  Aaron Morton <aa...@thelastpickle.com>:
>>>
>>>I'm looking at the ConfigHelper.setRangeBatchSize() and
>>>CqlConfigHelper.setInputCQLPageRowSize() methods, but I'm a bit confused
>>>whether that's what I need and, if so, which one I should use for this purpose.
>If you are using CQL 3 via Hadoop, CqlConfigHelper.setInputCQLPageRowSize is the one you want.
>
>It maps to the LIMIT clause of the SELECT statement the input reader will generate; the default is 1,000.
>
>A
> 
>-----------------
>Aaron Morton
>New Zealand
>@aaronmorton
>
>Co-Founder & Principal Consultant
>Apache Cassandra Consulting
>http://www.thelastpickle.com
>
>On 12/09/2013, at 9:04 AM, Jiaan Zeng < l.allen09@gmail.com > wrote:
>>Speaking of thrift client, i.e. ColumnFamilyInputFormat, yes,
>>ConfigHelper.setRangeBatchSize() can reduce the number of rows sent to
>>Cassandra.
>>
>>Depending on how big your columns are, you may also want to increase the thrift
>>message length through setThriftMaxMessageLengthInMb().
>>
>>Hope that helps.
>>
>>On Tue, Sep 10, 2013 at 8:18 PM, Renat Gilfanov < grennat@mail.ru > wrote:
>>>Hi,
>>>
>>>We have Hadoop jobs that read data from our Cassandra column families and
>>>write some data back to other column families.
>>>The input column families are pretty simple CQL3 tables without wide rows.
>>>In the Hadoop jobs we set up a corresponding WHERE clause in
>>>ConfigHelper.setInputWhereClauses(...), so we don't process the whole table
>>>at once.
>>>Nevertheless, sometimes the amount of data returned by the input query is big
>>>enough to cause TimedOutExceptions.
>>>
>>>To mitigate this, I'd like to configure the Hadoop job in such a way that it
>>>sequentially fetches input rows in smaller portions.
>>>
>>>I'm looking at the ConfigHelper.setRangeBatchSize() and
>>>CqlConfigHelper.setInputCQLPageRowSize() methods, but I'm a bit confused
>>>whether that's what I need and, if so, which one I should use for this purpose.
>>>
>>>Any help is appreciated.
>>>
>>>Hadoop version is 1.1.2, Cassandra version is 1.2.8.
>>
>>
>>
>>-- 
>>Regards,
>>Jiaan
>
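
For completeness, the Thrift-based ColumnFamilyInputFormat route Jiaan describes above has its own knobs; a rough sketch, assuming the 1.2.x ConfigHelper API and using example values only:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ThriftInputTuning {
    // Example values only; tune them to your row and column sizes.
    static void configure(Job job) {
        Configuration conf = job.getConfiguration();
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Request fewer rows per range-slice call so each request stays small.
        ConfigHelper.setRangeBatchSize(conf, 1024);

        // Allow a bigger Thrift frame if individual columns are large.
        ConfigHelper.setThriftMaxMessageLengthInMb(conf, 32);
    }
}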

Re: Cassandra input paging for Hadoop

Posted by Aaron Morton <aa...@thelastpickle.com>.
> Or does CqlPagingRecordReader support paging through the entire result set?
It supports paging through the entire result set.
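
Roughly speaking, each page is just the next LIMIT-sized chunk and the reader keeps fetching until the split is exhausted, so nothing after the first page is dropped. As a toy illustration of the behaviour (not the actual CqlPagingRecordReader code):

import java.util.ArrayList;
import java.util.List;

// Toy model only: the page size changes how many rows come back per request,
// not how many rows the job sees overall.
public class PagingToy {
    public static void main(String[] args) {
        List<Integer> resultSet = new ArrayList<Integer>();
        for (int i = 0; i < 2500; i++) {
            resultSet.add(i);                                   // pretend result set
        }

        int pageRowSize = 1000;                                 // cassandra.input.page.row.size
        int seenByMapper = 0;
        int from = 0;
        while (from < resultSet.size()) {
            int to = Math.min(from + pageRowSize, resultSet.size());
            List<Integer> page = resultSet.subList(from, to);   // one LIMIT-ed query
            seenByMapper += page.size();                        // every row reaches the mapper
            from = to;
        }
        System.out.println(seenByMapper + " of " + resultSet.size() + " rows processed");
        // prints: 2500 of 2500 rows processed
    }
}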

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
