Posted to user@cassandra.apache.org by aaron morton <aa...@thelastpickle.com> on 2013/04/01 02:02:26 UTC
Re: MultiInput/MultiGet CF in MapReduce

> Suppose I use client.get_slice ( key ). My rowkey is '20130314' from the Index Table.
> Q1) How do I know which Token Range & EndPoint rowkey '20130314' is in?
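Assuming the RandomPartitioner, calculate the MD5 hash of the key to get its token, then find the token range from describe_ring that contains that token. The endpoints listed for that range are the nodes that hold the row.
This is what is used internally: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/FBUtilities.java#L239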
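
A minimal sketch of that lookup in Python, assuming the RandomPartitioner (MD5 digest read as a signed big-endian integer, then abs()). The helper names and the four-node ring are made up for illustration. With a real cluster the ring would come from describe_ring, whose start_token and end_token arrive as strings and would need int() first.

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """Token as RandomPartitioner derives it: the MD5 digest read as a
    signed big-endian integer, made non-negative with abs()."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


def find_token_range(token, ring):
    """Return the (start, end, endpoints) entry whose range (start, end]
    contains the token, or None if no entry matches."""
    for start, end, endpoints in ring:
        if start < end:
            # Ordinary range: start is exclusive, end is inclusive.
            if start < token <= end:
                return (start, end, endpoints)
        elif token > start or token <= end:
            # Range that wraps around the end of the ring
            # (also covers a single-node ring where start == end).
            return (start, end, endpoints)
    return None


# Hypothetical four-node ring with evenly spaced tokens (illustration only).
STEP = 2**127 // 4
EXAMPLE_RING = [
    (0, STEP, ["10.0.0.1"]),
    (STEP, 2 * STEP, ["10.0.0.2"]),
    (2 * STEP, 3 * STEP, ["10.0.0.3"]),
    (3 * STEP, 0, ["10.0.0.4"]),  # wraps past the maximum token back to 0
]

token = random_partitioner_token(b"20130314")
owner = find_token_range(token, EXAMPLE_RING)
print(token, owner[2])
```

The endpoints in the matching entry are the nodes to prefer when you want the read to be local. This does not apply as written to other partitioners, which derive tokens differently.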

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 30/03/2013, at 10:45 AM, Alicia Leong <lc...@gmail.com> wrote:

> This is the current flow for ColumnFamilyInputFormat. Please correct me if I'm wrong.
> 
> 1) In ColumnFamilyInputFormat, get all nodes' token ranges using client.describe_ring
> 2) Get CfSplit using client.describe_splits_ex with the token range
> 3) Create a new ColumnFamilySplit with start range, end range and endpoint
> 4) In ColumnFamilyRecordReader, query client.get_range_slices with the start range & end range of the ColumnFamilySplit at the endpoint (datanode)
> 
> 
> Suppose I use client.get_slice ( key ). My rowkey is '20130314' from the Index Table.
> Q1) How do I know which Token Range & EndPoint rowkey '20130314' is in?
> Even if I manage to find out the Token Range & EndPoint,
> is there a Thrift API to which I can pass ( ByteBuffer key, KeyRange range ), like a merge of client.get_slice & client.get_range_slices?
> 
> 
> Thanks
> 
> 
> 
> On Sat, Mar 30, 2013 at 7:53 AM, Edward Capriolo <ed...@gmail.com> wrote:
> You can use the output of describe_ring along with partitioner information to determine which nodes data lives on.
> 
> 
> On Fri, Mar 29, 2013 at 12:33 PM, Alicia Leong <lc...@gmail.com> wrote:
> Hi All
> I'm thinking of doing it this way.
> 
> 1) get_slice ( YYYYMMDDHH ) from Index Table.
> 
> 2) With the returned list of ROWKEYs
> 
> 3) Pass it to multiget_slice ( keys …)
> 
>  
> But my question is: how do I ensure 'Data Locality'?
> 
> 
> 
> On Tue, Mar 19, 2013 at 3:33 PM, aaron morton <aa...@thelastpickle.com> wrote:
> I would be looking at Hive or Pig, rather than writing the MapReduce. 
> 
> There is an example in the Cassandra source distribution, or you can look at DataStax Enterprise to start playing with Hive. 
> 
> Typically with Hadoop queries you want to query a lot of data; if you are only querying a few rows, consider writing the code in your favourite language. 
> 
> Cheers
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 18/03/2013, at 1:29 PM, Alicia Leong <lc...@gmail.com> wrote:
> 
>> Hi All
>> 
>> I have 2 tables 
>> 
>> Data Table 
>> -----------------
>> RowKey: 1 
>> => (column=name, value=apple) 
>> RowKey: 2 
>> => (column=name, value=orange) 
>> RowKey: 3 
>> => (column=name, value=banana) 
>> RowKey: 4 
>> => (column=name, value=mango) 
>> 
>> 
>> Index Table (YYYYMMDDHH)
>> ------------------------------------------------
>> RowKey: 2013030114 
>> => (column=1, value=) 
>> => (column=2, value=) 
>> => (column=3, value=) 
>> RowKey: 2013030115 
>> => (column=4, value=) 
>> 
>> 
>> I would like to know how to implement the following in MapReduce:
>> 1) first query the Index Table by RowKey: 2013030114 
>> 2) then pass the Index Table column names  (1,2,3) to query the Data Table 
>> 
>> Thanks in advance.
> 
> 
> 
>