Posted to user@cassandra.apache.org by Bob Hutchison <hu...@recursive.ca> on 2010/04/25 20:14:39 UTC

How do you construct an index and use it, especially in Ruby

Hi,

I'm new to Cassandra and trying to work out how to do something that I've implemented any number of times before (e.g. with TokyoCabinet, Perst, even the filesystem using grep :-). I've managed to get some of this working in Cassandra, but not all.

So here's the core of the situation.

I have this opaque chunk of data that I want to store in Cassandra and then find again.

I can generate a key when the data is created very easily, and I've stored it in a straightforward manner: in a column with a key whose value is the data. And I can retrieve it when I know the key. No difficulties here at all, works fine.

Now I want to index this data taking what I imagine to be a pretty typical approach.

Let's say there are two many-to-one indexes: 'colour' and 'size'. Each colour value will have more than one chunk of data; the same goes for size.

What I thought I'd do is make a super column and index the chunk of data kind of like: { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1}} with the key equal to the key of the chunk of data. And Cassandra stores it without error like that. So using the Ruby gem, it'd be something along the lines of:

  cassandra.insert(:Indexes, key-of-the-chunk-of-data, { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1 } })

Q1: is this a reasonable approach? It *seems* to be what I've read is supposed to be done. The 1 is meaningless. Anyway, it executes without error in Ruby.

Q2: what is the syntax of the (Ruby) query to find the keys of all 'blue' chunks of data? I'm assuming get_range is the correct method, but what are the parameters? The docs say: get_range(column_family, options={}) but that seems to be missing a bit of detail, in particular the super column name.

Q2a: I know the options hash supports :start and :finish keys (inclusive and exclusive, respectively). How do you define a range that means 'equals' for a UTF8 key? Surely not 'blue'.succ? Or by some kind of suffix?

Q2b: How do you specify the super column name 'colour'? Looking at the (Ruby) source of the get_range method, I'm unconvinced that this is implemented (a constant '' seems to be used where the super column name would make sense).

Anyway, I ended up hacking at the Ruby gem's source to use the column name where the '' was in the original, and didn't really get anywhere useful (I can find nothing, or everything, but nothing in between).

Q3: If I am correct about what is supposed to be done, does the Ruby gem support it?

Q4: Does anyone know of some Ruby code that does an indexed lookup that they could point me at? (There's lots of code that indexes, but nothing that searches by the index.)
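
To make the Q2a concern concrete, here is a small plain-Ruby sketch of a half-open range [start, finish) over string names. It does not touch Cassandra or the gem; in_range? is a hypothetical helper, and the assumption is simply that names are compared bytewise, as Ruby compares Strings. It shows why 'blue'.succ is too wide for an "equals" query and why appending a NUL byte gives the tightest possible upper bound.

```ruby
# Plain-Ruby model of a half-open range [start, finish) over names.
# No Cassandra involved; in_range? is a hypothetical helper.
def in_range?(name, start, finish)
  start <= name && name < finish
end

names = ['blue', 'blueberry', 'bluf', 'green']

# 'blue'.succ is 'bluf', so the range also admits longer names
# that merely begin with 'blue'.
with_succ = names.select { |n| in_range?(n, 'blue', 'blue'.succ) }

# "blue\0" is the smallest string that sorts after 'blue' bytewise,
# so only 'blue' itself can fall inside the range.
with_nul = names.select { |n| in_range?(n, 'blue', "blue\0") }

puts with_succ.inspect
puts with_nul.inspect
```

Whether a given client accepts a NUL byte in a key or column name is a separate question that this sketch does not settle.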

I'll try to take a look at some of the other Cassandra client implementations and see if I can get this model to work. Maybe just a Ruby problem?? With any luck, it'll be me messing up.

If it'd help I can post the source of what I have, but it'll need some cleanup. Let me know.

Thanks for taking the time to read this far :-)

Bob

----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://xampl.com/so




Re: How do you construct an index and use it, especially in Ruby

Posted by Bob Hutchison <hu...@recursive.ca>.

embedded response, way down below...

On 2010-04-26, at 12:56 PM, Ryan King wrote:

> On Sun, Apr 25, 2010 at 11:14 AM, Bob Hutchison
> <hu...@recursive.ca> wrote:
>> 
>> Hi,
>> 
>> I'm new to Cassandra and trying to work out how to do something that I've implemented any number of times (e.g. TokyoCabinet, Perst, even the filesystem using grep :-) I've managed to get some of this working in Cassandra but not all.
>> 
>> So here's the core of the situation.
>> 
>> I have this opaque chunk of data that I want to store in Cassandra and then find it again.
>> 
>> I can generate a key when the data is created very easily, and I've stored it in a straightforward manner: in a column with a key whose value is the data. And I can retrieve it when I know the key. No difficulties here at all, works fine.
>> 
>> Now I want to index this data taking what I imagine to be a pretty typical approach.
>> 
>> Let's say there are two many-to-one indexes: 'colour', and 'size'. Each colour value will have more than one chunk of data, same for size.
>> 
>> What I thought I'd do is make a super column and index the chunk of data kind of like: { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1}} with the key equal to the key of the chunk of data. And Cassandra stores it without error like that. So using the Ruby gem, it'd be something along the lines of:
>> 
>> cassandra.insert(:Indexes, key-of-the-chunk-of-data, { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1 } })
>> 
>> Q1: is this a reasonable approach? It *seems* to be what I've read is supposed to be done. The 1 is meaningless. Anyway, it executes without error in Ruby.
> 
> No. In order to index your data, you need to invert it. Since you're
> working in ruby I'd recommend CassandraObject:
> http://github.com/nzKoz/cassandra_object. It has indexing built in.

Thanks Ryan. I don't really want to add a lot of layers of abstraction here, since what I'm writing is itself an abstraction. Worse, I can't get cassandra_object to install, some kind of gem issue. Anyway...

I dusted off my 20-year-old experience with Python (i.e. with the help of Google), downloaded and installed pycassa (and Thrift itself), and played around a bit. I find that the following Python/pycassa snippet works just fine (or well enough).

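import pycassa

# Connect with pycassa's default settings.
client = pycassa.connect()
# 'Indexes' is the super column family that the Ruby code wrote to.
indexes_scf = pycassa.ColumnFamily(client, 'Play', 'Indexes', super=True)
# Ask only for the 'blue' column inside the 'colour' super column.
rows = list(indexes_scf.get_range(column_start='blue', column_finish='blue', super_column='colour'))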

The data was inserted using Ruby but not read with it because, as I said (now below), I don't know how to write the equivalent of the indexes_scf.get_range call in the snippet. So, a simpler question: how do you write the equivalent of that call in Ruby using the cassandra gem?
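
To pin down what I mean by "equivalent", here is a plain-Ruby model of what that query does against my layout. It is not the cassandra gem's API: INDEXES is an in-memory stand-in for the super column family, scan_rows is a hypothetical helper, and treating both ends of the column range as inclusive is an assumption based on 'blue' to 'blue' matching in the snippet above.

```ruby
# In-memory stand-in for the Indexes super column family:
# row key => { super column => { column => value } }.
INDEXES = {
  'chunk-1' => { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1 } },
  'chunk-2' => { 'colour' => { 'red' => 1 },  'size' => { 'large' => 1 } },
  'chunk-3' => { 'colour' => { 'blue' => 1 }, 'size' => { 'small' => 1 } }
}

# Hypothetical helper: visit every row and keep the columns of one super
# column whose names lie in [column_start, column_finish], ends inclusive.
def scan_rows(rows, super_column, column_start, column_finish)
  rows.each_with_object({}) do |(key, super_columns), found|
    columns = super_columns.fetch(super_column, {})
    slice = columns.select { |name, _| name >= column_start && name <= column_finish }
    found[key] = slice unless slice.empty?
  end
end

blue_keys = scan_rows(INDEXES, 'colour', 'blue', 'blue').keys
puts blue_keys.inspect
```

Note that the model has to visit every row to answer the question, which is presumably what the real query does with this layout as well.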

Cheers,
Bob

> 
> -ryan
> 
>> Q2: what is the syntax of the (Ruby) query to find the keys of all 'blue' chunks of data? I'm assuming get_range is the correct method, but what are the parameters? The docs say: get_range(column_family, options={}) but that seems to be missing a bit of detail, in particular the super column name.
>> 
>> Q2a: So I know there's a :start and :finish key supported in the options hash, inclusive, exclusive respectively. How do you define a range for equals with a UTF8 key? Surely not 'blue'.succ?? or by some kind of suffix??
>> 
>> Q2b: How do you specify the super column name 'colour'? Looking at the (Ruby) source of the get_range method, I'm unconvinced that this is implemented (seems to be a constant '' used where the super column name makes sense to be.)
>> 
>> Anyway I ended up hacking at the Ruby gem's source to use the column name where the '' was in the original, and didn't really get anywhere useful (I can find nothing, or everything, nothing in between).
>> 
>> Q3: If I am correct about what is supposed to be done, does the Ruby gem support it?
>> 
>> Q4: Does anyone know of some Ruby code that does an indexed lookup that they could point me at. (lots of code that indexes but nothing that searches by the index)
>> 
>> I'll try to take a look at some of the other Cassandra client implementations and see if I can get this model to work. Maybe just a Ruby problem?? With any luck, it'll be me messing up.
>> 
>> If it'd help I can post the source of what I have, but it'll need some cleanup. Let me know.
>> 
>> Thanks for taking the time to read this far :-)
>> 
>> Bob
>> 
>> ----
>> Bob Hutchison
>> Recursive Design Inc.
>> http://www.recursive.ca/
>> weblog: http://xampl.com/so
>> 
>> 

----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://xampl.com/so





Re: How do you construct an index and use it, especially in Ruby

Posted by Ryan King <ry...@twitter.com>.
On Sun, Apr 25, 2010 at 11:14 AM, Bob Hutchison
<hu...@recursive.ca> wrote:
>
> Hi,
>
> I'm new to Cassandra and trying to work out how to do something that I've implemented any number of times (e.g. TokyoCabinet, Perst, even the filesystem using grep :-) I've managed to get some of this working in Cassandra but not all.
>
> So here's the core of the situation.
>
> I have this opaque chunk of data that I want to store in Cassandra and then find it again.
>
> I can generate a key when the data is created very easily, and I've stored it in a straightforward manner: in a column with a key whose value is the data. And I can retrieve it when I know the key. No difficulties here at all, works fine.
>
> Now I want to index this data taking what I imagine to be a pretty typical approach.
>
> Let's say there are two many-to-one indexes: 'colour', and 'size'. Each colour value will have more than one chunk of data, same for size.
>
> What I thought I'd do is make a super column and index the chunk of data kind of like: { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1}} with the key equal to the key of the chunk of data. And Cassandra stores it without error like that. So using the Ruby gem, it'd be something along the lines of:
>
>  cassandra.insert(:Indexes, key-of-the-chunk-of-data, { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1 } })
>
> Q1: is this a reasonable approach? It *seems* to be what I've read is supposed to be done. The 1 is meaningless. Anyway, it executes without error in Ruby.

No. In order to index your data, you need to invert it. Since you're
working in Ruby, I'd recommend CassandraObject:
http://github.com/nzKoz/cassandra_object. It has indexing built in.
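
As a rough illustration of what "invert it" means, here is a sketch using plain Ruby hashes in place of column families. It is not the cassandra gem or CassandraObject API; the invert helper is hypothetical, and the choice of index name as row key, indexed value as super column, and data keys as columns is only one possible arrangement.

```ruby
# Forward layout from the question:
# data key => { index name => { value => 1 } }.
forward = {
  'chunk-1' => { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1 } },
  'chunk-2' => { 'colour' => { 'red' => 1 },  'size' => { 'large' => 1 } },
  'chunk-3' => { 'colour' => { 'blue' => 1 }, 'size' => { 'small' => 1 } }
}

# Hypothetical helper: flip the layout so the index name is the row key,
# the indexed value is the super column, and the data keys are the columns.
def invert(forward)
  inverted = Hash.new { |rows, index| rows[index] = Hash.new { |scs, value| scs[value] = {} } }
  forward.each do |data_key, indexes|
    indexes.each do |index_name, values|
      values.each_key { |value| inverted[index_name][value][data_key] = 1 }
    end
  end
  inverted
end

inverted = invert(forward)

# Finding all 'blue' chunks is now a direct lookup rather than a scan.
blue_keys = inverted['colour']['blue'].keys
large_keys = inverted['size']['large'].keys
puts blue_keys.inspect
puts large_keys.inspect
```

With this shape, the data keys are the column names, so the lookup reads one known row and one known super column instead of ranging over every row.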

-ryan

> Q2: what is the syntax of the (Ruby) query to find the keys of all 'blue' chunks of data? I'm assuming get_range is the correct method, but what are the parameters? The docs say: get_range(column_family, options={}) but that seems to be missing a bit of detail, in particular the super column name.
>
> Q2a: So I know there's a :start and :finish key supported in the options hash, inclusive, exclusive respectively. How do you define a range for equals with a UTF8 key? Surely not 'blue'.succ?? or by some kind of suffix??
>
> Q2b: How do you specify the super column name 'colour'? Looking at the (Ruby) source of the get_range method, I'm unconvinced that this is implemented (seems to be a constant '' used where the super column name makes sense to be.)
>
> Anyway I ended up hacking at the Ruby gem's source to use the column name where the '' was in the original, and didn't really get anywhere useful (I can find nothing, or everything, nothing in between).
>
> Q3: If I am correct about what is supposed to be done, does the Ruby gem support it?
>
> Q4: Does anyone know of some Ruby code that does an indexed lookup that they could point me at. (lots of code that indexes but nothing that searches by the index)
>
> I'll try to take a look at some of the other Cassandra client implementations and see if I can get this model to work. Maybe just a Ruby problem?? With any luck, it'll be me messing up.
>
> If it'd help I can post the source of what I have, but it'll need some cleanup. Let me know.
>
> Thanks for taking the time to read this far :-)
>
> Bob
>
> ----
> Bob Hutchison
> Recursive Design Inc.
> http://www.recursive.ca/
> weblog: http://xampl.com/so
>
>