
Posted to user@cassandra.apache.org by mcasandra <mo...@gmail.com> on 2011/02/23 23:49:40 UTC

Understanding Indexes

So far my understanding about indexes is that you can create indexes only on
column values (username in the example below).

Does it make sense to also have an index on the keys that a ColumnFamily uses
to store rows (row key "abc" in the example below)? I am thinking that, in the
event rows keep growing, would a search be fast if there were an index on row
keys and you wanted to retrieve e.g. "def" only out of tons of rows?

UserProfile = { // this is a ColumnFamily
    abc: {   // this is the key to this Row inside the CF
        // now we have an infinite # of columns in this row
        username: "phatduckk",
        email: "phatduckk@example.com",
        phone: "(900) 976-6666"
    }, // end row
    def: {   // this is the key to another row in the CF
        // now we have another infinite # of columns in this row
        username: "ieure",
        email: "ieure@example.com",
        phone: "(888) 555-1212",
        age: "66",
        gender: "undecided"
    },
}


2) Is the hash of the column key or of the row key used by RandomPartitioner
to distribute data across the Cassandra nodes?
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6058238.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Understanding Indexes

Posted by Tyler Hobbs <ty...@datastax.com>.

On Thu, Feb 24, 2011 at 3:07 PM, mcasandra <mo...@gmail.com> wrote:

>
> Thanks! I just started reading about Bloom Filter. Is this something that
> is
> inbuilt by default or is it something that need to be explicitly
> configured?
>

It's built in, no configuration needed.

-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

Re: Understanding Indexes

Posted by Michal Augustýn <au...@gmail.com>.

Retrieving data by row key is the primary way to get data from
Cassandra, so it's highly optimized.
First, the node responsible for the row is computed using the partitioner.
You can use RandomPartitioner (distributes rows by the MD5 of their keys) or
OrderPreservingPartitioner (keys must be UTF-8 strings).
Then the row is found on the node using a bloom filter (
http://wiki.apache.org/cassandra/ArchitectureOverview).

So retrieving a row by its key is the fastest way you can get the row.
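
To make the bloom filter step concrete, here is a minimal sketch in Python. It is
not Cassandra's implementation: the BloomFilter class, the salted-MD5 hashing, the
sizes and the in-memory "sstables" list are all assumptions made for illustration.
It only shows the idea that a filter can say "definitely not here" so a read can
skip files that cannot contain the key.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions by salting an MD5 hash of the key
        # (an illustrative choice, not what Cassandra uses internally).
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# Two hypothetical data files on one node, each with its own filter.
sstables = [
    {"filter": BloomFilter(), "rows": {"abc": {"username": "phatduckk"}}},
    {"filter": BloomFilter(), "rows": {"def": {"username": "ieure"}}},
]
for table in sstables:
    for key in table["rows"]:
        table["filter"].add(key)


def read_row(key):
    for table in sstables:
        # Skip files whose filter says the key is definitely absent.
        if table["filter"].might_contain(key):
            row = table["rows"].get(key)
            if row is not None:
                return row
    return None
```

Calling read_row("def") only has to open the files whose filter answers "maybe";
a false positive costs one wasted look, never a wrong result.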

Augi

2011/2/24 mcasandra <mo...@gmail.com>

>
> Thanks! I just started reading about Bloom Filter. Is this something that
> is
> inbuilt by default or is it something that need to be explicitly
> configured?
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6062010.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: Understanding Indexes

Posted by mcasandra <mo...@gmail.com>.

Thanks! I just started reading about bloom filters. Is this something that is
built in by default, or does it need to be explicitly configured?
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6062010.html

Re: Understanding Indexes

Posted by Edward Capriolo <ed...@gmail.com>.

On Thu, Feb 24, 2011 at 3:55 PM, mcasandra <mo...@gmail.com> wrote:
>
> Either I am not explaning properly or I don't understand the data model just
> yet. Please check again:
>
> In below example this is what I understand:
>
> 1) UserProfile is a CF
> 2) 1111 is a row key
> 3) username is a column. Each row (eg 1111) has username column
>
> My understanding is that secondary indexes can be created only on column
> value. Which means I can create secondary index only on username, email etc.
> not on 1111. 1111 is the row key, but you keep saying that I need secondary
> index, but I am actually asking about index on the row key.
>
> Is my understanding incorrect about this?
>
>> UserProfile = { // this is a ColumnFamily
>>    1111 {   // this is the key to this Row inside the CF
>>        // now we have an infinite # of columns in this row
>>        username: "phatduckk",
>>        email: "[hidden email]",
>>        phone: "(900) 976-6666"
>>    }, // end row
>>    2222 {   // this is the key to another row in the CF
>>        // now we have another infinite # of columns in this row
>>        username: "ieure",
>>        email: "[hidden email]",
>>        phone: "(888) 555-1212"
>>        age: "66",
>>        gender: "undecided"
>>    },
>>  }
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061959.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
>

You do not need secondary indexes to search on the RowKey. The Row Key
is used by the partitioner to locate your data across the cluster. The
Row Key is also used as the primary sort of the SSTables. Thus the row
key is naturally indexed.
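
As a rough illustration of why sorted storage already acts as an index, the
sketch below does a binary search over sorted row keys. The data and the lookup
function are made up for this example, and it ignores that with
RandomPartitioner the sort order is by token rather than by the raw key.

```python
import bisect

# Hypothetical contents of one data file: (row key, row) pairs sorted by key.
sorted_rows = sorted(
    {
        "2222": {"username": "ieure"},
        "1111": {"username": "phatduckk"},
        "3333": {"username": "example"},
    }.items()
)
sorted_keys = [key for key, _ in sorted_rows]


def lookup(key):
    """Binary search over the sorted keys: O(log n) comparisons."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_rows[i][1]
    return None
```

With a billion keys in one file, a binary search like this needs about 30
comparisons, which is why no extra index on the row key is required.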

Re: Understanding Indexes

Posted by mcasandra <mo...@gmail.com>.

Either I am not explaining properly or I don't understand the data model just
yet. Please check again:

In the example below this is what I understand:

1) UserProfile is a CF
2) 1111 is a row key
3) username is a column. Each row (e.g. 1111) has a username column

My understanding is that secondary indexes can be created only on column
values, which means I can create a secondary index only on username, email
etc., not on 1111. 1111 is the row key. You keep saying that I need a
secondary index, but I am actually asking about an index on the row key.

Is my understanding incorrect about this?

> UserProfile = { // this is a ColumnFamily 
>    1111 {   // this is the key to this Row inside the CF 
>        // now we have an infinite # of columns in this row 
>        username: "phatduckk", 
>        email: "[hidden email]", 
>        phone: "(900) 976-6666" 
>    }, // end row 
>    2222 {   // this is the key to another row in the CF 
>        // now we have another infinite # of columns in this row 
>        username: "ieure", 
>        email: "[hidden email]", 
>        phone: "(888) 555-1212" 
>        age: "66", 
>        gender: "undecided" 
>    }, 
>  } 

-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061959.html

Re: Understanding Indexes

Posted by Edward Capriolo <ed...@gmail.com>.

On Thu, Feb 24, 2011 at 3:34 PM, mcasandra <mo...@gmail.com> wrote:
>
> I wasn't aware that there is an index on primary key (that is row keys). So
> from what I understand there is by default an index on for eg: 1111,2222 in
> below example? Where can I read more about it?
>
> UserProfile = { // this is a ColumnFamily
>    1111 {   // this is the key to this Row inside the CF
>        // now we have an infinite # of columns in this row
>        username: "phatduckk",
>        email: "[hidden email]",
>        phone: "(900) 976-6666"
>    }, // end row
>    2222 {   // this is the key to another row in the CF
>        // now we have another infinite # of columns in this row
>        username: "ieure",
>        email: "[hidden email]",
>        phone: "(888) 555-1212"
>        age: "66",
>        gender: "undecided"
>    },
>  }
>
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061857.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
>


Dude! You are running before you can walk. Why are you worried about
secondary indexing before you know what the primary index is? :)

http://wiki.apache.org/cassandra/ArchitectureOverview
http://wiki.apache.org/cassandra/ArchitectureSSTable

Re: Understanding Indexes

Posted by mcasandra <mo...@gmail.com>.

I wasn't aware that there is an index on the primary key (that is, row keys).
So from what I understand there is by default an index on e.g. 1111, 2222 in
the example below? Where can I read more about it?

UserProfile = { // this is a ColumnFamily
    1111 {   // this is the key to this Row inside the CF
        // now we have an infinite # of columns in this row
        username: "phatduckk",
        email: "[hidden email]",
        phone: "(900) 976-6666"
    }, // end row
    2222 {   // this is the key to another row in the CF
        // now we have another infinite # of columns in this row
        username: "ieure",
        email: "[hidden email]",
        phone: "(888) 555-1212"
        age: "66",
        gender: "undecided"
    },
 }


-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061857.html

Re: Understanding Indexes

Posted by Ed Anuff <ed...@anuff.com>.

It all depends on what you're trying to do.  What you're proposing, by
definition, is creating a secondary index.  The primary index is your row
key.  Depending on the partitioner, it might or might not be a conveniently
iterable or sorted index.  If you need your keys sorted in a different
order than the partitioner provides, if you need your keys organized into
groups that can be quickly retrieved or tested for membership, or if there
is some other reason why the primary index doesn't suffice, then you need a
secondary index.  It all depends on whether you need to retrieve rows based
on different criteria than what the primary index provides.  If so, then yes,
you'll probably end up doing something that involves creating rows that are
full of row keys.  But if you're not storing a subset of your full key set
and you don't have specific needs for ordering and iterating, then it would
be redundant.
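
A minimal sketch of "rows that are full of row keys", using plain Python dicts to
stand in for column families. The username_index structure, the "by_username"
row name and the helper functions are hypothetical, chosen only to show a
hand-built secondary index that orders and finds rows by something other than
the row key.

```python
# Primary data: row key -> columns (names taken from the thread's example).
user_profile = {
    "1111": {"username": "phatduckk", "phone": "(900) 976-6666"},
    "2222": {"username": "ieure", "phone": "(888) 555-1212"},
}

# Hypothetical index column family: one row whose column names are usernames
# and whose column values are row keys in user_profile.
username_index = {"by_username": {}}


def index_user(row_key):
    username = user_profile[row_key]["username"]
    username_index["by_username"][username] = row_key


for row_key in user_profile:
    index_user(row_key)


def find_by_username(username):
    row_key = username_index["by_username"].get(username)
    return user_profile.get(row_key) if row_key is not None else None


def usernames_in_order():
    # Sorting here stands in for the sorted column order inside a row.
    return sorted(username_index["by_username"])
```

The cost is that every write to user_profile now also has to update the index
row, which is the redundancy the message warns about when the primary index
would have been enough.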

On Thu, Feb 24, 2011 at 11:18 AM, mcasandra <mo...@gmail.com> wrote:

>
> Thanks! I am thinking more in terms where you have millions of keys (rows).
> For eg: UUID as a row key. or there could millions of users.
>
> So are we saying that we should NOT create column families with these many
> keys? What are the other options in such cases?
>
> UserProfile = { // this is a ColumnFamily
> >    1 {   // this is the key to this Row inside the CF
> >        // now we have an infinite # of columns in this row
> >        username: "phatduckk",
> >        email: "[hidden email]",
> >        phone: "(900) 976-6666"
> >    }, // end row
> >    2 {   // this is the key to another row in the CF
> >        // now we have another infinite # of columns in this row
> >        username: "ieure",
> >        email: "[hidden email]",
> >        phone: "(888) 555-1212"
> >        age: "66",
> >        gender: "undecided"
> >    },
> > }
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061574.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: Understanding Indexes

Posted by Javier Canillas <ja...@gmail.com>.

I'm not saying you shouldn't. If you feel there is a problem, you could
think about splitting the column family into N. But I don't think you will
hit that problem. You can read about RowCacheSize and KeyCache support in
Cassandra 0.7.X; if your rows are small, you can cache a lot of them and
avoid a lot of latency issues when reading and writing.

On Thu, Feb 24, 2011 at 4:18 PM, mcasandra <mo...@gmail.com> wrote:

>
> Thanks! I am thinking more in terms where you have millions of keys (rows).
> For eg: UUID as a row key. or there could millions of users.
>
> So are we saying that we should NOT create column families with these many
> keys? What are the other options in such cases?
>
> UserProfile = { // this is a ColumnFamily
> >    1 {   // this is the key to this Row inside the CF
> >        // now we have an infinite # of columns in this row
> >        username: "phatduckk",
> >        email: "[hidden email]",
> >        phone: "(900) 976-6666"
> >    }, // end row
> >    2 {   // this is the key to another row in the CF
> >        // now we have another infinite # of columns in this row
> >        username: "ieure",
> >        email: "[hidden email]",
> >        phone: "(888) 555-1212"
> >        age: "66",
> >        gender: "undecided"
> >    },
> > }
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061574.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: Understanding Indexes

Posted by mcasandra <mo...@gmail.com>.

Thanks! I am thinking more in terms of cases where you have millions of keys
(rows), e.g. a UUID as the row key, or there could be millions of users.

So are we saying that we should NOT create column families with this many
keys? What are the other options in such cases?

UserProfile = { // this is a ColumnFamily
>    1 {   // this is the key to this Row inside the CF
>        // now we have an infinite # of columns in this row
>        username: "phatduckk",
>        email: "[hidden email]",
>        phone: "(900) 976-6666"
>    }, // end row
>    2 {   // this is the key to another row in the CF
>        // now we have another infinite # of columns in this row
>        username: "ieure",
>        email: "[hidden email]",
>        phone: "(888) 555-1212"
>        age: "66",
>        gender: "undecided"
>    },
> }

-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061574.html

Re: Understanding Indexes

Posted by Javier Canillas <ja...@gmail.com>.

I really don't see the point. Again, suppose a cluster with 3 nodes, where
there is a ColumnFamily that holds data whose key basically consists of a
2-letter word (pretty simple). That makes a total of 729 possible keys
(27 x 27, assuming a 27-letter alphabet).

RandomPartitioner will then tokenize each key and assign it to a node
within the cluster, so each node will handle roughly 243 keys (plus
replication, of course).

OK, now suppose that you need to look up data for key "AG". The node that
you ask will use RandomPartitioner to tokenize the key, determine which
node is responsible for that key, and ask that node for the data (and
ask the replicas for an MD5 digest of the data to compare). So each node
only needs to search about 1/3 of the stored keys.
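
The arithmetic above can be sketched as follows. This is a simplified model, not
Cassandra's code: the node names are invented, each node is assumed to own one
equal, contiguous slice of a 128-bit token space, and the 27-symbol alphabet
(A-Z plus Ñ) is an assumption chosen because 27 x 27 gives the 729 keys
mentioned in the message.

```python
import hashlib
import string
from collections import Counter
from itertools import product

NODES = ["node1", "node2", "node3"]  # hypothetical node names
TOKEN_SPACE = 2 ** 128               # MD5 digests are 128 bits


def token(key):
    """MD5 of the row key, read as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def node_for(key):
    # Assumption: each node owns one equal, contiguous slice of the tokens.
    return NODES[token(key) * len(NODES) // TOKEN_SPACE]


# 27 symbols give the 27 * 27 = 729 two-letter keys from the message.
alphabet = string.ascii_uppercase + "Ñ"
keys = ["".join(pair) for pair in product(alphabet, repeat=2)]
load = Counter(node_for(k) for k in keys)
```

Printing load shows how the 729 keys spread over the three nodes. Because the
placement comes from a hash, the split is only approximately even, so 243 per
node is an average rather than a guarantee. Any node can compute
node_for("AG") on its own, which is how the node you ask knows where to
forward the request.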

How do you think an index is implemented? As far as I know, a simple index
is basically a hash table that has the indexed value as key and the position
as value. How do you think a search within the index (by hashcode) is
implemented?

I don't know, maybe there is some magic behind indexes (I know there are
some complex indexes built on B-Trees, like the ones used in SQL
solutions), but I think the whole thing would only add complexity
over a more straightforward solution. How big would the CF need to be (in
terms of keys) before searching over hashcodes shows noticeable latency? And
then think: if I need to add a new key, what's the cost of the whole process?
Now, let's assume you can build the whole B-Tree up front (even for the keys
that do not exist yet): how much memory would that cost? There should be some
papers that discuss this problem somewhere.

I would definitely do some volume calculations and some stress tests on
this, at least to be sure there is a problem before attempting any kind of
solution.

PS: I feel this is like the problem I raised about TTL values, saying
basically that a TTL value past the year 2050 would throw an exception. Who
will be alive after the 2012 doomsday? :)

On Thu, Feb 24, 2011 at 3:18 PM, mcasandra <mo...@gmail.com> wrote:

>
> What I am trying to ask is that what if there are billions of row keys (eg:
> abc, def, xyz in below eg.) and then client does a lookup/query on a row
> say
> xyz (get all cols for row xyz). Now since there are billions of rows look
> up
> using Hash mechanism, is it going to be slow? What algorithm will be used
> to
> retrieve row xyz which could be anywhere in those billion rows on a
> particular node.
>
> Is it going to help if there is an index on row keys (eg: abc, xyz)?
>
> > UserProfile = { // this is a ColumnFamily
> >    abc: {   // this is the key to this Row inside the CF
> >        // now we have an infinite # of columns in this row
> >        username: "phatduckk",
> >        email: "phatduckk@example.com",
> >        phone: "(900) 976-6666"
> >    }, // end row
> >    def: {   // this is the key to another row in the CF
> >        // now we have another infinite # of columns in this row
> >        username: "ieure",
> >        email: "ieure@example.com",
> >        phone: "(888) 555-1212"
> >        age: "66",
> >        gender: "undecided"
> >    },
> > }
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061356.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: Understanding Indexes

Posted by mcasandra <mo...@gmail.com>.

What I am trying to ask is: what if there are billions of row keys (e.g.
abc, def, xyz in the example below) and a client then does a lookup/query on
a row, say xyz (get all columns for row xyz)? Since there are billions of rows
and the lookup uses a hash mechanism, is it going to be slow? What algorithm
will be used to retrieve row xyz, which could be anywhere among those billion
rows on a particular node?

Is it going to help if there is an index on row keys (e.g. abc, xyz)?

> UserProfile = { // this is a ColumnFamily
>    abc: {   // this is the key to this Row inside the CF
>        // now we have an infinite # of columns in this row
>        username: "phatduckk",
>        email: "phatduckk@example.com",
>        phone: "(900) 976-6666"
>    }, // end row
>    def: {   // this is the key to another row in the CF
>        // now we have another infinite # of columns in this row
>        username: "ieure",
>        email: "ieure@example.com",
>        phone: "(888) 555-1212"
>        age: "66",
>        gender: "undecided"
>    },
> }
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061356.html

Re: Understanding Indexes

Posted by Ed Anuff <ed...@anuff.com>.

If you mean does it make sense to have a CF where each row contains a set of
keys to other rows in another CF, then yes, that's a common design pattern,
although usually it's because you're creating collections of those rows
(i.e. a Groups CF where each row consists of a set of keys to rows in the
Users CF).  Not sure if that's what you're getting at, though.
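
A small sketch of that Groups/Users pattern, with Python dicts standing in for
the two column families. The group names and helper functions are invented for
illustration; the empty column values reflect the common habit of using only
the column name (the user's row key) and leaving the value blank.

```python
users = {
    "1111": {"username": "phatduckk"},
    "2222": {"username": "ieure"},
}

# Each row in the Groups CF is a set of row keys pointing into the Users CF.
# The group names are made up for illustration.
groups = {
    "admins": {"1111": ""},
    "everyone": {"1111": "", "2222": ""},
}


def members(group_name):
    """Resolve a group row into the user rows it points at."""
    keys = sorted(groups.get(group_name, {}))
    return [users[key] for key in keys if key in users]


def is_member(group_name, user_key):
    return user_key in groups.get(group_name, {})
```

Reading a group is then one row fetch from Groups followed by a fetch of the
referenced rows from Users by key.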

On Thu, Feb 24, 2011 at 9:34 AM, mcasandra <mo...@gmail.com> wrote:

>
> Generally no. But yes if retrieving the key through index is faster than
> going through the hash buckets.
>
> Currently I am thinking there could be 100s of million or billion of rows
> and in that case if we have to retrieve a row which one will be fast going
> through hash bucket or index? I am thinking in such scenario Index would be
> faster. Please help me understand where I am going wrong. Some example will
> be helpful.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061197.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: Understanding Indexes

Posted by mcasandra <mo...@gmail.com>.

Generally no. But yes, if retrieving the key through an index is faster than
going through the hash buckets.

Currently I am thinking there could be hundreds of millions or billions of
rows, and in that case, if we have to retrieve a row, which will be faster:
going through a hash bucket or an index? I am thinking that in such a
scenario an index would be faster. Please help me understand where I am going
wrong. Some examples would be helpful.
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061197.html

Re: Understanding Indexes

Posted by Javier Canillas <ja...@gmail.com>.

I don't think I got the point of your question. But if you are thinking
about key indexes (like PKs), keep in mind that Cassandra manages
keys using the partitioning strategy. By doing so, it is able to
determine which node should hold the row with a given key.
So, in other words, inside Cassandra each column family is treated
as a big table (a hashtable). With that in mind, there is no need
to have an index by key. Would you put an index over a hashtable's
keys??

Sent from my iPhone

On 23/02/2011, at 19:50, mcasandra <mo...@gmail.com> wrote:

>
> So far my understanding about indexes is that you can create indexes only on
> column values (username in below eg).
>
> Does it make sense to also have index on the keys that columnFamily uses to
> store rows (row keys "abc" in below example). I am thinking in an event rows
> keep growing would search be fast if there is an index on row keys if you
> want to retrieve for eg "def" only out of tons of rows?
>
> UserProfile = { // this is a ColumnFamily
>    abc: {   // this is the key to this Row inside the CF
>        // now we have an infinite # of columns in this row
>        username: "phatduckk",
>        email: "phatduckk@example.com",
>        phone: "(900) 976-6666"
>    }, // end row
>    def: {   // this is the key to another row in the CF
>        // now we have another infinite # of columns in this row
>        username: "ieure",
>        email: "ieure@example.com",
>        phone: "(888) 555-1212"
>        age: "66",
>        gender: "undecided"
>    },
> }
>
>
> 2) Is the hash of column key used or row key used by RandomPartitioner to
> distribute it accross the cassandra nodes?
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6058238.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.