Posted to commits@cassandra.apache.org by "Gary Dusbabek (JIRA)" <ji...@apache.org> on 2010/01/29 21:24:36 UTC

[jira] Created: (CASSANDRA-749) Secondary indices for column families

Secondary indices for column families
-------------------------------------

                 Key: CASSANDRA-749
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Gary Dusbabek
            Assignee: Gary Dusbabek
            Priority: Minor
             Fix For: 0.6




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829641#action_12829641 ] 

Gary Dusbabek commented on CASSANDRA-749:
-----------------------------------------

It might as well wait until 0.7 then.

There will still be a string restriction on the column name, unless there is a practical way to express a binary column name in storage-conf.xml where the secondary indices are declared.   Maybe add an attribute to <Index> to indicate the On attribute contains base64 data or something like that.  Ugly crap though.
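One way to read the base64 idea, as a minimal sketch: the index declaration carries a flag saying whether the On value is plain text or base64, and the loader turns it into raw column-name bytes either way. The function name and the flag are hypothetical; nothing here reflects the actual storage-conf.xml parser.

```python
import base64

def decode_index_column(on: str, is_base64: bool = False) -> bytes:
    """Return the raw column name that an index declaration refers to.

    `on` stands in for the On attribute; `is_base64` stands in for the
    extra attribute suggested above. Both names are assumptions.
    """
    if is_base64:
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(on, validate=True)
    return on.encode("utf-8")

print(decode_index_column("email"))
print(decode_index_column("AAEC/w==", is_base64=True))
```

The plain-text path stays readable for the common case, and only binary column names pay the base64 readability cost.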


[jira] Updated: (CASSANDRA-749) Secondary indices for column families

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-749:
------------------------------------

    Attachment: 0001-simple-secondary-indices.patch


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Filippo Fadda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834610#action_12834610 ] 

Filippo Fadda commented on CASSANDRA-749:
-----------------------------------------

Hello, I just looked at the patch code, and I saw that there is only a method to search for a set of rows inside the specified ColumnFamily, for the specified Column, given a value for the column itself. I think we really need a method to retrieve all the rows inside the specified ColumnFamily, for the specified Column, sorted by column value (ascending or descending). Is there a way to do this? I don't see it in the code. Am I wrong?


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844967#action_12844967 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

Skinny rows is more special casing than just partitioning indexes, and doesn't solve the "how do we keep the index consistent w/ the actual data in the case of failures" problem.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844811#action_12844811 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

Step one is convert key from String to byte[], so that's symmetric.  (Should probably get its own ticket.)


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829247#action_12829247 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

> you mean, make each machine hold a copy of the full index?
I mean a distributed secondary index would be stored in a true ColumnFamily, so it would be partitioned like any other CF.

> let's keep this ticket's scope to just automating using another CF to look up keys by value.
In the local secondary index case, perhaps the index should be another component of the ColumnFamilyStore, tied to each Memtable/SSTable, rather than a separate ColumnFamily. This gains us a lot of efficiency, and some consistency.

We already have the primary index file (Index.db) on disk, so secondary indexes would be similar: (column, datafile_offset) tuples. Consistency-wise, all replication and repairs happen at the ColumnFamily level, so with a separate index ColumnFamily, replication might repair the data ColumnFamily but not its index, for instance.
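A rough sketch of the (column, datafile_offset) idea, assuming one sorted list of tuples per data file and a binary search for lookups. The in-memory row layout and both function names are made up for illustration and do not mirror the real SSTable or Index.db format.

```python
import bisect

def build_secondary_index(rows, column):
    """Build sorted (column_value, data_offset) tuples for one data file.

    rows: iterable of (data_offset, {column_name: value}); rows that lack
    the indexed column are simply not indexed.
    """
    entries = [(cols[column], offset) for offset, cols in rows if column in cols]
    entries.sort()
    return entries

def lookup(index, value):
    """Return the data-file offsets of rows whose indexed column equals value."""
    # (value,) sorts before every (value, offset), so this finds the first match.
    start = bisect.bisect_left(index, (value,))
    offsets = []
    for v, offset in index[start:]:
        if v != value:
            break
        offsets.append(offset)
    return offsets

rows = [
    (0, {b"state": b"TX"}),
    (120, {b"state": b"CA"}),
    (260, {b"state": b"TX"}),
    (400, {b"city": b"Austin"}),
]
index = build_secondary_index(rows, b"state")
print(lookup(index, b"TX"))
```

Because the index is built from the same file it points into, it can be written and discarded together with that file, which is where the consistency benefit would come from.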


[jira] Updated: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-749:
-------------------------------

    Attachment: views-discussion.txt

More thoughts on incrementally updated views:

In order to not add reads to Cassandra's write path, we could implement a modification to mapreduce that splits the Map function into a map-key function and a map-value. I'm attaching a conversation about a potential approach, but it doesn't really get into implementation details or what the API for the map-key and map-value functions would look like.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844942#action_12844942 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

> Load balancing doesn't help if you are indexing something with less potential values than you have nodes in the cluster
Again, this brings up the topic of skinny rows: I'm sticking with the idea that we would want skinny rows with a compound key, so that each row key in the index/view might start with "true" in the boolean case, but the actual view row key would be a compound: "true|<base-key>". So, yes, even in the boolean case it is possible to partition the index: you would have very hot spots around "true|*" and "false|*", but that wouldn't stop our load balancing from splitting based on the remainder of the key.
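To make the compound-key argument concrete, here is a toy sketch under an order-preserving partitioner. The "|" separator comes from the example above; the split-point picker is a stand-in for load balancing and is purely an assumption, as are all function names.

```python
import bisect

SEP = "|"  # separator taken from the "true|<base-key>" example

def view_key(indexed_value, base_key):
    """Compound view row key: indexed value first, base row key second."""
    return f"{indexed_value}{SEP}{base_key}"

def split_points(sorted_keys, node_count):
    """Pick node_count - 1 evenly spaced keys as range boundaries (toy load balancer)."""
    step = len(sorted_keys) / node_count
    return [sorted_keys[int(i * step)] for i in range(1, node_count)]

def owner(boundaries, key):
    """Index of the node whose range contains key under order-preserving partitioning."""
    return bisect.bisect_right(boundaries, key)

# Eight base rows with a boolean column, alternating true/false.
keys = sorted(
    view_key("true" if i % 2 == 0 else "false", f"user{i:02d}") for i in range(8)
)
boundaries = split_points(keys, 4)
for k in keys:
    print(owner(boundaries, k), k)
```

With only two distinct indexed values, the eight view rows still land on four different nodes, because the boundaries fall inside the "true|*" and "false|*" runs rather than between them.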

> Nobody wants to write static java code to define a view, I can promise you that. :)
I know, but it is a temporary solution that allows us to fine tune the interface without providing scripting support or anything else crazy.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847026#action_12847026 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

> Doing this for non-local indexes requires the cluster to be OPP, which experience has demonstrated is not what
> most people want to use despite its advantages
If people don't want to use OPP, most likely that is because we have more work to do on load balancing (fixing CASSANDRA-579 for instance). OPP is one of our key advantages, and throwing it away because it still needs improvement is not wise.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829635#action_12829635 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

gary, do you think this is worth committing to 0.6 given the String limitation?  for 0.7 we will almost certainly move to byte[] keys which would make the column -> key thing much more sane.
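A small illustration of why String keys are awkward here, assuming the index is stored as another column family whose row key is the indexed column value. That layout is an assumption drawn from this thread, not taken from the patch, and both function names are hypothetical.

```python
def index_row_key_as_string(column_value: bytes) -> str:
    """With String row keys, the column value must be valid text."""
    return column_value.decode("utf-8")  # raises on arbitrary binary values

def index_row_key_as_bytes(column_value: bytes) -> bytes:
    """With byte[] row keys, any column value can be used unchanged."""
    return column_value

binary_value = b"\xff\x00"
print(index_row_key_as_bytes(binary_value))
try:
    index_row_key_as_string(binary_value)
except UnicodeDecodeError as exc:
    print("cannot use as String key:", exc.reason)
```

Once keys and column values are both raw bytes, the column -> key mapping needs no encoding step in either direction.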


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829254#action_12829254 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

I think having each node index its CFs locally is a lose for us, because as you say we have to query the full cluster for any index lookup, since we are throwing away our usual partitioning scheme.  

This means we add a ton of complexity (adding a completely different query path) in exchange for not being able to scale these queries, since the work generated increases in lockstep w/ machines added.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847022#action_12847022 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

> So in practice I strongly suspect this will scale at least to hundreds of nodes if not thousands
> so saying "we can't do this because it won't scale" is not a strong argument. 
I think you're making the "speed == scalability" mistake. It doesn't matter if we could do 30k index queries per second on one node: your bound for index queries for the entire cluster would still be 30k, no matter how many nodes you added.
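The bound can be written down as back-of-the-envelope arithmetic. This sketch ignores replication, caching, and the follow-up reads of the base rows; the 30k figure is the one used above, and the function name is hypothetical.

```python
def cluster_index_qps(nodes, per_node_qps, nodes_touched_per_query):
    """Upper bound on index queries per second for the whole cluster.

    Total capacity is nodes * per_node_qps; each query consumes one unit
    of capacity on every node it touches.
    """
    return nodes * per_node_qps / nodes_touched_per_query

for n in (10, 100, 1000):
    local = cluster_index_qps(n, 30_000, nodes_touched_per_query=n)
    partitioned = cluster_index_qps(n, 30_000, nodes_touched_per_query=1)
    print(n, local, partitioned)
```

With local indexes every query touches every node, so the bound stays at the single-node figure however many nodes are added; with a partitioned index that is hit on one node per query, the bound grows with the node count.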

> So you have to check each index hit for validity *each* *time* which is a huge hit.
You have to do the same thing for the secondary index: presumably you actually want to find the content of the row that was indexed, and so you need to seek to the row in the indexed CF. Both solutions need this seek: one just performs it across the network.

> you have no way of knowing if that's because another process is about to clean out the index entry, or add the natural entry.
This is a problem, I'll admit. One option is to do something like 'view-read-repair': when retrieving the indexed row from the base, only clean up an invalid index entry after enough time has passed since the entry's creation time for any in-flight writes to have completed.
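A minimal sketch of that grace-period check, assuming each index entry records its creation time and that some upper bound on write latency can be configured. The entry layout, the constant, and the function name are all hypothetical.

```python
import time

GRACE_SECONDS = 10.0  # assumed upper bound on how long a write can be in flight

def check_index_entry(entry, base_row, now=None, grace=GRACE_SECONDS):
    """Classify one index entry as 'valid', 'pending', or 'purge'.

    entry: {"column": ..., "value": ..., "created_at": seconds}
    base_row: the columns read from the base row, or None if it is missing.
    """
    now = time.time() if now is None else now
    if base_row is not None and base_row.get(entry["column"]) == entry["value"]:
        return "valid"
    if now - entry["created_at"] < grace:
        # The mismatch may be caused by a write that has not landed yet.
        return "pending"
    return "purge"

entry = {"column": "state", "value": "TX", "created_at": 100.0}
print(check_index_entry(entry, {"state": "TX"}, now=105.0))
print(check_index_entry(entry, {"state": "CA"}, now=105.0))
print(check_index_entry(entry, None, now=200.0))
```

A reader would skip "pending" entries without deleting them, so a slow writer is never raced by the cleanup.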

----

I think I'm convinced that fully materialized views will not be able to be consistent (even eventually), since the nodes storing the base/view are probably in different scopes of serializability. BUT I'm sticking to the idea that the partitioned view that queries the base for the row content is the superior one.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844755#action_12844755 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

> Is it worth creating a secondary index that only contains local data, versus a distributed secondary index (a normal ColumnFamily?) 

I think my initial reasoning was wrong here.  I was anti-local-indexes because "we have to query the full cluster for any index lookup, since we are throwing away our usual partitioning scheme."

Which is true, but it ignores the fact that, in most cases, you will have to "query the full cluster" to get the actual matching rows, b/c the indexed rows will be spread across all machines.  So, having local indexes is better in the common case, since it actually saves a round trip from querying the index to querying the rows.

Also, having each node index the rows it has locally means you don't have to worry about sharding a very large index since it happens automatically.

Finally, it lets us use the local commitlog to keep index + data in sync.


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844872#action_12844872 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

> This is why we have load balancing.

Load balancing doesn't help if you are indexing something with less potential values than you have nodes in the cluster.  At the extreme, say booleans, it's probably not worth indexing vs just doing full scans.  But if you have 100s of nodes then not being able to usefully index something with 20 or 50 or 100 values kinda sucks.

> We could easily have a built in "SecondaryIndex" view class that uses a matching column name/value as the row key in the view. 

That would probably work, although I don't want to fall into the trap of overgeneralizing because it's sexy.  Nobody wants to write static java code to define a view, I can promise you that. :)

> Is the intention that the indexes would be used to speed up predicates/filters in get_range_slices

No, it's to add a different kind of predicate: "give me these columns [existing functionality] from rows that match this index condition [new functionality]."
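As a toy model of that predicate, the sketch below combines an index condition with a column selection over in-memory dictionaries. The function name and data layout are hypothetical and are not a proposal for the actual Thrift call.

```python
def slice_matching_rows(rows, index, column, value, wanted_columns):
    """Return the wanted columns from every row matching an index condition.

    rows:  {row_key: {column_name: value}}
    index: {(column_name, value): [row_key, ...]}
    """
    result = {}
    for key in index.get((column, value), []):   # new: index condition
        row = rows.get(key, {})
        # existing functionality: pick named columns out of a row
        result[key] = {c: row[c] for c in wanted_columns if c in row}
    return result

rows = {
    "u1": {"state": "TX", "name": "Ann", "age": 31},
    "u2": {"state": "CA", "name": "Bob", "age": 45},
    "u3": {"state": "TX", "name": "Cy"},
}
index = {("state", "TX"): ["u1", "u3"], ("state", "CA"): ["u2"]}
print(slice_matching_rows(rows, index, "state", "TX", ["name", "age"]))
```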


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844777#action_12844777 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

I'm leaning more and more towards "we should implement 2ary indexes + querying first, then later add full view support" since the former we can do w/o opening the whole user defined functions box which is a pretty big deal.


[jira] Updated: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-749:
-------------------------------------

    Fix Version/s:     (was: 0.6)
                   0.7


[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Filippo Fadda (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834855#action_12834855 ] 

Filippo Fadda commented on CASSANDRA-749:
-----------------------------------------

Thank you for your answer, Gary. Any plan for querying? Do you think querying will be included in 0.7?


[jira] Issue Comment Edited: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844811#action_12844811 ] 

Jonathan Ellis edited comment on CASSANDRA-749 at 3/13/10 5:47 AM:
-------------------------------------------------------------------

Step one is convert key from String to byte[], so that's symmetric with columns to avoid the problem Gary noted in his original patch.  (Should probably get its own ticket.)

      was (Author: jbellis):
    Step one is convert key from String to byte[], so that's symmetric.  (Should probably get its own ticket.)
  

[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847047#action_12847047 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

> I think you're making the "speed == scalability" mistake

No, I'm simply acknowledging that there's no such thing as "infinite scalability," and if this scales to the machine counts people actually deploy on then it's silly to do something more complex.


[jira] Updated: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-749:
-------------------------------

    Attachment: views-discussion-2.txt

> Which is true, but it ignores the fact that, in most cases, you will have to "query the full cluster" to get the actual matching rows
This was the point of the views being "semi-materialized". If your view contains all of the data you were interested in from the base row, and it matches a configured recency, then you don't need to query the base. Please see my comment re: "cribs" in the latest attached conversation.

> local indexes is better in the common case, since it actually saves a round trip from querying the index to querying the rows
I disagree. I would expect that the view would contain a large number of rows (typically depending on 1 row each), so querying for one row in the view would usually query one or two rows in the base: not necessarily thousands. Also, the partitioned index has much better best case performance: for the local secondary indexes, you _always_ need to query every unique range/endpoint in the cluster during the first phase, and then merge sort the results from all nodes before you can return a response for even a single row. Federating without partitioning will not scale.
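The two-phase cost of local indexes can be sketched as a scatter-gather followed by a merge. Nodes are modelled as plain dictionaries and there is no networking; the function name and data shapes are assumptions for illustration only.

```python
import heapq
from itertools import islice

def query_local_indexes(nodes, value, limit):
    """Query every node's local index, then merge the sorted results.

    nodes: list of dicts mapping indexed value -> sorted list of row keys.
    Returns (first `limit` row keys in key order, number of nodes contacted).
    """
    # Phase 1: every node must be asked, even if most have no match.
    per_node = [node.get(value, []) for node in nodes]
    # Phase 2: merge the per-node sorted lists into one sorted stream.
    merged = heapq.merge(*per_node)
    return list(islice(merged, limit)), len(per_node)

nodes = [{"TX": ["k03", "k09"]}, {"TX": ["k01"]}, {"CA": ["k02"]}]
print(query_local_indexes(nodes, "TX", limit=1))
```

Even with a limit of one row, the count of contacted nodes equals the cluster size, which is the best-case cost being objected to.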

Being able to implement these skinny rows (rather than the million column rows lazyboy attempts) depends on being able to support non-unique row keys, but that is basically just a compound key of the view-key and the base-key appended, as described on CASSANDRA-767.

> locally means you don't have to worry about sharding a very large index since it happens automatically
This is why we have load balancing.

> since the former we can do w/o opening the whole user defined functions box which is a pretty big deal
There is no need to allow for arbitrary functions initially if we take the same approach we've taken for comparators: to start, a new view would be defined by extending an abstract class. We could easily have a built in "SecondaryIndex" view class that uses a matching column name/value as the row key in the view.
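A sketch of what "extend an abstract class" could look like, with a built-in view that keys on a matching column name/value. The class names follow the wording above, but the method name and signature are invented here; the real comparator-style API is not shown in this thread.

```python
from abc import ABC, abstractmethod

class View(ABC):
    """Hypothetical base class: maps one base row to zero or more view row keys."""

    @abstractmethod
    def view_keys(self, base_key, columns):
        """Return the view row keys that the given base row should appear under."""

class SecondaryIndex(View):
    """Built-in view that uses a matching column name/value as the view row key."""

    def __init__(self, column):
        self.column = column

    def view_keys(self, base_key, columns):
        if self.column not in columns:
            return []  # rows without the column do not appear in the view
        return [(self.column, columns[self.column])]

index_view = SecondaryIndex("state")
print(index_view.view_keys("user1", {"state": "TX", "name": "Ann"}))
print(index_view.view_keys("user2", {"name": "Bob"}))
```

Custom views would subclass the same base, so the secondary index is just the simplest view rather than a separate mechanism.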

----

Without a way to use these secondary indexes in queries, they are completely pointless. Is the intention that the indexes would be used to speed up predicates/filters in get_range_slices, or are you proposing that the secondary index/view looks and acts like a normal column family, with all of the row content, but with the secondary key as the row key? The former seems pointless, and the latter seems like it should be implemented using the partitioned secondary index approach.


[jira] Updated: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-749:
-------------------------------------

    Fix Version/s:     (was: 0.7)
                   0.8

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829128#action_12829128 ] 

Gary Dusbabek commented on CASSANDRA-749:
-----------------------------------------

Any strong opinions on whether this should be supported for binary loading?

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829236#action_12829236 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

> I'm not sure why that would be useful.
A distributed secondary index would allow you to query one machine to figure out what other machines have columns matching a predicate, as opposed to a local secondary index where you immediately query every machine to figure out whether a predicate matches.
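
To make the difference in fan-out concrete, here is a toy model. The cluster layout, node names, and function names are invented for the sketch, and the "distributed index" is just an in-memory map from value to nodes.

```python
from typing import Dict, Set

Rows = Dict[str, Dict[str, str]]  # row key -> {column name: value}


def nodes_contacted_local(cluster: Dict[str, Rows]) -> Set[str]:
    """Local indexes: every node is asked, whether or not it holds a match."""
    return set(cluster)


def build_distributed_index(cluster: Dict[str, Rows],
                            column: str) -> Dict[str, Set[str]]:
    """Map each value of `column` to the set of nodes holding a matching row."""
    index: Dict[str, Set[str]] = {}
    for node, rows in cluster.items():
        for cols in rows.values():
            if column in cols:
                index.setdefault(cols[column], set()).add(node)
    return index


def nodes_contacted_distributed(index: Dict[str, Set[str]], value: str,
                                index_node: str) -> Set[str]:
    """Distributed index: ask the index holder, then only the matching nodes."""
    return {index_node} | index.get(value, set())
```

With four nodes and a single matching row, the local approach touches all four nodes while the distributed one touches the index holder plus one more.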

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829239#action_12829239 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

you mean, make each machine hold a copy of the full index?

that's worth thinking about, but it's not useful yet since we don't support index scans anyway.  (another ticket.)

let's keep this ticket's scope to just automating using another CF to look up keys by value.
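
For readers unfamiliar with the pattern being automated, a rough in-memory sketch follows. The class and method names are made up, plain dicts stand in for column families, and the read-before-write on overwrite is a simplification of this sketch rather than a statement about how the patch works.

```python
from collections import defaultdict


class IndexedColumnFamily:
    """Toy column family that keeps a second 'index CF' in step with writes."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # row key -> {column: value}
        self.index = defaultdict(set)  # indexed value -> set of row keys

    def insert(self, row_key, column, value):
        row = self.rows.setdefault(row_key, {})
        if column == self.indexed_column:
            old = row.get(column)
            if old is not None:
                # Overwrite: drop the stale index entry for the old value.
                self.index[old].discard(row_key)
            self.index[value].add(row_key)
        row[column] = value

    def keys_by_value(self, value):
        """Look up row keys by the value of the indexed column."""
        return sorted(self.index.get(value, ()))
```

Clients do this by hand today by writing to two column families; the point of the ticket is to have the server do the second write.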

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847044#action_12847044 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

> You have to do the same thing for the secondary index: presumably you actually want to find the content of the row that was indexed

Not if you denormalize into subcolumns, you don't.


> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch, views-discussion-2.txt, views-discussion.txt
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829273#action_12829273 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

It would appear that the HBase folks have had this exact same discussion, and have settled on two disparate packages for local and distributed secondary indexes: http://issues.apache.org/jira/browse/HBASE-2037

If we are going to settle on one (or some kind of hybrid?) we need to think more about use cases.

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834767#action_12834767 ] 

Gary Dusbabek commented on CASSANDRA-749:
-----------------------------------------

Filippo, you're conflating indexing and querying.  Querying will come later.  This patch merely does the indexing.  It included a simple search method for demonstration purposes.

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846970#action_12846970 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

Another point: local indexes allow us to do indexed inequality comparisons (birth_date > $year) trivially, since we can safely make the local index sstables OPP no matter what the cluster partitioner setting is.  Doing this for non-local indexes requires the cluster to be OPP, which experience has demonstrated is not what most people want to use despite its advantages.
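
A small sketch of why an order-preserving local index makes inequality predicates cheap. The class is hypothetical and uses sorted in-memory lists in place of index sstables; the only point is that a sorted structure answers "value > bound" with one binary search and a contiguous scan.

```python
import bisect


class LocalSortedIndex:
    """Per-node index kept sorted by indexed value, independent of the
    cluster partitioner (in-memory stand-in for ordered index sstables)."""

    def __init__(self):
        self._values = []  # indexed values, kept sorted
        self._keys = []    # row keys, parallel to _values

    def add(self, value, row_key):
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._keys.insert(pos, row_key)

    def greater_than(self, bound):
        """Row keys whose indexed value is strictly greater than bound."""
        start = bisect.bisect_right(self._values, bound)
        return self._keys[start:]
```

Each node would run the same scan over its own index, and the coordinator would merge the per-node results.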

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch, views-discussion-2.txt, views-discussion.txt
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844809#action_12844809 ] 

Chris Goffinet commented on CASSANDRA-749:
------------------------------------------

+1 Jonathan. I'd love to help, what can I do to start?

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch, views-discussion.txt
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844970#action_12844970 ] 

Gary Dusbabek commented on CASSANDRA-749:
-----------------------------------------

>Step one is convert key from String to byte[], so that's symmetric with columns to avoid the problem Gary noted in his original patch.

We also still have the (minor) problem of describing the indexed column in storage-conf. <Index OnHex="0xabcd"/> or something like that ought to work.
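
A sketch of how such an attribute could be read. OnHex is only the proposal made in this comment, not an existing storage-conf option, and the helper name is invented; the fallback to a plain On attribute assumes string column names are declared that way.

```python
import xml.etree.ElementTree as ET


def indexed_column_name(index_xml: str) -> bytes:
    """Return the indexed column name as bytes from an <Index> element.

    OnHex (proposed, hypothetical) carries a hex-encoded binary name;
    On carries a plain string name.
    """
    elem = ET.fromstring(index_xml)
    if "OnHex" in elem.attrib:
        hex_text = elem.attrib["OnHex"]
        if hex_text.lower().startswith("0x"):
            hex_text = hex_text[2:]  # strip the optional 0x prefix
        return bytes.fromhex(hex_text)
    return elem.attrib["On"].encode("utf-8")
```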

As to the locality discussion: I'm still in favor of keeping them node-local, at least for the way we intend to use them.  (Please give me colB for rows where colA=foo and the row key is unknown.)

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch, views-discussion-2.txt, views-discussion.txt
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846968#action_12846968 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

It's worth pointing out that our row bloom filter rejects requests for non-existent rows very efficiently, so the overhead of sending requests to all nodes for local indexes (or at least nodes / RF) when cardinality is high is lower than it looks at first.
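
For anyone who has not looked at how a bloom filter gives that cheap rejection, here is a minimal one. The sizes, the SHA-256-derived hash positions, and the class itself are assumptions of this sketch and are not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting the key with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means the key is definitely absent, so no disk read is needed."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A node whose filter answers False for an index row can reply without touching disk, which is why asking every node costs less than it first appears.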

So in practice I strongly suspect this will scale at least to hundreds of nodes if not thousands, so saying "we can't do this because it won't scale" is not a strong argument.

And when you are doing requests against an "index on a single node," the consistency problem is worse than you think.  There's no way to make it consistent with a batch m/r without a Big Lock against the CF being indexed, since if you are examining an index entry w/ no matching "natural" entry, you have no way of knowing whether that's because another process is about to clean out the index entry or about to add the natural entry.  So you have to check each index hit for validity *each* *time*, which is a huge hit.  (And allowing the user to say "stale data is okay" is wrong, because it's not "eventually consistent": once out of sync it will stay that way.)
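
To show where the per-hit cost comes from, a sketch of the validity check is below. The function name and the dict standing in for the base CF are illustrative only.

```python
def valid_hits(index_hits, base_rows, column, value):
    """Keep only index hits whose base row still has column == value.

    Every hit costs one extra read against the base data, which is the
    overhead being described.
    """
    confirmed = []
    for row_key in index_hits:
        row = base_rows.get(row_key)  # one base read per index hit
        if row is not None and row.get(column) == value:
            confirmed.append(row_key)
    return confirmed
```

Note that the check can only filter out bad hits at read time; it cannot tell whether a missing natural entry is about to be written or the index entry is about to be removed.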

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: 0001-simple-secondary-indices.patch, views-discussion-2.txt, views-discussion.txt
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829403#action_12829403 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

I've been thinking about this more, and I don't think implementing secondary indexes is worth it: distributed or otherwise. Instead, I think the 'view' approach that CouchDB and Riak have taken is definitely superior.

For instance, it is easy to implement a secondary index as a view of a ColumnFamily: the key for the view is the value of the indexed column, and the value for the view is the key of the original row. But views are considerably more powerful, since you can store any item in the key or value for the view.

Also, a view is more conducive to duplication of data, which we prefer in Cassandra: rather than having secondary indexes pointing to the one true copy of the data, you can duplicate that data in a view if you'd like, and have it be lazily/eagerly updated serverside.

Yes, views might mean a server-side scripting language, or an easy way to plug in and configure Java view classes. It might even mean map-reduce.

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833472#action_12833472 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

Thoughts on incrementally updated views:

In a first version a View CF could be defined with a normal column family definition, plus a function that transforms a mutation against a 'base' column family into zero or more mutations against the 'view' column family.

At mutation time, inserts can immediately be transformed into inserts to the view. But, an insert that overwrites an older value implies deletion from the base, and therefore a potential deletion from the view.

One disadvantage/advantage Cassandra has is that an entire row is not available at mutation time, so we need to defer all deletes to the view until read time. To handle deletes, all columns in the view cf should be tagged with the column_key from the base cf that caused their creation.

At read time, the base cf column_keys that are tagging the columns in the view need to be queried and the view function re-applied. If the output of the view function no longer causes an insert to the column in question in the view, the column can be deleted from the view.

Definitely need to think of ways to efficiently account for deletes, and minimize the number of reads to the base cf that need to occur at read time for the view cf.
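
A toy version of the scheme described above, to make the moving parts visible. Everything here is hypothetical: the class, the view function signature (row key, column, value) -> list of (view key, view column), and the use of dicts for both column families.

```python
class IncrementalView:
    """Toy base CF plus view CF with deletes deferred to read time."""

    def __init__(self, view_fn):
        self.view_fn = view_fn
        self.base = {}  # row key -> {column: value}
        self.view = {}  # view key -> {view column: (base row key, base column)}

    def mutate(self, row_key, column, value):
        """Apply a base mutation and immediately insert its view entries."""
        self.base.setdefault(row_key, {})[column] = value
        for view_key, view_column in self.view_fn(row_key, column, value):
            # Tag each view column with the base column that created it.
            self.view.setdefault(view_key, {})[view_column] = (row_key, column)

    def read(self, view_key):
        """Re-check each tagged entry against the base and drop stale ones."""
        row = self.view.get(view_key, {})
        for view_column, (base_key, base_column) in list(row.items()):
            current = self.base.get(base_key, {}).get(base_column)
            still_valid = (current is not None and (view_key, view_column)
                           in self.view_fn(base_key, base_column, current))
            if not still_valid:
                del row[view_column]
        return sorted(row)


def by_state(row_key, column, value):
    """Example view function: index rows by the value of their 'state' column."""
    return [(value, row_key)] if column == "state" else []
```

In this sketch an overwrite leaves the old view entry in place until the next read of that view row, at which point one base read per tagged column decides whether it survives. That read amplification is the cost that would need to be minimized.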

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829231#action_12829231 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

> Is it worth creating a secondary index that only contains local data, versus a distributed secondary index (a normal ColumnFamily?) 

I'm not sure why that would be useful.

What we're trying to do here is move to the server a pattern that can be more efficiently done there, and so every client doesn't have to reimplement it manually.

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829229#action_12829229 ] 

Stu Hood commented on CASSANDRA-749:
------------------------------------

Is it worth creating a secondary index that only contains local data, versus a distributed secondary index (a normal ColumnFamily?)

Also, adding an example/dummy predicate that uses the index would be useful.

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: 0001-simple-secondary-indices.patch
>
>



[jira] Commented: (CASSANDRA-749) Secondary indices for column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829132#action_12829132 ] 

Jonathan Ellis commented on CASSANDRA-749:
------------------------------------------

it should be "supported" in the sense that if you want to load the index rows yourself you should be able to do that.  but we shouldn't try to create indexes from the serialized row blobs sent to bmt.

> Secondary indices for column families
> -------------------------------------
>
>                 Key: CASSANDRA-749
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-749
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Gary Dusbabek
>            Assignee: Gary Dusbabek
>            Priority: Minor
>             Fix For: 0.6
>
>

