Posted to commits@cassandra.apache.org by "T Jake Luciani (JIRA)" <ji...@apache.org> on 2011/07/18 20:05:57 UTC

[jira] [Created] (CASSANDRA-2915) Lucene based Secondary Indexes

Lucene based Secondary Indexes
------------------------------

                 Key: CASSANDRA-2915
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: T Jake Luciani
             Fix For: 1.0


Secondary indexes (Type KEYS) currently suffer from a number of limitations in their current form:

   - Multiple IndexClauses only work when there is a subset of rows under the highest clause
   - One new column family is created per index; this means 10 new CFs for 10 secondary indexes

This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.
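
As a rough illustration of what "one index per CF" with multiple index clauses could look like, here is a minimal Python sketch. It is not Lucene and not the proposed implementation: the ColumnFamilyIndex class and its methods are hypothetical names, and it only models an AND of clauses as an intersection of posting lists.

```python
from collections import defaultdict


class ColumnFamilyIndex:
    """Toy single index for one column family: (column, value) -> row keys."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, row_key, columns):
        # Index every column of the row in the same structure.
        for column, value in columns.items():
            self.postings[(column, value)].add(row_key)

    def query_all(self, clauses):
        """Return the row keys that match every (column, value) clause."""
        if not clauses:
            return set()
        # Start from the smallest posting list to keep the intersection cheap.
        lists = sorted((self.postings.get(c, set()) for c in clauses), key=len)
        result = set(lists[0])
        for posting in lists[1:]:
            result &= posting
        return result


index = ColumnFamilyIndex()
index.add("u1", {"state": "TX", "age": 30})
index.add("u2", {"state": "TX", "age": 41})
index.add("u3", {"state": "CA", "age": 30})
matches = index.query_all([("state", "TX"), ("age", 30)])
```

Here no clause has to be treated as the "highest" one; every clause is answered from the same index.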

There are a few parallels we can draw between Cassandra and Lucene.

Lucene indexes segments in memory and then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be synced as well.
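
The parallel between flush/compaction and Lucene's segment handling can be sketched as follows. This is a toy model under stated assumptions, not Lucene code: SegmentedIndex, flush() and merge_segments() are hypothetical names standing in for an in-memory segment, a flush tied to the memtable flush, and an optimize()-style merge.

```python
class SegmentedIndex:
    """Toy index with one mutable in-memory segment and immutable flushed ones."""

    def __init__(self):
        self.buffer = {}    # in-memory segment: term -> set of row keys
        self.segments = []  # flushed, immutable segments

    def add(self, term, row_key):
        self.buffer.setdefault(term, set()).add(row_key)

    def flush(self):
        """Meant to be called alongside a memtable flush."""
        if self.buffer:
            frozen = {t: frozenset(k) for t, k in self.buffer.items()}
            self.segments.append(frozen)
            self.buffer = {}

    def merge_segments(self):
        """Rough analogue of compaction / optimize(): merge segments into one."""
        merged = {}
        for segment in self.segments:
            for term, keys in segment.items():
                merged.setdefault(term, set()).update(keys)
        if merged:
            self.segments = [{t: frozenset(k) for t, k in merged.items()}]
        else:
            self.segments = []

    def search(self, term):
        hits = set(self.buffer.get(term, ()))
        for segment in self.segments:
            hits |= segment.get(term, frozenset())
        return hits


seg = SegmentedIndex()
seg.add("tx", "u1")
seg.flush()
seg.add("tx", "u2")
seg.add("ca", "u3")
seg.flush()
segments_before_merge = len(seg.segments)
seg.merge_segments()
```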

We will also need to correlate column validators to Lucene tokenizers so the data can be stored properly. The big win is that once this is done, we can perform complex queries within a column, such as wildcard searches.
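
A minimal sketch of the validator-to-tokenizer idea, assuming a simple lookup table: the TOKENIZERS mapping and both helper functions are hypothetical, the tokenizers are plain Python functions rather than Lucene tokenizers, and the wildcard match uses shell-style patterns from the standard library only to illustrate the kind of query this enables.

```python
import fnmatch

# Hypothetical mapping from a column validator name to a tokenizer function.
TOKENIZERS = {
    "UTF8Type": lambda value: value.lower().split(),
    "LongType": lambda value: [str(int(value))],
}


def index_column(postings, row_key, column, validator, value):
    """Tokenize the value according to its validator and record the terms."""
    for term in TOKENIZERS[validator](value):
        postings.setdefault((column, term), set()).add(row_key)


def wildcard_search(postings, column, pattern):
    """Return row keys with a term in the column matching the pattern."""
    hits = set()
    for (col, term), keys in postings.items():
        if col == column and fnmatch.fnmatchcase(term, pattern):
            hits |= keys
    return hits


postings = {}
index_column(postings, "u1", "name", "UTF8Type", "Todd Nine")
index_column(postings, "u2", "name", "UTF8Type", "Tom Jones")
index_column(postings, "u2", "visits", "LongType", "42")
prefix_hits = wildcard_search(postings, "name", "to*")
```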

The downside of this approach is that we will need to read before write, since documents in Lucene are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.
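
The read-before-write cost can be made concrete with a small sketch. WholeDocumentIndex is a hypothetical toy class, not Lucene; it mimics the whole-document replacement model (Lucene's IndexWriter.updateDocument, for instance, deletes the matching documents and adds the replacement) and counts the reads that single-column updates force.

```python
class WholeDocumentIndex:
    """Toy index where a document can only be replaced as a whole."""

    def __init__(self):
        self.docs = {}  # row key -> full document (column -> value)
        self.reads = 0

    def update_column(self, row_key, column, value):
        # Read: the existing document is needed because it is rewritten whole.
        self.reads += 1
        doc = dict(self.docs.get(row_key, {}))
        doc[column] = value
        # Write: replace the old document with the updated copy.
        self.docs[row_key] = doc


whole = WholeDocumentIndex()
whole.update_column("u1", "email", "todd@example.com")
whole.update_column("u1", "name", "Todd")
```

Each single-column mutation costs one read of the full document, which is the overhead described above.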



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079653#comment-13079653 ] 

Todd Nine edited comment on CASSANDRA-2915 at 8/4/11 10:33 PM:
---------------------------------------------------------------

Hey guys.  We're doing something similar in the hector JPA plugin. 

Would using dynamic composites within Cassandra alleviate the need for Lucene documents?  We're using this in secondary indexing and it gives us order by semantics and AND (Union).  The largest issue becomes iteration with OR clauses.  AND clauses can be compressed into a single column for efficient range scans; we then use iterators to UNION the OR trees together with order clauses in the composites.  The caveat is that the user must define indexes with order semantics up front.  However, this can easily be added to the existing secondary indexing clauses.
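
One way to picture the composite approach, as a hedged sketch rather than the hector JPA plugin's actual code: index entries are modeled as sorted tuples (a stand-in for composite column names), an AND over the leading components becomes a single prefix range scan, and OR branches are combined by merging sorted iterators. The function names and sample data are hypothetical.

```python
import heapq
from bisect import bisect_left

# Composite entries (last_name, first_name, row_key), kept in sorted order.
entries = sorted([
    ("nine", "todd", "u1"),
    ("luciani", "jake", "u2"),
    ("nine", "tim", "u3"),
    ("nine", "todd", "u4"),
])


def scan_and(entries, last_name, first_name):
    """AND of two equality clauses as one contiguous range scan."""
    prefix = (last_name, first_name)
    start = bisect_left(entries, prefix)
    rows = []
    for entry in entries[start:]:
        if entry[:2] != prefix:
            break
        rows.append(entry[2])
    return rows


def union_or(*sorted_row_lists):
    """OR of several branches: merge sorted results and drop duplicates."""
    result = []
    for row in heapq.merge(*sorted_row_lists):
        if not result or result[-1] != row:
            result.append(row)
    return result


todds = scan_and(entries, "nine", "todd")
either = union_or(todds, scan_and(entries, "luciani", "jake"))
```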

> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>              Labels: secondary_index
>             Fix For: 1.0
>
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:
>    - Multiple IndexClauses only work when there is a subset of rows under the highest clause
>    - One new column family is created per index; this means 10 new CFs for 10 secondary indexes
> This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene indexes segments in memory and then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be synced as well.
> We will also need to correlate column validators to Lucene tokenizers so the data can be stored properly. The big win is that once this is done, we can perform complex queries within a column, such as wildcard searches.
> The downside of this approach is that we will need to read before write, since documents in Lucene are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092606#comment-13092606 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------


I don't necessarily think there is a 1 to 1 relationship between a column and a Lucene document field.  In our case we have the need to index fields in more than one manner.  For instance, we index users as straight strings (lowercased) with email, first name and last name columns.  However, we also want to tokenize the email, first and last name columns to allow our customer support people to perform partial name matching.  I think a 1 to N mapping is required for column to document field to allow this sort of functionality.
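
A small sketch of what a 1 to N column-to-field mapping could look like. Everything here is hypothetical (the FIELD_MAPPINGS table, the field names, and the two analysis functions); it only shows one column feeding both an exact lowercased field and a tokenized field.

```python
import re


def exact(value):
    """Index the whole value as a single lowercased term."""
    return [value.lower()]


def tokens(value):
    """Split the lowercased value on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]


# Hypothetical mapping: one column -> several (field name, analysis) pairs.
FIELD_MAPPINGS = {
    "email": [("email_exact", exact), ("email_tokens", tokens)],
    "last_name": [("last_name_exact", exact), ("last_name_tokens", tokens)],
}


def build_document(row):
    doc = {}
    for column, value in row.items():
        for field_name, analyze in FIELD_MAPPINGS.get(column, []):
            doc[field_name] = analyze(value)
    return doc


doc = build_document({"email": "Todd.Nine@Example.com", "last_name": "Nine"})
```

The exact field serves equality lookups, while the tokenized field is what partial name matching would query.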

As for expiration on columns, is there a system event that we can hook into to force a document reindex when a column expires, rather than adding an additional field that queries will need to filter on?

As per Jason's previous post, I think supporting ORDER BY, GROUP BY, COUNT, LIKE, etc. is a must.  Most users have become accustomed to this functionality with RDBMS.  If these features cause potential performance problems, I think this should be documented so that users have enough information to determine if they can rely on the Lucene index or should build their own index directly.


Lastly, this is a huge feature for the hector-jpa plugin.  What can I do to help?




[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079745#comment-13079745 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

bq. Will read after write be available? I.e., if your mutation for the row key returns to the client, then the row now has an entry in the Lucene index, which can immediately be queried to return the results.

Yes.  We can use a RAMDirectory() to keep writes real-time.

bq. What about durability, in the event cassandra crashes, will the Lucene index retain these indexed values, or will they be lost if commit is not invoked on the index?

When the memtable is flushed, we will merge the RAMDirectory index into the FSDirectory index and call reopen().
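
The two answers above can be sketched together as a two-tier index. This is a toy model, not Lucene: in the hypothetical RealTimeIndex class, the "ram" dictionary stands in for a RAMDirectory-backed index and "disk" for an FSDirectory-backed one, and on_memtable_flush() stands in for the merge-and-reopen step.

```python
class RealTimeIndex:
    """Toy two-tier index: writes land in 'ram', flushes move them to 'disk'."""

    def __init__(self):
        self.ram = {}   # term -> row keys written since the last flush
        self.disk = {}  # term -> row keys already merged to the on-disk index

    def write(self, term, row_key):
        self.ram.setdefault(term, set()).add(row_key)

    def search(self, term):
        # Searching both tiers is what makes a write visible immediately.
        return self.ram.get(term, set()) | self.disk.get(term, set())

    def on_memtable_flush(self):
        # Merge the in-memory tier into the on-disk tier, then start fresh.
        for term, keys in self.ram.items():
            self.disk.setdefault(term, set()).update(keys)
        self.ram = {}


rt = RealTimeIndex()
rt.write("state:TX", "u1")
visible_before_flush = rt.search("state:TX")
rt.on_memtable_flush()
visible_after_flush = rt.search("state:TX")
```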


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079952#comment-13079952 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

Another issue we need to work around is expiring columns. We could store the expiration time in the document and make it a constraint on the Lucene query so we don't pull expired data.
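
A minimal sketch of the expiration-as-constraint idea, assuming each indexed document carries a hypothetical "expires_at" field (None meaning the column never expires). The documents are plain dictionaries and search_live is an illustrative helper, not a Lucene query.

```python
def search_live(docs, column, value, now):
    """Return row keys whose column matches and whose expiration has not passed."""
    hits = []
    for doc in docs:
        expires_at = doc.get("expires_at")  # None means no expiration
        if doc.get(column) == value and (expires_at is None or expires_at > now):
            hits.append(doc["row_key"])
    return hits


docs = [
    {"row_key": "u1", "state": "TX", "expires_at": 100},
    {"row_key": "u2", "state": "TX", "expires_at": None},
    {"row_key": "u3", "state": "TX", "expires_at": 50},
]
live = search_live(docs, "state", "TX", now=75)
```

The expired entry stays in the index until it is cleaned up, but the extra constraint keeps it out of the results.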


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079717#comment-13079717 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. This currently works by executing the query locally; if that does not have enough results, it moves on to the next node. 

Ok.  Typically in distributed search one needs/wants to send the request to all of the possible nodes that contain data pertinent to the query.  Is this possible?
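
The difference between the two strategies can be sketched as follows. This is a hedged toy model: nodes are plain lists of scored documents, and query_sequential and query_all_nodes are hypothetical helpers. It only illustrates why a ranked search usually wants to consult every node that may hold matching data.

```python
import heapq


def query_sequential(nodes, predicate, limit):
    """Ask nodes one at a time and stop once enough results are collected."""
    results = []
    for node in nodes:
        results.extend(doc for doc in node if predicate(doc))
        if len(results) >= limit:
            break
    return results[:limit]


def query_all_nodes(nodes, predicate, limit):
    """Ask every node, then keep the best-scoring results overall."""
    candidates = [doc for node in nodes for doc in node if predicate(doc)]
    return heapq.nlargest(limit, candidates, key=lambda doc: doc["score"])


nodes = [
    [{"id": "a", "score": 0.2}, {"id": "b", "score": 0.4}],
    [{"id": "c", "score": 0.9}],
]
match_all = lambda doc: True
first_two = [d["id"] for d in query_sequential(nodes, match_all, limit=2)]
best_two = [d["id"] for d in query_all_nodes(nodes, match_all, limit=2)]
```

In this example the sequential strategy never reaches the node holding the best-scoring document.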

bq. In the meantime we need to think of how to link Lucene analyzers to column_metadata

Can we simply define a class that intercepts row updates for a column family?  That class can then implement whatever is needed to analyze the columns / row.
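
A sketch of the interception idea, with hypothetical names throughout (RowUpdateListener, ToyColumnFamily, LowercaseTermIndexer); none of these are Cassandra classes. The column family notifies registered listeners on every mutation, and the listener decides how to analyze the columns.

```python
class RowUpdateListener:
    """Hypothetical hook that is notified of every row update."""

    def on_row_update(self, row_key, columns):
        raise NotImplementedError


class ToyColumnFamily:
    """Toy column family that forwards mutations to its listeners."""

    def __init__(self):
        self.rows = {}
        self.listeners = []

    def apply_mutation(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)
        for listener in self.listeners:
            listener.on_row_update(row_key, columns)


class LowercaseTermIndexer(RowUpdateListener):
    """Example listener: lowercases and whitespace-splits each column value."""

    def __init__(self):
        self.postings = {}

    def on_row_update(self, row_key, columns):
        for column, value in columns.items():
            for term in str(value).lower().split():
                self.postings.setdefault((column, term), set()).add(row_key)


users = ToyColumnFamily()
indexer = LowercaseTermIndexer()
users.listeners.append(indexer)
users.apply_mutation("u1", {"name": "Todd Nine"})
```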


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079776#comment-13079776 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. Yes. We can use a RAMDirectory() to keep writes real-time.

LUCENE-3092 implemented NRTCachingDirectory, which we can use for in-RAM near-real-time (NRT) search until LUCENE-2312 is completed.




[jira] [Assigned] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani reassigned CASSANDRA-2915:
-----------------------------------------

    Assignee:     (was: Jason Rutherglen)
    

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079947#comment-13079947 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

LUCENE-2454 adds support for nested documents. We can perhaps use this to avoid the read before write.  We could create a document per field and nest them together under a row-level parent doc.
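
The document-per-field idea can be sketched like this. NestedFieldIndex is a hypothetical toy class that keys each child document by (row key, column), so replacing one field does not require reading the others. It models the idea only; whether Lucene's nested-document support allows a single child to be replaced without reindexing its parent block is an assumption that would need to be checked.

```python
class NestedFieldIndex:
    """Toy index with one child document per field, grouped by parent row key."""

    def __init__(self):
        self.children = {}  # (row_key, column) -> value

    def update_column(self, row_key, column, value):
        # Only the child for this field is replaced; no other field is read.
        self.children[(row_key, column)] = value

    def parents_matching(self, column, value):
        """Return the parent row keys that have a matching child document."""
        return sorted(
            row for (row, col), v in self.children.items()
            if col == column and v == value
        )


nested = NestedFieldIndex()
nested.update_column("u1", "state", "TX")
nested.update_column("u1", "age", 30)
nested.update_column("u1", "state", "CA")
```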


[jira] [Updated] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani updated CASSANDRA-2915:
--------------------------------------

    Description: 
Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:

   - Multiple IndexClauses only work when there is a subset of rows under the highest clause
   - One new column family is created per index; this means 10 new CFs for 10 secondary indexes

This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.

There are a few parallels we can draw between Cassandra and Lucene.

Lucene indexes segments in memory and then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be synced as well.

We will also need to correlate column validators to Lucene tokenizers so the data can be stored properly. The big win is that once this is done, we can perform complex queries within a column, such as wildcard searches.

The downside of this approach is that we will need to read before write, since documents in Lucene are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.




[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093344#comment-13093344 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------

I think forcing users to install classes for common use cases would cause issues with adoption.  What about creating new CQL commands to handle this?  When creating an index in a db, you would define the fields and the manner in which they are indexed.  Could we do something like the following?


create index [colname] in [colfamily] using [index type 1] as [indexFieldName], [index type 2] as [indexFieldName], [index type n] as [indexFieldName]?

drop index [indexFieldName] in [colfamily] on [colname]



This way clients such as JPA can update and create indexes, without the need to install custom classes on Cassandra itself.  They also have the ability to directly reference the field name when using CQL queries.

Assuming that the index class types exist in the Lucene classpath, you get the 1 to many mappings for column to indexing strategy.  This would allow more advanced clients such as the JPA plugin to automatically add indexes to the document based on indexes defined on persistent fields, without generating any code the user has to install in the Cassandra runtime.  If users want to install custom analyzers, they still have the option to do so, and would gain access to it via CQL.
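
To make the proposed statement shape concrete, here is a small parser sketch for the "create index ... using ... as ..." form suggested above. The syntax is only a proposal in this thread, not existing CQL, and the index type names in the example ("keyword", "standard") and the parse_create_index helper are hypothetical.

```python
import re

# Matches: create index <column> in <column family> using <mappings>
CREATE_INDEX = re.compile(
    r"^\s*create\s+index\s+(\w+)\s+in\s+(\w+)\s+using\s+(.+?)\s*;?\s*$",
    re.IGNORECASE,
)


def parse_create_index(statement):
    """Return (column, column_family, [(index_type, field_name), ...])."""
    match = CREATE_INDEX.match(statement)
    if not match:
        raise ValueError("not a create index statement")
    column, column_family, mappings = match.groups()
    fields = []
    for part in mappings.split(","):
        # Each mapping must look like "<index type> as <field name>".
        index_type, keyword, field_name = part.split()
        if keyword.lower() != "as":
            raise ValueError("expected '<index type> as <field name>'")
        fields.append((index_type, field_name))
    return column, column_family, fields


parsed = parse_create_index(
    "create index email in users using keyword as email_exact, "
    "standard as email_tokens"
)
```

The result carries the 1 to many mapping from one column to several named index fields.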


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069110#comment-13069110 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. I'd like to avoid writing two copies of the base data

Cassandra only needs to store the row UID as a Lucene document.
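
A minimal sketch of that idea, in plain Python rather than Lucene: the index maps terms to row keys only, and column values are read back from the base row. The names `rows`, `index_row`, and `lookup` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Base data: row key -> {column name: value}. Stands in for the column family.
rows = {
    "user1": {"city": "austin", "state": "tx"},
    "user2": {"city": "dallas", "state": "tx"},
}

# Index: (column, value) -> set of row keys. Values appear only as index
# terms; the only thing a search returns is the row key.
index = defaultdict(set)

def index_row(key, columns):
    for name, value in columns.items():
        index[(name, value)].add(key)

def lookup(name, value):
    # Resolve matching keys in the index, then read the base rows.
    return {key: rows[key] for key in sorted(index[(name, value)])}

for key, columns in rows.items():
    index_row(key, columns)

result = lookup("state", "tx")
```

With this layout the base data is written once; the index holds only enough to find the row again.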


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Ed Anuff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093341#comment-13093341 ] 

Ed Anuff commented on CASSANDRA-2915:
-------------------------------------

+1 on having the ability to provide a conversion class for handling transformations from columns to Lucene documents.  It's not uncommon for people to store objects serialized to JSON or some other serialization format in columns.  CQL will have to catch up with this practice at some point.
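
As a rough illustration of what such a conversion class might do, the sketch below flattens a JSON-serialized column value into (field, value) pairs that an indexer could consume. The function name `json_column_to_fields` and the dotted field naming are assumptions, not part of any Cassandra or Lucene API.

```python
import json

def json_column_to_fields(column_name, raw_value):
    """Flatten a JSON-serialized column value into (field, value) pairs."""
    fields = []

    def walk(prefix, node):
        if isinstance(node, dict):
            # Nested objects extend the field name with a dotted path.
            for key, child in sorted(node.items()):
                walk(f"{prefix}.{key}", child)
        elif isinstance(node, list):
            # Each list element becomes another value of the same field.
            for child in node:
                walk(prefix, child)
        else:
            fields.append((prefix, str(node)))

    walk(column_name, json.loads(raw_value))
    return fields

fields = json_column_to_fields(
    "profile",
    '{"name": "Ann", "tags": ["a", "b"], "address": {"zip": 78701}}',
)
```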


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080008#comment-13080008 ] 

Ryan King commented on CASSANDRA-2915:
--------------------------------------

Regarding realtime search, hasn't our (Twitter's) realtime search branch been merged into Lucene trunk? Whenever that's available we should get real realtime results.


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067228#comment-13067228 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

bq. We need to specify how configuration parameters are passed into the Lucene secondary index. This needs to include things like the local Lucene file path, a class to transform Cassandra CF rows into Lucene documents, etc.

The secondary indexes would go into the data directory defined in cassandra.yaml. Currently there is a dir per keyspace; we can create a subdir like "indexes" where the Lucene indexes are stored.

As for transforms, I mentioned column validators. These are meta information about the contents of columns; see http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes

This validation_class can be extended to let users map columns to Lucene analyzers.

The document would be a row; its fields would be columns (with analyzers specified in the column metadata validation_class).
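
To make the row-as-document mapping concrete, here is a small sketch in plain Python. The two analyzer functions and the `ANALYZERS` table are hypothetical stand-ins for Lucene analyzers; only the validator names `UTF8Type` and `LongType` come from Cassandra, and their pairing with analyzers is an assumption.

```python
import re

def keyword_analyzer(value):
    # Index the whole value as a single term.
    return [value]

def text_analyzer(value):
    # Lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]

# Hypothetical mapping from a column's validation_class to an analyzer.
ANALYZERS = {
    "UTF8Type": text_analyzer,
    "LongType": keyword_analyzer,
}

def row_to_document(row, column_metadata):
    """Build {field: [terms]} for the columns that have metadata."""
    document = {}
    for name, value in row.items():
        validator = column_metadata.get(name)
        if validator is None:
            continue  # columns without metadata are not indexed
        document[name] = ANALYZERS[validator](str(value))
    return document

doc = row_to_document(
    {"bio": "Likes Lucene, Cassandra", "age": 42, "blob": "x"},
    {"bio": "UTF8Type", "age": "LongType"},
)
```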




[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093748#comment-13093748 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

bq. I think supporting ORDER BY, GROUP BY, COUNT, LIKE etc are a must.

I don't think GROUP BY and ORDER BY are something we want to support using secondary indexes.  The scatter-gather this would require in Cassandra would be a performance killer and would promote bad data-modeling practices.

The goal of this ticket is to support Lucene search features with the current secondary index API.

We can add LIKE, OR, NOT, BETWEEN with this.
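
A toy sketch of how these operators reduce to set operations over an inverted index for a single column. This is plain Python, not the Lucene query API; `postings`, `like`, `between`, `or_`, and `not_` are illustrative names only.

```python
from fnmatch import fnmatchcase

# Toy index for one column: term -> set of row keys.
postings = {
    "apple": {"r1"},
    "apricot": {"r2"},
    "banana": {"r3"},
    "cherry": {"r1", "r4"},
}
all_rows = set().union(*postings.values())

def like(pattern):
    # Wildcard match over the indexed terms, e.g. "ap*".
    return set().union(
        *(keys for term, keys in postings.items() if fnmatchcase(term, pattern))
    )

def between(low, high):
    # Inclusive range over the indexed terms.
    return set().union(
        *(keys for term, keys in postings.items() if low <= term <= high)
    )

def or_(a, b):
    return a | b

def not_(a):
    # NOT is evaluated against the set of all indexed rows.
    return all_rows - a

matches = or_(like("ap*"), between("b", "c"))
```

Each operator works on the term dictionary or on sets of row keys, so none of them needs to scan the base rows.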


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067234#comment-13067234 ] 

Jonathan Ellis commented on CASSANDRA-2915:
-------------------------------------------

Right.  I didn't mean to imply this solves read-before-write, only that I'd like to avoid writing two copies of the base data.


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067205#comment-13067205 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

Jake, this looks good.  We need to specify how configuration parameters are passed into the Lucene secondary index.  This needs to include things like the local Lucene file path, a class to transform Cassandra CF rows into Lucene documents, etc.


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067231#comment-13067231 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

bq. Could we go for a deeper level of integration? Instead of storing the data twice as Cassandra row + Lucene document, use the row as the document Source Of Truth, and just let Lucene handle the indexes?

Yes, sure, but that still requires constructing the full row before writing it to the index, since the client may be updating only field 1 while the indexes are on field 1 and field 2.
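
A short sketch of that read-before-write step, using plain dicts in place of the column family and the index. The names `rows`, `documents`, `indexed_fields`, and `write` are hypothetical.

```python
# Base rows and a per-row "document" index; both are plain dicts here.
rows = {"k1": {"field1": "a", "field2": "b", "other": "z"}}
indexed_fields = ("field1", "field2")
documents = {}

def build_document(row):
    return {name: row[name] for name in indexed_fields if name in row}

def write(key, updates):
    # Read the current row first: the update may touch only field1,
    # but the replacement document must still carry field2.
    current = dict(rows.get(key, {}))
    current.update(updates)
    rows[key] = current
    documents[key] = build_document(current)  # the whole document is replaced

write("k1", {"field1": "a2"})
```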


[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093263#comment-13093263 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------

Could we also use this feature as a standard way of building our Lucene documents?  This would accomplish what we want, as well as giving a hook for more user functionality.

CASSANDRA-1311



[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079712#comment-13079712 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

Todd: once CASSANDRA-2982 is done we can get started. I'm trying to focus on that right now.  In the meantime we need to think about how to link Lucene analyzers to column_metadata.

Jason: This currently works by executing the query locally; if that does not return enough results, it moves on to the next node. Since the ring is split, we know the range of keys to restrict the search to on each node, which avoids dups.
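
The node-by-node behaviour described here can be sketched roughly as follows. This is a simplified model, not Cassandra's actual code path: integer keys stand in for tokens, the half-open `[start, end)` ranges are an assumption, and `search` is a hypothetical name.

```python
def search(nodes, predicate, limit):
    """nodes: list of (key_range, rows) in ring order, local node first."""
    results = []
    for (start, end), node_rows in nodes:
        for key in sorted(node_rows):
            # Only accept keys inside the range this node is asked about,
            # so a replica of another range's row is not returned twice.
            if start <= key < end and predicate(node_rows[key]):
                results.append(key)
                if len(results) == limit:
                    return results  # enough results; later nodes are skipped
    return results

nodes = [
    ((0, 100), {10: "x", 20: "y", 150: "x"}),    # local node; 150 is a replica
    ((100, 200), {150: "x", 160: "x", 20: "y"}),
]

first_two = search(nodes, lambda value: value == "x", 2)
everything = search(nodes, lambda value: value == "x", 10)
```

Key 150 is held by both nodes but appears once in the results, because only the node queried for the range 100 to 200 is allowed to return it.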



[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079684#comment-13079684 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

I think the open design question on this one is distributed search, and how a distributed search client will know which Cassandra servers to send a query to.  Meaning, traditionally a query is sent to N servers, whose responses are merged, and X results are returned.  We can send a query to all servers; however, I think we'd then have duplicate rows/documents returned.  How does CQL handle this?


[jira] [Issue Comment Edited] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092606#comment-13092606 ] 

Todd Nine edited comment on CASSANDRA-2915 at 8/29/11 4:30 AM:
---------------------------------------------------------------

I don't necessarily think there is a 1 to 1 relationship between a column and a Lucene document field.  In our case we have the need to index fields in more than one manner.  For instance, we index users as straight strings (lowercased) with email, first name and last name columns.  However we also want to tokenize the email, first and last name columns to allow our customer support people to perform partial name matching.  I think a 1 to N mapping is required for column to document field to allow this sort of functionality.
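
A small sketch of such a 1 to N mapping, in plain Python: one column feeds both an exact (lowercased) field and a tokenized field. The field names and the `FIELD_MAPPINGS` table are hypothetical, chosen only to mirror the email example above.

```python
import re

# Hypothetical mapping: one column feeds several index fields.
FIELD_MAPPINGS = {
    "email": [
        # Exact match on the lowercased value.
        ("email_exact", lambda v: [v.lower()]),
        # Tokenized for partial matching by support staff.
        ("email_tokens", lambda v: [t for t in re.split(r"[^0-9a-z]+", v.lower()) if t]),
    ],
}

def column_to_fields(column, value):
    """Return {field name: [terms]} for every field derived from the column."""
    return {field: analyze(value) for field, analyze in FIELD_MAPPINGS.get(column, [])}

fields = column_to_fields("email", "Ann.Lee@Example.com")
```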

As far as expiration on columns, is there a system event that we can hook into to just force a document reindex when a column expires rather than add an additional field that will need to be sorted from?

As per Jason's previous post, I think supporting ORDER BY, GROUP BY, COUNT, LIKE etc are a must.  Most users have become accustomed to this functionality with RDBMS.  If they cause potential performance problems, I think this should be documented so that users have enough information to determine if they can rely on the Lucene index or should build their own index directly.


Has anyone looked at existing code in ElasticSearch to avoid some of the pitfalls they have already experienced in building something similar?

http://www.elasticsearch.org/


Lastly, this is a huge feature for the hector-jpa plugin. What can I do to help?




[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072418#comment-13072418 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

Does Cassandra have a built in RPC mechanism we can use to send the [Lucene] queries to the distributed servers?


        

[jira] [Issue Comment Edited] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092606#comment-13092606 ] 

Todd Nine edited comment on CASSANDRA-2915 at 8/29/11 4:29 AM:
---------------------------------------------------------------

I don't necessarily think there is a 1 to 1 relationship between a column and a Lucene document field.  In our case we need to index fields in more than one manner.  For instance, we index users as straight strings (lowercased) with email, first name and last name columns.  However, we also want to tokenize the email, first name and last name columns to allow our customer support people to perform partial name matching.  I think a 1 to N mapping from column to document field is required to allow this sort of functionality.
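
A minimal sketch of the 1 to N idea, written in Python rather than against Lucene's Java API so that it stays self-contained.  The field names (`email_exact`, `email_tokens`, ...) and the helpers `exact` and `tokenize` are hypothetical; they only stand in for whatever analyzers the real index would use.

```python
import re

def exact(value):
    """Index the whole value as one lowercased term."""
    return [value.lower()]

def tokenize(value):
    """Split on non-alphanumeric characters so parts of a value can match."""
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]

# Hypothetical mapping: one column feeds several document fields.
FIELD_MAPPING = {
    "email": [("email_exact", exact), ("email_tokens", tokenize)],
    "first_name": [("first_name_exact", exact), ("first_name_tokens", tokenize)],
}

def row_to_fields(row):
    """Expand a row (column -> value) into document fields (field -> terms)."""
    fields = {}
    for column, value in row.items():
        for field_name, analyzer in FIELD_MAPPING.get(column, []):
            fields[field_name] = analyzer(value)
    return fields

doc = row_to_fields({"email": "Jane.Doe@example.com", "first_name": "Jane"})
```

Here the single `email` column produces both an exact-match field and a tokenized field, which is the kind of mapping a strict 1 to 1 design would rule out.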

As far as expiration on columns, is there a system event that we can hook into to just force a document reindex when a column expires rather than add an additional field that will need to be sorted from?

As per Jason's previous post, I think supporting ORDER BY, GROUP BY, COUNT, LIKE, etc. is a must.  Most users have become accustomed to this functionality from RDBMSs.  If these features cause potential performance problems, I think that should be documented so that users have enough information to decide whether they can rely on the Lucene index or should build their own index directly.


Has anyone looked at existing code in ElasticSearch to avoid some of the pitfalls they have already experienced in building something similar?

http://www.elasticsearch.org/


Lastly, this is a huge feature for the hector-jpa plugin.  What can I do to help?



      was (Author: tnine):
    
I don't necessarily think there is a 1 to 1 relationship between a column and a Lucene document field.  In our case we need to index fields in more than one manner.  For instance, we index users as straight strings (lowercased) with email, first name and last name columns.  However, we also want to tokenize the email, first name and last name columns to allow our customer support people to perform partial name matching.  I think a 1 to N mapping from column to document field is required to allow this sort of functionality.

As far as expiration on columns, is there a system event that we can hook into to just force a document reindex when a column expires rather than add an additional field that will need to be sorted from?

As per Jason's previous post, I think supporting ORDER BY, GROUP BY, COUNT, LIKE, etc. is a must.  Most users have become accustomed to this functionality from RDBMSs.  If these features cause potential performance problems, I think that should be documented so that users have enough information to decide whether they can rely on the Lucene index or should build their own index directly.


Lastly, this is a huge feature for the hector-jpa plugin.  What can I do to help?


  

        

[jira] [Issue Comment Edited] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093344#comment-13093344 ] 

Todd Nine edited comment on CASSANDRA-2915 at 8/30/11 2:13 AM:
---------------------------------------------------------------

I think forcing users to install classes for common use cases would cause issues with adoption.  What about creating new CQL commands to handle this?  When creating an index in a db, you would define the fields and the manner in which they are indexed.  Could we do something like the following?


create index on [colname] in [colfamily] using [index type 1] as [indexFieldName], [index type 2] as [indexFieldName], [index type n] as [indexFieldName]?

drop index [indexFieldName] in [colfamily] on [colname]



This way clients such as JPA can update and create indexes, without the need to install custom classes on Cassandra itself.  They also have the ability to directly reference the field name when using CQL queries.

Assuming that the index class types exist in the Lucene classpath, you get the 1 to many mappings for column to indexing strategy.  This would allow more advanced clients such as the JPA plugin to automatically add indexes to the document based on indexes defined on persistent fields, without generating any code the user has to install in the Cassandra runtime.  If users want to install custom analyzers, they still have the option to do so, and would gain access to it via CQL.

      was (Author: tnine):
    I think forcing users to install classes for common use cases would cause issues with adoption.  What about creating new CQL commands to handle this?  When creating an index in a db, you would define the fields and the manner in which they are indexed.  Could we do something like the following?


create index [colname] in [colfamily] using [index type 1] as [indexFieldName], [index type 2] as [indexFieldName], [index type n] as [indexFieldName]?

drop index [indexFieldName] in [colfamily] on [colname]



This way clients such as JPA can update and create indexes, without the need to install custom classes on Cassandra itself.  They also have the ability to directly reference the field name when using CQL queries.

Assuming that the index class types exist in the Lucene classpath, you get the 1 to many mappings for column to indexing strategy.  This would allow more advanced clients such as the JPA plugin to automatically add indexes to the document based on indexes defined on persistent fields, without generating any code the user has to install in the Cassandra runtime.  If users want to install custom analyzers, they still have the option to do so, and would gain access to it via CQL.
  

        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094116#comment-13094116 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------

I agree that order by could be a performance killer for large data sets.  In large data sets I think that users should make use of de-normalization and create their own secondary index for efficient querying.  However, on small data sets, which seem to be very common in web systems (ours is about 80% of the data a user sees), order by semantics are very important.  Most of the data our users see has a very small result set, < 100 rows.  I think explicitly prohibiting these features limits the user too much.  Shouldn't they be supported, leaving it ultimately up to the user to determine which approach to take in implementing indexes for their data?


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079653#comment-13079653 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------

Hey guys.  We're doing something similar in the hector JPA plugin. 

Would using dynamic composites within Cassandra alleviate the need for Lucene documents?  We're using this in secondary indexing and it gives us order by semantics and AND (intersection).  The largest issue becomes iteration with OR clauses.  AND clauses can be compressed into a single column for efficient range scans; we then use iterators to UNION the OR trees together with order clauses in the composites.
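
To make the OR handling concrete, here is a rough Python sketch of merging already-sorted result iterators, one per OR branch, into a single ordered stream without duplicates.  It is an illustration only, not the hector-jpa code; `union_sorted` and the sample (sort value, row key) tuples are made up.

```python
import heapq

def union_sorted(*branches):
    """Lazily merge sorted iterables, dropping consecutive duplicates."""
    previous = object()  # sentinel that equals no real entry
    for entry in heapq.merge(*branches):
        if entry != previous:
            yield entry
            previous = entry

# Each branch is the ordered result of one OR clause.
branch_a = [(1, "row3"), (4, "row1"), (9, "row7")]
branch_b = [(2, "row5"), (4, "row1"), (8, "row2")]
merged = list(union_sorted(branch_a, branch_b))
```

Because every branch is already sorted by the composite, the merge never has to buffer a full result set; a row matched by both branches, like `(4, "row1")` here, is emitted once.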


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079489#comment-13079489 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

I looked at MessagingService which seems to be more [custom] asynchronous?  

I think we could offer a Thrift API?  What does CQL use?  

I think we'd want to look towards making this [Lucene] play well / integrate with CQL?


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082878#comment-13082878 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

In which physical directory do we want to place the Lucene indexes?


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079741#comment-13079741 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------

A couple questions.

1. Will read after write be available?  I.e., once your mutation for the row key returns to the client, the row has an entry in the Lucene index, which can immediately be queried to return the results.

2. What about durability?  In the event Cassandra crashes, will the Lucene index retain these indexed values, or will they be lost if commit is not invoked on the index?






        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079735#comment-13079735 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. like getLuceneAnalyzer()

There won't always be a 1 to 1 mapping of a column to a field.  For example, in Solr there is copy field, which essentially creates a new field.  Also, Analyzer is for any field; the right per-field class would be Tokenizer.

I strongly believe we need to have an interface that accepts a row and essentially generates a Lucene Document.  This should be the most straightforward approach that enables just about anything, including using a Solr schema at some point.
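
Roughly, such an interface could look like the sketch below.  It is written in Python to stay self-contained; the names `RowDocumentBuilder`, `build_document` and `UserDocumentBuilder` are hypothetical, and a plain dict stands in for a Lucene Document.  The `full_name` field shows the copy-field style case where a field has no single source column.

```python
from abc import ABC, abstractmethod

class RowDocumentBuilder(ABC):
    """Turns one row into one index document."""

    @abstractmethod
    def build_document(self, row_key, columns):
        """Return a dict of field name -> value for the given row."""

class UserDocumentBuilder(RowDocumentBuilder):
    def build_document(self, row_key, columns):
        document = {"key": row_key}
        # Straight 1 to 1 copies of the columns that are present.
        for name in ("first_name", "last_name"):
            if name in columns:
                document[name] = columns[name]
        # Derived field built from two columns.
        parts = [columns.get("first_name"), columns.get("last_name")]
        document["full_name"] = " ".join(p for p in parts if p)
        return document

document = UserDocumentBuilder().build_document(
    "user42", {"first_name": "Ada", "last_name": "Lovelace"}
)
```

Since the builder sees the whole row, it can emit any number of fields per column, or one field from several columns, without the index code knowing about the schema.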


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079682#comment-13079682 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. Would using dynamic composites within cassandra alleviate the need for Lucene documents?

I think it is hard to duplicate the efficiency of Lucene for dis/conjunction queries (OR / AND), especially with PFOR implemented (a CPU directed enhanced system for decoding integers on today's microprocessors).

We can/will turn off scoring which further makes Lucene a straight query execution engine, as opposed to a free text search engine.  Range queries in Lucene use a trie system which is highly effective.  
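
For readers less familiar with how an inverted index evaluates AND, the core operation is intersecting sorted lists of document IDs.  The Python sketch below shows only that basic two-pointer walk; it is not Lucene's implementation, which adds skip lists and compressed encodings, and the sample lists are invented.

```python
def intersect(postings_a, postings_b):
    """Return doc IDs present in both sorted lists (an AND query)."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            result.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result

# Doc IDs matching each clause, kept in sorted order by the index.
status_active = [2, 5, 8, 13, 21]
country_us = [1, 2, 8, 9, 21, 34]
matches = intersect(status_active, country_us)
```

Each list is read once, front to back, so the cost grows with the length of the postings rather than with the number of rows in the column family.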


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Todd Nine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079695#comment-13079695 ] 

Todd Nine commented on CASSANDRA-2915:
--------------------------------------

I'm quite keen to contribute on this issue, as this will greatly enhance the functionality of the hector-jpa project.  If I can contribute any work, please let me know.

> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>              Labels: secondary_index
>             Fix For: 1.0
>
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:
>    - Multiple IndexClauses only work when there is a subset of rows under the highest clause
>    - One new column family is created per index; this means 10 new CFs for 10 secondary indexes
> This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene indexes segments in memory then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be sync'd as well.
> We will also need to correlate column validators to Lucene tokenizers, so the data can be stored properly. The big win is that once this is done we can perform complex queries within a column, like wildcard searches.
> The downside of this approach is we will need to read before write, since documents in Lucene are written as complete documents. For random workloads with lots of indexed columns this means we need to read the document from the index, update it and write it back.


        

[jira] [Assigned] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani reassigned CASSANDRA-2915:
-----------------------------------------

    Assignee: Jason Rutherglen


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080002#comment-13080002 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

I think it's important to note the many SQL-like features Lucene has [now].  

ORDER BY, GROUP BY, COUNT / facet, AND / OR queries, LIKE.  This makes Lucene ideal for CQL and its goals.


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079730#comment-13079730 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

bq. Ok. Typically in distributed search one needs/wants to send the request to all of the possible nodes that contain data pertinent to the query. Is this possible?

See CASSANDRA-1337. In the worst case it will always need to hit all the nodes (the same applies if we add support for ORDER BY in CQL).


bq. Can we simply define a class that intercepts row updates for a column family? Then that class can implement what is needed to analyze the columns / row?

The problem is that the Type class can be user defined, so this doesn't get us very far. I was thinking we add a new method to the AbstractType class that can be set, like getLuceneAnalyzer().
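
A minimal sketch of what such a hook might look like, under stated assumptions. It does not use the real Cassandra or Lucene classes: SketchType stands in for AbstractType, SketchAnalyzer stands in for a Lucene analyzer, and the two subclasses are hypothetical. Only the method name getLuceneAnalyzer() comes from the comment above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a Lucene analyzer: turns a column value into index terms.
interface SketchAnalyzer {
    List<String> terms(String value);
}

// Hypothetical stand-in for Cassandra's AbstractType, reduced to the proposed hook.
abstract class SketchType {
    // Default: index the whole value as one term, which only supports exact matches.
    SketchAnalyzer getLuceneAnalyzer() {
        return new SketchAnalyzer() {
            public List<String> terms(String value) {
                return Collections.singletonList(value);
            }
        };
    }
}

// A type that keeps the default behaviour.
class SketchBytesType extends SketchType {
}

// A user-defined type that opts into word-level, case-insensitive indexing.
class SketchTextType extends SketchType {
    @Override
    SketchAnalyzer getLuceneAnalyzer() {
        return new SketchAnalyzer() {
            public List<String> terms(String value) {
                List<String> out = new ArrayList<String>();
                for (String word : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.add(word);
                    }
                }
                return out;
            }
        };
    }
}
```

The point of putting the hook on the type rather than on an interceptor is that a user-defined type carries its own analysis rules, so the indexing code never needs to know which concrete type it is handling.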




        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072582#comment-13072582 ] 

Jonathan Ellis commented on CASSANDRA-2915:
-------------------------------------------

Yes.  Look at uses of MessagingService.


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080012#comment-13080012 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. Regarding realtime search, hasn't our (twitter's) realtime search branch been merged into lucene trunk? 

There's LUCENE-2312.  Twitter's RT search is highly specialized (yes, I'm familiar with it); Lucene is far too general (think of payloads, phrase queries, span queries, etc.) for Twitter's code to be merged into it.  If Twitter's search were to be integrated, there would be an awful lot of refactoring of Lucene required.


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067224#comment-13067224 ] 

Jonathan Ellis commented on CASSANDRA-2915:
-------------------------------------------

bq. For random workloads with lot's of indexed columns this means we need to read the document from the index, update it and write it back.

Could we go for a deeper level of integration?  Instead of storing the data twice as Cassandra row + Lucene document, use the row as the document Source Of Truth, and just let Lucene handle the indexes?
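
One possible reading of this suggestion, sketched with plain Java collections rather than the real Cassandra or Lucene APIs. Everything here is hypothetical: the class name, the "column=value" term format, and the choice to verify index hits against the row at query time. The sketch only illustrates keeping values in the row and keeping term-to-row-key postings in the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: rows hold the data, the index holds only term -> row keys.
class RowBackedIndexSketch {
    // Stand-in for the column family: row key -> (column name -> value).
    private final Map<String, Map<String, String>> rows =
            new HashMap<String, Map<String, String>>();
    // Stand-in for the index: "column=value" term -> candidate row keys.
    private final Map<String, Set<String>> postings =
            new HashMap<String, Set<String>>();

    // Writes one column and adds a posting; nothing is read back first.
    void put(String rowKey, String column, String value) {
        Map<String, String> row = rows.get(rowKey);
        if (row == null) {
            row = new HashMap<String, String>();
            rows.put(rowKey, row);
        }
        row.put(column, value);

        String term = column + "=" + value;
        Set<String> keys = postings.get(term);
        if (keys == null) {
            keys = new TreeSet<String>();
            postings.put(term, keys);
        }
        keys.add(rowKey);
    }

    // Looks up candidates in the index, then trusts only what the row says now.
    List<String> query(String column, String value) {
        List<String> hits = new ArrayList<String>();
        Set<String> candidates = postings.get(column + "=" + value);
        if (candidates == null) {
            return hits;
        }
        for (String rowKey : candidates) {
            Map<String, String> row = rows.get(rowKey);
            if (row != null && value.equals(row.get(column))) {
                hits.add(rowKey);
            }
        }
        return hits;
    }
}
```

In this sketch stale postings are never removed; they are simply filtered out at read time because the row is authoritative. A real design would also need some way to purge them.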


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079982#comment-13079982 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. LUCENE-2454 adds support for nested documents. we can perhaps use this to avoid the read before write

I think LUCENE-2454 needs the nested documents to be added at the same time.  In our case that wouldn't be happening.  Google's GData, for example, doesn't offer the feature of automatically retrieving values from the previous document; it assumes you are replacing the entire document with new contents, and relies on the user to have read the document [somewhere] before.

I think there's another Lucene issue that performs an initial query to obtain the parent document.  However that is the same as a read before write.

I'm guessing Cassandra enables updating an individual column?  I don't think there's any way around this?

bq. We could store the expiration time in the document and make it a constraint on the lucene query so we don't pull expired data

That would work.  We'd need to use a trie range filter query, which will make all queries a little bit slower.
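
A rough sketch of the expiration idea, assuming each indexed document carries an expiration timestamp and every query adds a "not yet expired" constraint. The class and field names are hypothetical, and a plain comparison in a loop stands in for the numeric range filter; it is not the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical indexed document: one term plus the time at which it expires.
class ExpiringDocSketch {
    final String rowKey;
    final String term;
    final long expiresAtMillis; // Long.MAX_VALUE means "never expires"

    ExpiringDocSketch(String rowKey, String term, long expiresAtMillis) {
        this.rowKey = rowKey;
        this.term = term;
        this.expiresAtMillis = expiresAtMillis;
    }
}

class ExpiringQuerySketch {
    // Every query is the user's term match AND a range constraint on expiration.
    static List<String> search(List<ExpiringDocSketch> docs, String term, long nowMillis) {
        List<String> hits = new ArrayList<String>();
        for (ExpiringDocSketch doc : docs) {
            boolean matches = doc.term.equals(term);
            boolean live = doc.expiresAtMillis > nowMillis;
            if (matches && live) {
                hits.add(doc.rowKey);
            }
        }
        return hits;
    }
}
```

Because the extra constraint is attached to every query, its cost is paid on every search, which is the slowdown mentioned above.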


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093298#comment-13093298 ] 

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

Todd,

Another option is to add a [user optional] class that converts raw Cassandra columns into a Lucene document.  Implicitly, the Cassandra columns do not need to map to Lucene document fields.  This is more of a slight change in the user's expectations for CQL than a core functional change.  E.g., the CQL submitted to a Lucene secondary index may refer to Lucene fields that do not exist as columns.
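
A small sketch of such a converter, using maps as stand-ins for both the Cassandra columns and the Lucene document. The interface, both implementations, and the column and field names ("first", "last", "full_name") are all hypothetical; none of this is an existing Cassandra or Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical user-optional hook: raw columns in, indexable document fields out.
interface DocumentConverterSketch {
    Map<String, String> toDocument(Map<String, String> columns);
}

// Assumed default when the user supplies nothing: one field per column.
class IdentityConverterSketch implements DocumentConverterSketch {
    public Map<String, String> toDocument(Map<String, String> columns) {
        return new LinkedHashMap<String, String>(columns);
    }
}

// User-supplied converter whose output field does not exist as a column.
class FullNameConverterSketch implements DocumentConverterSketch {
    public Map<String, String> toDocument(Map<String, String> columns) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        String first = columns.get("first");
        String last = columns.get("last");
        if (first != null && last != null) {
            doc.put("full_name", first + " " + last);
        }
        return doc;
    }
}
```

With the second converter, a query against the index would refer to full_name even though the row only has first and last, which is the change in user expectations described above.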


        

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083099#comment-13083099 ] 

T Jake Luciani commented on CASSANDRA-2915:
-------------------------------------------

Under the CF dir I imagine
