Posted to commits@cassandra.apache.org by "Matt Stump (JIRA)" <ji...@apache.org> on 2013/12/17 07:40:14 UTC

[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850175#comment-13850175 ] 

Matt Stump commented on CASSANDRA-2915:
---------------------------------------

Given that the read-before-write issue still stands for non-numeric fields (as of Lucene 4.6), are Lucene-based secondary indexes still something we want committed in the near term? Or do we want to wait until incremental updates/stacked segments are available for all field types?

Additionally, even with near-real-time search, Lucene imposes a delay between when a row is added and when it becomes queryable, which differs from the existing secondary-index behavior; is that something we can live with?
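The visibility gap mentioned above can be demonstrated directly with Lucene's NRT machinery: a document added through an IndexWriter is not seen by searchers until the SearcherManager refreshes. A minimal sketch against the Lucene 4.x API (class and field names are illustrative, not from any Cassandra patch):

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtVisibilityDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer()));
        SearcherManager mgr = new SearcherManager(writer, true, new SearcherFactory());

        Document doc = new Document();
        doc.add(new StringField("key", "row1", Field.Store.YES));
        writer.addDocument(doc);

        // The row sits in the writer's buffer but is NOT yet visible to readers.
        IndexSearcher before = mgr.acquire();
        int hitsBefore;
        try {
            hitsBefore = before.search(new TermQuery(new Term("key", "row1")), 1).totalHits;
        } finally {
            mgr.release(before);
        }

        // Only after a (normally periodic) refresh does the row become queryable;
        // that refresh interval is the delay the comment refers to.
        mgr.maybeRefreshBlocking();
        IndexSearcher after = mgr.acquire();
        int hitsAfter;
        try {
            hitsAfter = after.search(new TermQuery(new Term("key", "row1")), 1).totalHits;
        } finally {
            mgr.release(after);
        }

        if (hitsBefore != 0 || hitsAfter != 1)
            throw new AssertionError("expected 0 hits before refresh, 1 after");
        System.out.println("hits before refresh: " + hitsBefore + ", after: " + hitsAfter);
        mgr.close();
        writer.close();
        dir.close();
    }
}
```

How aggressively `maybeRefresh` is scheduled would bound the staleness a query can observe.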

> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>              Labels: secondary_index
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:
>    - Multiple IndexClauses only work when there is a subset of rows under the highest clause
>    - One new column family is created per index, which means 10 new CFs for 10 secondary indexes
> This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene builds index segments in memory and then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which corresponds to our compaction process, so these can be synced as well.
> We will also need to map column validators to Lucene tokenizers so the data can be stored properly; the big win is that once this is done we can perform complex queries within a column, like wildcard searches.
> The downside of this approach is that we will need to read before write, since Lucene documents are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.
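The validator-to-tokenizer mapping the ticket describes can be sketched with Lucene's PerFieldAnalyzerWrapper. The column names and validator choices below are hypothetical: an exact-match validator (e.g. UUIDType) maps to an untokenized analyzer, while a free-text validator (e.g. UTF8Type) maps to a tokenizing one.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class ValidatorAnalyzerMapping {
    // Hypothetical per-column mapping for one column family:
    // exact-match validators get the keyword analyzer (whole value = one term),
    // free-text validators get a tokenizing analyzer for wildcard/word queries.
    static Analyzer analyzerForColumnFamily() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("user_id", new KeywordAnalyzer());               // e.g. UUIDType
        perField.put("bio", new StandardAnalyzer(Version.LUCENE_46)); // e.g. UTF8Type
        // Columns without an explicit mapping fall back to exact-match indexing.
        return new PerFieldAnalyzerWrapper(new KeywordAnalyzer(), perField);
    }

    public static void main(String[] args) {
        Analyzer a = analyzerForColumnFamily();
        System.out.println(a.getClass().getSimpleName());
        a.close();
    }
}
```

A single wrapped analyzer like this can be handed to one IndexWriter per CF, matching the one-index-per-CF design.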
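The read-before-write cost comes from Lucene's update model: IndexWriter.updateDocument deletes the old document and indexes a complete replacement, so changing one column means re-reading the stored document, copying the untouched fields, and re-indexing everything. A minimal sketch of that round trip (field names are illustrative):

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ReadBeforeWriteDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer()));

        // Index a "row" with two indexed columns plus its key.
        Document row = new Document();
        row.add(new StringField("key", "row1", Field.Store.YES));
        row.add(new StringField("city", "Oslo", Field.Store.YES));
        row.add(new StringField("name", "Ada", Field.Store.YES));
        writer.addDocument(row);
        writer.commit();

        // To change ONE column we must first read the whole stored document...
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("key", "row1")), 1);
        Document old = searcher.doc(hits.scoreDocs[0].doc);
        reader.close();

        // ...rebuild it, copying every untouched field...
        Document updated = new Document();
        updated.add(new StringField("key", old.get("key"), Field.Store.YES));
        updated.add(new StringField("city", "Paris", Field.Store.YES));        // changed
        updated.add(new StringField("name", old.get("name"), Field.Store.YES)); // copied

        // ...and replace the entire document.
        writer.updateDocument(new Term("key", "row1"), updated);
        writer.commit();

        DirectoryReader r2 = DirectoryReader.open(dir);
        IndexSearcher s2 = new IndexSearcher(r2);
        Document now = s2.doc(
                s2.search(new TermQuery(new Term("key", "row1")), 1).scoreDocs[0].doc);
        if (!"Paris".equals(now.get("city")) || !"Ada".equals(now.get("name")))
            throw new AssertionError("update lost data");
        System.out.println("city=" + now.get("city") + " name=" + now.get("name"));
        r2.close();
        writer.close();
        dir.close();
    }
}
```

For a workload of random single-column updates, every write pays this extra index read, which is the overhead the commenter is weighing.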



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)