Posted to user@cassandra.apache.org by Evgeniy Ryabitskiy <ev...@wikimart.ru> on 2011/09/12 15:55:54 UTC

Index search in provided list of rows (list of rowKeys).

Hi,

We need to search over data stored in Cassandra, and we are using Sphinx for
indexing.
Because of Sphinx's architecture, we can't run range queries over all the
fields we need.
So we have to run a Sphinx query first to get a list of rowKeys and then
perform additional range filtering over column values.

The first, simple solution is to do it on the client side. That increases
network traffic and memory usage on the client.

Now I'm wondering whether it is possible to perform such filtering on the
Cassandra side.
I would like to use an IndexExpression for range filtering within a list of
records (the list of rowKeys returned from the external search engine).

Looking at get_indexed_slices, I found that IndexClause has no way to set a
list of rowKeys (as multiget_slice does), only start_key.

So 2 questions:

1) Am I missing something, and is my idea possible via some other API?
2) If it is not possible, can I file a JIRA for this feature?

Evgeny.

Re: Index search in provided list of rows (list of rowKeys).

Posted by aaron morton <aa...@thelastpickle.com>.
The way to specify more restrictions on the query is to put them in the index_clause. The index clause is applied to the set of all rows in the database, not to a subset; applying it to a subset would implicitly mean supporting a subquery. Currently it does "select then project"; this would be "select then select then project".
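
To make the distinction concrete, here is a minimal in-memory sketch in Java. It is not Cassandra code: the table is modeled as a map from row key to integer columns, and the names SelectProjectSketch, selectThenProject and selectSelectProject are made up for illustration. The first method applies the clause to every row; the second narrows to a caller-supplied key list first, which is the extra "select" step.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model only: rowKey -> (column name -> int value). All names are hypothetical.
class SelectProjectSketch {

    // "select then project": test the clause against ALL rows, then return one column.
    static List<Integer> selectThenProject(Map<String, Map<String, Integer>> table,
                                           Predicate<Map<String, Integer>> clause,
                                           String column) {
        List<Integer> out = new ArrayList<>();
        for (Map<String, Integer> row : table.values()) {
            if (clause.test(row)) {
                out.add(row.get(column));
            }
        }
        return out;
    }

    // "select then select then project": restrict to the given keys first,
    // then test the clause, then return one column.
    static List<Integer> selectSelectProject(Map<String, Map<String, Integer>> table,
                                             Collection<String> keys,
                                             Predicate<Map<String, Integer>> clause,
                                             String column) {
        List<Integer> out = new ArrayList<>();
        for (String key : keys) {
            Map<String, Integer> row = table.get(key);
            // Keys that do not exist in the table are simply skipped.
            if (row != null && clause.test(row)) {
                out.add(row.get(column));
            }
        }
        return out;
    }
}
```

In this toy version the second form only touches the rows named in the key list, which is the behaviour the original poster is asking for.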

Right now I would use Solandra, or do the entire search in Sphinx and get the row keys for the result documents. In the future you may be able to use this https://issues.apache.org/jira/browse/CASSANDRA-2915

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2011, at 12:46 AM, Evgeniy Ryabitskiy wrote:

> Why it's radically?
> 
> It will be same get_indexes_slices search but in specified set of rows. So mostly it will be one more Search Expression over rowIDs not only column values. Usually the more restrictions you could specify in search query, the faster search it can be (not slower at least).
> 
> About moving to another engine:
> 
> Sphinx has it's advantages (quite fast) and disadvantages (painful integration, lot's of limitations). Currently my company using it on production, so moving to another search engine is a big step and it will be considered.
> 
> 
> What I want to discuss is common task of searching in Cassandra. Maybe I missing some already well known solution for it (silver bullet)?
> I see only 2 solutions:
> 
> 1) Using external search engine that will index all storage fields
> 
> advantage:
>  support full text search
> some engines have nice search features like "sorting by relevance"
> 
> disadvantage: 
> for range scans it stores column values, it mean that huge part of cassandra data will be also stored at Search Engine metadata
> usually engines have set of limitations
> 
> 2) Use Cassandra embedded Indexing search
> advantage: 
> doesn't need to index all columns that are used for filtering. 
> Filtering performed at storage, close to data.
> 
> disadvantage: 
> not full text search support
> require to create and maintain secondary indexes.
> 
> Both solutions are exclusive, you could choose only one and there is no way to use combination of this 2 solutions (except intersection at client side which is not a solution).
> 
> So API that was discussed would open some possibility to use that combination. 
> For me it looks like third solution. Could it really change the way we are searching in Cassandra?
> 
> 
> Evgeny.


Re: Index search in provided list of rows (list of rowKeys).

Posted by Evgeniy Ryabitskiy <ev...@wikimart.ru>.
Why is it radical?

It would be the same get_indexed_slices search, but within a specified set of
rows. So mostly it would be one more search expression, over rowIDs rather
than only over column values. Usually, the more restrictions you can specify
in a search query, the faster the search can be (or at least not slower).

About moving to another engine:

Sphinx has its advantages (quite fast) and disadvantages (painful
integration, lots of limitations). Currently my company is using it in
production, so moving to another search engine is a big step, and it will be
considered.


What I want to discuss is the common task of searching in Cassandra. Maybe I
am missing some already well-known solution for it (a silver bullet)?
I see only 2 solutions:

1) Use an external search engine that indexes all stored fields

advantages:
supports full text search
some engines have nice search features like "sorting by relevance"

disadvantages:
for range scans it stores column values, which means that a huge part of the
Cassandra data will also be stored in the search engine's metadata
engines usually have a set of limitations

2) Use Cassandra's embedded index search

advantages:
doesn't need to index all columns that are used for filtering
filtering is performed at the storage layer, close to the data

disadvantages:
no full text search support
requires creating and maintaining secondary indexes

The two solutions are mutually exclusive: you can choose only one, and there
is no way to use a combination of the two (except intersection on the client
side, which is not really a solution).

So the API that was discussed would open up the possibility of using that
combination.
To me it looks like a third solution. Could it really change the way we
search in Cassandra?


Evgeny.

Re: Index search in provided list of rows (list of rowKeys).

Posted by aaron morton <aa...@thelastpickle.com>.
Not sure it's a feature Cassandra needs; it would radically change the meaning of get_indexed_slices(). If you already know the row keys, the assumption would be that you know they are the rows you want to get.

Feel free to add a Jira though. 

IMHO this sounds more like Sphinx not supporting all the features you need, rather than cassandra. Can you use a different search engine such as Solr, Solandra or Elastic Search? Or 

Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13/09/2011, at 10:27 AM, Evgeniy Ryabitskiy wrote:

> Something like this.
> 
> Actually I think it's better to extend get_indexed_slice() API instead of creating new one thrift method.
> I wish to have something like this:
> 
> //here we run query to external search engine
> List<byte[]> keys = performSphinxQuery(someFullTextSearchQuery);
> IndexClause indexClause = new IndexClause();
> 
> //required API to set list of keys
> indexClause.setKeys(keys);
> indexClause.setExpressions(someFilteringExpressions);
> List finalResult = get_indexed_slices(colParent, indexClause, colPredicate, cLevel);
> 
> 
> 
> I can't solve my issue with single get_indexed_slice().
> Here is issue in more details: 
> 1) have ~ 6 millions records, in feature could be much more
> 2) have  > 10k different properties (stored as column values in Cassandra), in feature could be much more
> 3) properties are text descriptions , int/float values, string values 
> 4) need to implement search over all properties. For text descriptions: full text search. for int/float properties: range search.
> 5) Search query could use any combination of property descriptions. Like full text search description and some range expression for int/float field.
> 6) have external search engine (Sphinx) that indexed all string and text properties
> 7) still need to perform range search for int, float fields.
> 
> So now I split my query expressions in 2 groups:
> 1) expressions that can be handled by search engine
> 2) others (additional filters)
> 
> For example I run first query to Sphinx and got list of rowKeys, with length of 100k.  (mark as RESULT1)
> Now I need to filter it by second group of expressions. For example I have simple expression: "age > 25".
> So imagine I would run get_indexed_slice() with this query and could possibly get half of my records in result. (mark as RESULT2)
> Then I would need to get intersection between RESULT1 and RESULT2 on client side, which could take a lot of time and memory.
> That is why I can't use single get_indexed_slice here.
> 
> For me is better to iterate RESULT1 (with 100k records) at client side to filter by age and got 10-50k record as final result. Disadvantage here is that I have to fetch all 100k records.
> 
> Evgeny.


Re: Index search in provided list of rows (list of rowKeys).

Posted by Evgeniy Ryabitskiy <ev...@wikimart.ru>.
Something like this.

Actually I think it's better to extend the get_indexed_slices() API instead of
creating a new Thrift method.
I wish to have something like this:

// Here we run the query against the external search engine.
List<byte[]> keys = performSphinxQuery(someFullTextSearchQuery);
IndexClause indexClause = new IndexClause();

// The requested API: a way to set the list of keys (does not exist today).
indexClause.setKeys(keys);
indexClause.setExpressions(someFilteringExpressions);
List<KeySlice> finalResult = get_indexed_slices(colParent, indexClause,
        colPredicate, cLevel);



I can't solve my issue with a single get_indexed_slices() call.
Here is the issue in more detail:
1) we have ~6 million records; in the future there could be many more
2) we have > 10k different properties (stored as column values in
Cassandra); in the future there could be many more
3) properties are text descriptions, int/float values, and string values
4) we need to implement search over all properties: full text search for
text descriptions, range search for int/float properties
5) a search query can use any combination of properties, such as a full
text search on a description plus a range expression on an int/float field
6) we have an external search engine (Sphinx) that has indexed all string
and text properties
7) we still need to perform range search on int and float fields

So now I split my query expressions into 2 groups:
1) expressions that can be handled by search engine
2) others (additional filters)
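
A minimal sketch of that split, in Java. The Expression and Split types and the Kind enum are hypothetical and exist only for this illustration; the rule that full text expressions go to the search engine and range expressions become additional filters is an assumption based on the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; not part of any Cassandra or Sphinx API.
class QuerySplitter {
    enum Kind { FULL_TEXT, RANGE }

    static final class Expression {
        final String field;
        final Kind kind;
        final String value;

        Expression(String field, Kind kind, String value) {
            this.field = field;
            this.kind = kind;
            this.value = value;
        }
    }

    static final class Split {
        final List<Expression> forSearchEngine = new ArrayList<>();
        final List<Expression> additionalFilters = new ArrayList<>();
    }

    // Group 1: expressions the search engine can handle (assumed here: full text).
    // Group 2: everything else (assumed here: range expressions).
    static Split split(List<Expression> expressions) {
        Split result = new Split();
        for (Expression e : expressions) {
            if (e.kind == Kind.FULL_TEXT) {
                result.forSearchEngine.add(e);
            } else {
                result.additionalFilters.add(e);
            }
        }
        return result;
    }
}
```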

For example, I run the first query against Sphinx and get a list of rowKeys
of length 100k (call it RESULT1).
Now I need to filter it by the second group of expressions. For example, I
have a simple expression: "age > 25".
Imagine I ran get_indexed_slices() with this expression: I could possibly
get half of all my records in the result (call it RESULT2).
Then I would need to compute the intersection of RESULT1 and RESULT2 on the
client side, which could take a lot of time and memory.
That is why I can't use a single get_indexed_slices() call here.
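
A rough sketch of that client-side intersection, in Java. Row keys are modeled as Strings for brevity (an assumption; real keys are bytes), and the class and method names are made up. It puts RESULT1 into a hash set and streams RESULT2 past it, so memory is bounded by RESULT1, but every key of RESULT2 still has to reach the client.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Row keys are modeled as Strings for brevity; real Cassandra keys are bytes.
class KeyIntersection {

    // Returns the keys present in both results, in RESULT2's iteration order.
    // Memory is proportional to RESULT1, but all of RESULT2 must still be
    // pulled to the client to be checked.
    static List<String> intersect(List<String> result1, Iterable<String> result2) {
        Set<String> remaining = new HashSet<>(result1);
        List<String> out = new ArrayList<>();
        for (String key : result2) {
            // remove() returns true only the first time a matching key is seen,
            // so duplicates in RESULT2 are not emitted twice.
            if (remaining.remove(key)) {
                out.add(key);
            }
        }
        return out;
    }
}
```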

For me it is better to iterate over RESULT1 (100k records) on the client
side, filter by age, and get 10-50k records as the final result. The
disadvantage is that I have to fetch all 100k records.
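
A minimal sketch of that client-side pass, in Java. The fetchAges function is a hypothetical stand-in for one batched read (for example a multiget_slice call) that returns rowKey -> age for the requested keys. The String keys, the batch size and all names are assumptions for illustration. Batching limits how many rows the client holds at once, but every row in RESULT1 is still fetched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class BatchedClientFilter {

    // fetchAges is a hypothetical stand-in for one batched read (for example a
    // multiget_slice call) that returns rowKey -> age for the requested keys.
    static List<String> keysOlderThan(List<String> keys,
                                      int batchSize,
                                      int minAgeExclusive,
                                      Function<List<String>, Map<String, Integer>> fetchAges) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<String> batch = keys.subList(start, end);
            Map<String, Integer> ages = fetchAges.apply(batch);
            for (String key : batch) {
                Integer age = ages.get(key);
                // Rows without an age column are skipped.
                if (age != null && age > minAgeExclusive) {
                    out.add(key);
                }
            }
        }
        return out;
    }
}
```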

Evgeny.

Re: Index search in provided list of rows (list of rowKeys).

Posted by aaron morton <aa...@thelastpickle.com>.
Just checking, you want an API call like this ? 


multiget_filtered_slice(keys, column_parent, predicate, filter_clause, consistency_level)

Where filter_clause is an IndexClause. 

It's a bit messy.

Is there no way to express this as a single get_indexed_slices() call? With an == index expression to get the row keys and the other expressions to do the range filtering?

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13/09/2011, at 1:55 AM, Evgeniy Ryabitskiy wrote:

> Hi,
> 
> We have an issue to search over Cassandra and we are using Sphinx for indexing.
> Because of Sphinx architecture we can't use range queries over all fields that we need to.
> So we have to run Sphinx Query first to get List of rowKeys and perform additional range filtering over column values.
> 
> First simple solution is to do it on Client side. That will increase network traffic and memory usage on client.
> 
> Now I'm wondering if it possible to perform such filtering on Cassandra side.
> I wish to use some IndexExpression for range filtering in list of records (list of rowKeys returned from external Indexing Search Engine).
> 
> Looking at get_indexed_slices I found out that in IndexClause is no possibility set List of rowKeys (like for multiget_slice), only start_key.
> 
> So 2 questions:
> 
> 1) Am I missing something and my idea is possible via some another API?
> 2) If not possible, can I add JIRA for this feature? 
> 
> Evgeny.