Posted to user@cassandra.apache.org by Oleg Dulin <ol...@gmail.com> on 2012/09/03 15:25:25 UTC

Text searches and free form queries

Dear Distinguished Colleagues:

I need to add full-text search and somewhat free-form queries to my 
application. Our data is made up of "items" that are stored in a single 
column family, and we have a number of secondary indices for lookups. 
An item has header fields and data fields. The items CF is a super 
column family whose row key is the item's natural ID, with one super 
column for the header and one super column for the data.

Our application is made up of several redundant, load-balanced servers, 
all pointing at a Cassandra cluster. Our servers run embedded Jetty.

I need to be able to find items by a combination of field values. 
Currently I have an index for items by field value which works 
reasonably well. I could also add support for data types and index 
items by fields of appropriate types, so we can do range queries on 
items.
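
As a rough illustration of what typed range queries over a field-value 
index involve, here is a minimal in-memory sketch in Python. It is not 
the existing index: a sorted list stands in for columns ordered by a 
typed comparator, and the names FieldRangeIndex, add and range_query 
are hypothetical.

```python
import bisect
from datetime import date


class FieldRangeIndex:
    """In-memory stand-in for an index of items by typed field value."""

    def __init__(self):
        # field name -> list of (value, record_key), kept sorted by value
        self._entries = {}

    def add(self, field, value, record_key):
        bisect.insort(self._entries.setdefault(field, []), (value, record_key))

    def range_query(self, field, low, high):
        """Return record keys whose value lies in [low, high], in value order."""
        entries = self._entries.get(field, [])
        # (low,) sorts before every (low, key), so this finds the first value >= low
        start = bisect.bisect_left(entries, (low,))
        keys = []
        for value, key in entries[start:]:
            if value > high:
                break
            keys.append(key)
        return keys


index = FieldRangeIndex()
index.add("created", date(2012, 9, 3), "item-b")
index.add("created", date(2012, 9, 1), "item-a")
index.add("created", date(2012, 10, 7), "item-c")
september = index.range_query("created", date(2012, 9, 1), date(2012, 9, 30))
```

The point of storing values in their native type is that the sort order 
matches the query order; dates stored as free-form strings would not 
necessarily sort chronologically.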

Ultimately, though, what we want is full text search with suggestions 
and human language sensitivity. We want to search by date ranges, by 
field values, etc. I did some homework on this topic, and here is what 
I see as options:

1) Use an SQL database as a helper. This is rather clunky, and I am not 
sure what it gets us, since just about anything that can be done in SQL 
can be done in Cassandra with proper structures. There is also the 
problem of finding an open source database that can handle the 
workload; I doubt one exists, and I still would not get natural 
language support.
2) Each of our servers can index data using Lucene, but again we have 
to come up with a clunky mechanism where either one of the servers does 
the indexing and results are replicated, or each server does its own 
indexing.
3) We can use Solr as is; perhaps with some small modifications it can 
run within our server JVM, since we already run embedded Jetty. I 
like this idea, actually, but I know that Solr indexing doesn't take 
advantage of Cassandra.
4) DataStax Enterprise with search presumably supports Solr indexing 
of existing column families, but for the life of me I couldn't figure 
out how exactly it does that. The Wikipedia example shows that Solr can 
create column families based on Solr schemas that I can then query 
using Cassandra itself (which is great), and supposedly I can modify 
those column families directly and Solr will reindex them (which is 
even better), but I am not sure how that fits into our server design. 
My other concern is lock-in to a commercial product, which worries me 
a great deal.

So, one possibility I can see is using Solr embedded within our own 
server solution but storing its indexes in the file system outside of 
Cassandra. This is not optimal, and maybe over time I can add my own 
support for storing the Solr index in Cassandra without relying on the 
DataStax solution.

In any case, what are your thoughts and experiences?


Regards,
Oleg

Re: Text searches and free form queries

Posted by Oleg Dulin <ol...@gmail.com>.

>> It works pretty fast.
> Cool.
> Just keep an eye out for how big the lucene token row gets.
> Cheers

Indeed, it may get out of hand, but for now, and for the foreseeable 
future I would say, we are OK.

Should it get larger, I can split it up into rows -- i.e. all tokens 
that start with "a", all tokens that start with "b", etc.
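
A minimal sketch of that split in Python, assuming a row-key format of 
"tokens:<first character>". The key format and the function names 
bucket_row_key and group_columns_by_row are hypothetical, chosen only 
for illustration.

```python
from collections import defaultdict


def bucket_row_key(token, prefix="tokens"):
    """Map a token to the index row that holds it, based on its first character."""
    first = token[0].lower() if token else ""
    # Tokens that are empty or start with punctuation share a catch-all row
    bucket = first if first.isalnum() else "_"
    return f"{prefix}:{bucket}"


def group_columns_by_row(columns):
    """Group (token, record_key) columns by the row they would be written to."""
    rows = defaultdict(list)
    for token, record_key in columns:
        rows[bucket_row_key(token)].append((token, record_key))
    return dict(rows)


rows = group_columns_by_row([
    ("apple", "item-1"),
    ("avocado", "item-2"),
    ("banana", "item-1"),
])
```

A search then only needs to slice the row for each query token's first 
character. Since some letters start far more words than others, the 
rows would likely be uneven in size.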

Re: Text searches and free form queries

Posted by aaron morton <aa...@thelastpickle.com>.

>  It works pretty fast.
Cool. 

Just keep an eye out for how big the lucene token row gets. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 7/10/2012, at 2:57 AM, Oleg Dulin <ol...@gmail.com> wrote:

> So, what I ended up doing is this --
> 
> As I write my records into the main CF, I tokenize some fields that I want to search on using Lucene and write an index into a separate CF, such that my columns are a composite of:
> 
> luceneToken:record key
> 
> I can then search my records by doing a slice for each lucene token in the search query and then do an intersection of the sets. It works pretty fast.
> 
> Regards,
> Oleg
> 

Re: Text searches and free form queries

Posted by Oleg Dulin <ol...@gmail.com>.

So, what I ended up doing is this --

As I write my records into the main CF, I tokenize some fields that I 
want to search on using Lucene and write an index into a separate CF, 
such that my columns are a composite of:

luceneToken:record key

I can then search my records by doing a slice for each lucene token in 
the search query and then do an intersection of the sets. It works 
pretty fast.
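
The scheme above can be sketched roughly as follows. This is a minimal 
in-memory illustration in Python, not the code actually used: a sorted 
list stands in for the index column family, a regular expression stands 
in for the Lucene analyzer, and the names TokenIndex and tokenize are 
hypothetical.

```python
import bisect
import re


def tokenize(text):
    """Crude stand-in for a Lucene analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TokenIndex:
    """One wide row of composite (token, record_key) columns, kept sorted."""

    def __init__(self):
        self._columns = []

    def add(self, record_key, text):
        for token in set(tokenize(text)):
            column = (token, record_key)
            pos = bisect.bisect_left(self._columns, column)
            # Skip columns that are already present (writes are idempotent)
            if pos == len(self._columns) or self._columns[pos] != column:
                self._columns.insert(pos, column)

    def slice_token(self, token):
        """Return the record keys of all columns whose first component is token."""
        # (token,) sorts before every (token, key), so this is the slice start
        start = bisect.bisect_left(self._columns, (token,))
        keys = set()
        for tok, key in self._columns[start:]:
            if tok != token:
                break
            keys.add(key)
        return keys

    def search(self, query):
        """Slice once per query token, then intersect the resulting key sets."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = self.slice_token(tokens[0])
        for token in tokens[1:]:
            result &= self.slice_token(token)
        return result


index = TokenIndex()
index.add("item-1", "Red leather wallet")
index.add("item-2", "Red cotton shirt")
index.add("item-3", "Blue leather belt")
matches = index.search("red leather")
```

Because the columns sort by token first, each slice is one contiguous 
range, and the intersection gives AND semantics across the query terms.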

Regards,
Oleg

On 2012-09-05 01:28:44 +0000, aaron morton said:

> AFAIK, if you want to keep it inside Cassandra, your options are DSE, 
> rolling your own from scratch, or starting with 
> https://github.com/tjake/Solandra . 
> 
> Outside of Cassandra I've heard of people using Elastic Search or Solr 
> which I *think* is now faster at updating the index. 
> 
> Hope that helps. 
> 
>  
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 4/09/2012, at 3:00 AM, Andrey V. Panov <pa...@gmail.com> wrote:
> Someone built search on Lucene, but for very fresh data they build 
> the search index in memory, so the data becomes available for search 
> without delay.
> 
> On 3 September 2012 22:25, Oleg Dulin <ol...@gmail.com> wrote:
> Dear Distinguished Colleagues:


-- 
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/

Re: Text searches and free form queries

Posted by aaron morton <aa...@thelastpickle.com>.

AFAIK, if you want to keep it inside Cassandra, your options are DSE, rolling your own from scratch, or starting with https://github.com/tjake/Solandra . 

Outside of Cassandra I've heard of people using Elastic Search or Solr which I *think* is now faster at updating the index. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 4/09/2012, at 3:00 AM, Andrey V. Panov <pa...@gmail.com> wrote:

> Someone built search on Lucene, but for very fresh data they build the search index in memory, so the data becomes available for search without delay.
> 
> On 3 September 2012 22:25, Oleg Dulin <ol...@gmail.com> wrote:
> Dear Distinguished Colleagues:
>

Re: Text searches and free form queries

Posted by "Andrey V. Panov" <pa...@gmail.com>.

Someone built search on Lucene, but for very fresh data they build the
search index in memory, so the data becomes available for search
without delay.
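
One possible reading of that setup, sketched in Python: recent writes
go to a small in-memory index that is searched alongside the main one
and merged into it periodically. The class name TwoTierIndex, its
methods, and the use of plain dictionaries in place of Lucene are all
assumptions made for illustration.

```python
class TwoTierIndex:
    """Search a persistent index plus an in-memory buffer of fresh writes."""

    def __init__(self):
        self.persistent = {}  # token -> set of record keys (stand-in for the main index)
        self.fresh = {}       # same shape, holds writes not yet merged

    def add(self, record_key, tokens):
        # New data is searchable as soon as it lands in the in-memory buffer
        for token in tokens:
            self.fresh.setdefault(token, set()).add(record_key)

    def flush(self):
        """Merge the in-memory buffer into the persistent index and empty it."""
        for token, keys in self.fresh.items():
            self.persistent.setdefault(token, set()).update(keys)
        self.fresh.clear()

    def lookup(self, token):
        """Return matches from both tiers."""
        return self.persistent.get(token, set()) | self.fresh.get(token, set())


index = TwoTierIndex()
index.add("item-1", ["red", "wallet"])
before_flush = index.lookup("red")
index.flush()
after_flush = index.lookup("red")
```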

On 3 September 2012 22:25, Oleg Dulin <ol...@gmail.com> wrote:

> Dear Distinguished Colleagues:
>
>