You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2010/02/09 15:24:25 UTC

Combined MySQL and Lucene Query

Our project requires near-real time searches and constant updating.  The 
data is currently stored in a MySQL database and the Lucene index is 
updated as the database is modified

We have the search capability currently where we want it.  However, we 
are attempting to add the ability to "tag" documents in the 
index/database.  Since the data pots can be millions of records, we 
don't want to update the lucene index for tagging, but instead have a 
table of document IDs that we would like to be using to determine the 
tag sets.

The best option I've so far found is to retrieve both list of IDs as an 
integer array, sort them (so I only need one loop through), then loop 
through and look for matches between the two.  Attempting to use the 
list of Lucene IDs in an "In" query fails because the number of 
documents can be in the millions and MySQL chokes on it.

Any insight into how we could optimize this or do it?

Thanks in advance

Trevor Watson

Re: Combined MySQL and Lucene Query

Posted by "Andrew C. Smith" <po...@gmail.com>.

I think the way I would tackle this, since you don't want to update your
main index. Is to still leave MySql out of it and to do the following:

1) Create a separate Lucene index that only contains the tags information
and an ID
2) Take advantage of the MultiSearcher or ParallelMultiSearcher

This should still allow you to get some pretty good performance without
having to do expensive database queries.

Thanks,
Andrew Smith

On Tue, Feb 9, 2010 at 9:24 AM, Trevor Watson <tw...@datassimilate.com>wrote:

> Our project requires near-real time searches and constant updating.  The
> data is currently stored in a MySQL database and the Lucene index is updated
> as the database is modified
>
> We have the search capability currently where we want it.  However, we are
> attempting to add the ability to "tag" documents in the index/database.
>  Since the data pots can be millions of records, we don't want to update the
> lucene index for tagging, but instead have a table of document IDs that we
> would like to be using to determine the tag sets.
>
> The best option I've so far found is to retrieve both list of IDs as an
> integer array, sort them (so I only need one loop through), then loop
> through and look for matches between the two.  Attempting to use the list of
> Lucene IDs in an "In" query fails because the number of documents can be in
> the millions and MySQL chokes on it.
>
> Any insight into how we could optimize this or do it?
>
> Thanks in advance
>
> Trevor Watson
>