Posted to java-user@lucene.apache.org by Andrew Gilmartin <an...@andrewgilmartin.com> on 2013/01/25 17:15:41 UTC

Filtering top hits based on stored field? And Lucene 1.x -> 3.x for Dummies

I have been using Lucene since 1.x days, but that also means I am 
carrying around some information that is no longer relevant and using 
techniques that are antiquated. I am currently using 3.0 but I am sure I 
am using it in 1.0 fashion. I have two questions -- one general and one 
specific.

The specific question is how, in Lucene 3.x, can I filter the 
IndexSearcher.search() results based on stored fields within candidate 
hits? It is not acceptable to perform the filter post search as now my 
hits list is too short. In the past calling doc() during a search (with 
my own collector) resulted in a severe performance hit. Is that still 
the case? If not, great I will just do that. If it still is, how would 
you suggest I implement the filtering?

The general question is how can I best come up to speed with Lucene 3.6 
and/or 4.0? Should I just consider my existing knowledge redundant and 
learn Lucene and Solr anew? Or are there documents that can better 
direct my re-education?

-- Andrew

-- 
Andrew Gilmartin
andrew@andrewgilmartin.com
401-441-2062


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Filtering top hits based on stored field? And Lucene 1.x -> 3.x for Dummies

Posted by Ian Lea <ia...@gmail.com>.
To the best of my understanding you are spot on about the degradation.
Loading fields is costly, and loading them for thousands of docs is
liable to be very costly.  You can mitigate it by only loading the
fields you want with (in 4.x) reader.doc(id, fields) but it will still
be costly.
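
The trade-off discussed in this thread can be sketched in plain Java.
This is only a model under stated assumptions, not Lucene code: the
class StoredLoadSketch, its load() method, and both strategy methods
are hypothetical names, and the "stored" array stands in for stored
fields that a real index would fetch from disk. It counts simulated
document loads for filtering after the search versus filtering while
collecting.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; no Lucene classes are used.
// StoredLoadSketch, load, and both strategies are hypothetical names.
class StoredLoadSketch {
    final String[] stored; // stands in for stored field values kept on disk
    int loads = 0;         // number of simulated stored-document loads so far

    StoredLoadSketch(String[] stored) {
        this.stored = stored;
    }

    // Simulates an expensive stored-document fetch and counts it.
    String load(int doc) {
        loads++;
        return stored[doc];
    }

    // Strategy A: take the first n ranked candidates, then filter them.
    // At most n loads, but the surviving list may be shorter than n.
    List<Integer> filterAfterSearch(int[] rankedDocs, String wanted, int n) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < Math.min(n, rankedDocs.length); i++) {
            if (wanted.equals(load(rankedDocs[i]))) {
                kept.add(rankedDocs[i]);
            }
        }
        return kept;
    }

    // Strategy B: check every candidate as it is collected and keep the
    // first n survivors. The model assumes a collector cannot stop early,
    // because candidates do not arrive sorted by rank, so every candidate
    // is loaded.
    List<Integer> filterWhileCollecting(int[] rankedDocs, String wanted, int n) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : rankedDocs) {
            if (wanted.equals(load(doc)) && kept.size() < n) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

In this model strategy A performs at most n loads and can come back
short, while strategy B performs one load per candidate, which is where
the cost grows when a query matches thousands of documents.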


--
Ian.


On Fri, Jan 25, 2013 at 9:20 PM, Andrew Gilmartin
<an...@andrewgilmartin.com> wrote:
> Ian Lea wrote:
>
> Thank you for the quick and helpful reply. I had forgotten that Lucene's
> change document was one of the best examples of change documents around. I will
> read it.
>
>> On the specific question, calling doc() is still expensive.  You could
>> look at the FieldCache or the new DocValues stuff. See
>>
>> http://www.searchworkings.org/blog/-/blogs/introducing-lucene-index-doc-values
>> for info on the latter.
>
>
> I will explore that.
>
> It occurred to me that I do not know why the search performance degrades when
> doc() is called within the Collector. Is it simply that Lucene will present,
> for example, thousands of candidate hits (from millions of indexed
> documents) to the Collector even though the collector might only return the
> top handful? And so the Collector will need to load thousands of documents
> and it is this document loading that causes the performance degradation? Or
> is it more complex -- perhaps having to do with caches and other internals?
>
> -- Andrew


Re: Filtering top hits based on stored field? And Lucene 1.x -> 3.x for Dummies

Posted by Andrew Gilmartin <an...@andrewgilmartin.com>.
Ian Lea wrote:

Thank you for the quick and helpful reply. I had forgotten that Lucene's change document was one of the best examples of change documents around. I will read it.

> On the specific question, calling doc() is still expensive.  You could
> look at the FieldCache or the new DocValues stuff. See
> http://www.searchworkings.org/blog/-/blogs/introducing-lucene-index-doc-values
> for info on the latter.

I will explore that.

It occurred to me that I do not know why the search performance degrades when doc() is called within the Collector. Is it simply that Lucene will present, for example, thousands of candidate hits (from millions of indexed documents) to the Collector even though the collector might only return the top handful? And so the Collector will need to load thousands of documents and it is this document loading that causes the performance degradation? Or is it more complex -- perhaps having to do with caches and other internals?

-- Andrew






Re: Filtering top hits based on stored field? And Lucene 1.x -> 3.x for Dummies

Posted by Ian Lea <ia...@gmail.com>.
On the specific question, calling doc() is still expensive.  You could
look at the FieldCache or the new DocValues stuff. See
http://www.searchworkings.org/blog/-/blogs/introducing-lucene-index-doc-values
for info on the latter.
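
The general idea behind a per-document value cache can be sketched as
follows. This is a plain-Java model, not the FieldCache or DocValues
API: ColumnFilterSketch, Hit, and topHits are hypothetical names, and
the "column" array is an assumed in-memory array holding one value per
document id. The filter runs on every candidate before ranking, so the
top-n list is filled from the documents that pass, and each check is an
array read rather than a stored-document load.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java model of the idea; no Lucene classes are used here.
// All names (ColumnFilterSketch, Hit, topHits) are hypothetical.
class ColumnFilterSketch {

    // A scored candidate, as a collector would see it.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns the n best-scoring candidates whose column value equals `wanted`,
    // best first. `column[doc]` stands in for a per-document value array held
    // in memory.
    static List<Hit> topHits(int[] docs, float[] scores, String[] column,
                             String wanted, int n) {
        Comparator<Hit> byScore = (a, b) -> Float.compare(a.score, b.score);
        // Min-heap: the head is the weakest hit currently kept.
        PriorityQueue<Hit> queue = new PriorityQueue<>(byScore);
        for (int i = 0; i < docs.length; i++) {
            int doc = docs[i];
            if (!wanted.equals(column[doc])) {
                continue; // filtered out before it can take a top-n slot
            }
            queue.add(new Hit(doc, scores[i]));
            if (queue.size() > n) {
                queue.poll(); // drop the weakest kept hit
            }
        }
        List<Hit> result = new ArrayList<>(queue);
        result.sort(byScore.reversed()); // best score first
        return result;
    }
}
```

Because rejected documents never enter the queue, the result is not
shortened by the filter unless fewer than n documents pass it.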

On the general question, much of your Lucene knowledge will still be
relevant.  There'll be some new stuff you've never heard of, and loads
of changes behind the scenes you may be able to ignore, but the basic
concepts and techniques haven't changed that much.  I suggest reading
the Changes doc from 4.1.


--
Ian.


On Fri, Jan 25, 2013 at 4:15 PM, Andrew Gilmartin
<an...@andrewgilmartin.com> wrote:
> I have been using Lucene since 1.x days, but that also means I am carrying
> around some information that is no longer relevant and using techniques that
> are antiquated. I am currently using 3.0 but I am sure I am using it in 1.0
> fashion. I have two questions -- one general and one specific.
>
> The specific question is how, in Lucene 3.x, can I filter the
> IndexSearcher.search() results based on stored fields within candidate hits?
> It is not acceptable to perform the filter post search as now my hits list
> is too short. In the past calling doc() during a search (with my own
> collector) resulted in a severe performance hit. Is that still the case? If
> not, great I will just do that. If it still is, how would you suggest I
> implement the filtering?
>
> The general question is how can I best come up to speed with Lucene 3.6
> and/or 4.0? Should I just consider my existing knowledge redundant and learn
> Lucene and Solr anew? Or are there documents that can better direct my
> re-education?
>
> -- Andrew
>
> --
> Andrew Gilmartin
> andrew@andrewgilmartin.com
> 401-441-2062