
Posted to solr-user@lucene.apache.org by MitchK <mi...@web.de> on 2010/03/06 14:21:22 UTC

Filter Query or Main Query or facetting?

Hello community,

I am not sure what the best way is to handle the following problem:
I have an index with, let's say, 2 million documents, and there is a
check-field.
The check-field contains only boolean values (TRUE/FALSE).

What is the best way to query only documents with a TRUE check-value?
q, fq or faceting?

Looking at fq, I am worried that I will run out of memory if my
index grows too large.
The normal query (q) seems to be a bad solution, because it is not
designed for this use case.
What about faceting? I have no idea whether faceting would be a good
solution.

If it makes a difference: Most of the queries will be run against true
check-values.

The only alternative I have in mind is building two indexes; one with
checked values and one with unchecked (false) values. 

Thank you for sharing experiences. 

Kind regards,
- Mitch

Re: Filter Query or Main Query or facetting?

Posted by Erick Erickson <er...@gmail.com>.

The 250K is an approximation: (total number of docs)/8 bytes, as in
one bit per document. Really, all a filter is is a bit vector where
each bit represents whether the doc ID represented by that bit
should be included in the results or not. Technically, it's
(largest doc ID)/8, where (largest doc ID) may be bigger than
the number of docs if you've deleted/added documents and
haven't yet optimized. So the first byte represents docs 0-7,
the second byte docs 8-15, etc.
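
A minimal sketch of the bit-vector idea described above, written in
Python. It only illustrates the arithmetic and the bit layout; it is
not Lucene's or Solr's actual filter class, and the names
filter_size_bytes and BitFilter are made up for this example. Doc IDs
are assumed to be numbered from 0.

```python
def filter_size_bytes(max_doc: int) -> int:
    """Bytes needed for one bit per document ID, rounded up."""
    return (max_doc + 7) // 8


class BitFilter:
    """Toy bit vector: bit i says whether doc ID i passes the filter."""

    def __init__(self, max_doc: int) -> None:
        self.bits = bytearray(filter_size_bytes(max_doc))

    def set(self, doc_id: int) -> None:
        # Byte index is doc_id // 8, bit position inside it is doc_id % 8.
        self.bits[doc_id // 8] |= 1 << (doc_id % 8)

    def get(self, doc_id: int) -> bool:
        return bool(self.bits[doc_id // 8] & (1 << (doc_id % 8)))


size = filter_size_bytes(2_000_000)  # 250000 bytes for 2,000,000 docs
f = BitFilter(2_000_000)
f.set(9)  # doc 9 lives in the second byte
```

For 2,000,000 documents this comes to 250,000 bytes per filter, which
is where the "250K" figure comes from.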

See the Lucene website. Here's a place to start as far as scoring
is concerned:
http://lucene.apache.org/java/3_0_1/scoring.html

And, of course, there's Lucene in Action (the second edition is
available from Manning, as an e-book at least). But I admit that
making the connection from the fq parameter to the underlying
Lucene structure is part of the "tribal knowledge" series. At
least I can't point you to a document offhand.


Best
Erick


Re: Filter Query or Main Query or facetting?

Posted by MitchK <mi...@web.de>.

Erick,

your response was really helpful - the problem is solved for now.

However, I have two questions:
How do you know that the bit vector has a maximum size of 250K?
Did I overlook something (because I have an index of 2,000,000
documents)?

Is there any documentation out there that explains how Solr's
IndexSearcher works?
I think this would be really helpful for future questions.

Kind regards
- Mitch



Re: Filter Query or Main Query or facetting?

Posted by Erick Erickson <er...@gmail.com>.

The last thing I'd do is partition my index into two, unless and
until I really *knew* I had speed problems. The added complexity
isn't worth it and your index isn't huge, so search speed can
probably be addressed without that complexity.

Filter queries are probably your first choice here. Memory isn't an
issue because they're implemented (as I understand) as a bit
vector. That is, each one (and you only have two) will be 250K
plus a slight overhead. Utterly insignificant.
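
As a rough illustration of the filter-query approach, here is a small
Python sketch that builds a request with the restriction in fq rather
than in q. The host, port and path, the field name "title" and the
helper build_filtered_url are assumptions made for this example; only
the q and fq parameters and the check:true clause come from the thread.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and path to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_filtered_url(user_query: str) -> str:
    """Put the user's query in q and the boolean restriction in fq."""
    params = [
        ("q", user_query),     # main query, scored
        ("fq", "check:true"),  # filter query, restricts the result set
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_filtered_url("title:lucene")
```

Keeping the restriction in fq means the user's query text does not have
to be rewritten, and the same filter can be reused across requests.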

You can easily experiment with the difference in speed between q
and fq on a single index. You're right
that if you just tack on an AND to the q clause, the true/false
will contribute to the score, but I think they'll all contribute the
same amount, effectively doing nothing to the ranking. There is
something of an efficiency argument here, but maybe not
enough to notice.

Faceting is generally used more for answering questions like
"given I've searched on query <Q>, how many of my answers
are in groups A, B and C?" and then drilling down to things like
"show me the ones in group C". That, while related to your
problem, isn't what it sounds like you're after.
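
To make the distinction concrete, here is a toy Python sketch with a
made-up result set. It does not call Solr; it only shows the difference
between counting hits per group (what faceting reports) and keeping the
hits of one group (what a filter does).

```python
from collections import Counter

# Made-up result set for some query <Q>: (doc ID, group) pairs.
hits = [(1, "A"), (2, "C"), (3, "B"), (4, "C"), (5, "A"), (6, "C")]

# Faceting: how many of the hits fall into each group?
facet_counts = Counter(group for _, group in hits)

# Drill-down: keep only the hits in group C.
group_c = [doc_id for doc_id, group in hits if group == "C"]
```

In Solr, the counts would be requested with facet=true and
facet.field=<field>, while the drill-down would typically be an fq
clause on the same field.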

When measuring speed, remember that the first few queries
aren't representative.
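
One way to account for this when benchmarking is to run a few
throwaway queries first and only time the ones after that. The sketch
below assumes a caller-supplied run_query function (hypothetical) that
sends one request; the warm-up and run counts are arbitrary defaults.

```python
import time
from statistics import median
from typing import Callable


def timed_runs(run_query: Callable[[], object],
               warmup: int = 3, runs: int = 10) -> float:
    """Return the median seconds per call, ignoring the warm-up calls."""
    for _ in range(warmup):
        run_query()  # lets caches fill; these timings are discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return median(samples)
```

Running it once with a q-only request and once with an fq request
against the same index gives comparable numbers for the two approaches.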

HTH
Erick

Re: Filter Query or Main Query or facetting?

Posted by MitchK <mi...@web.de>.

Yes, that's possible. 

However, I thought that the normal q param forces Solr to look up every
check-field to see whether it is true or false.
So I am looking for something like a tree that divides the index into two
pieces - true and false.
Then Solr would not need to look up the check-field anymore, because it
follows the right node of the tree, and so the IndexSearcher would be
more efficient - I emphasize that this is what I think; I don't really know.
Another point is that I have read that the q param scores every field,
and I don't want the check-field to contribute to the score, even in part.

Hopefully I have explained my problem correctly.
If there are questions, please ask.

- Mitch


Re: Filter Query or Main Query or facetting?

Posted by Erick Erickson <er...@gmail.com>.

Hmmmm, why isn't q helpful? You can specify field:value pairs
in a q clause, so you can pretty easily tack on an AND check:true.

I'd try that and measure performance before trying more complex
solutions....
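
A small Python sketch of tacking the clause onto an existing query
string. The helper name with_check_clause and the example "title"
field are made up; the parentheses are there so that the AND applies to
the whole user query rather than only to its last term.

```python
def with_check_clause(user_query: str) -> str:
    """Wrap the user's query and additionally require check:true."""
    return f"({user_query}) AND check:true"


q = with_check_clause("title:lucene OR title:solr")
```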

Or do I misunderstand the problem?

HTH
Erick
