Posted to java-user@lucene.apache.org by Beady Geraghty <be...@gmail.com> on 2006/01/06 18:04:53 UTC

need some advice/help with negative query.

I would like to do queries that are negative. I mean a query with
only negative terms and phrases.  For example, retrieve all
documents that do not contain the term "apple".

For now, I have a limited set of documents (say, 10000) to index.
I can create a bitset that represents the search result of hits on "apple".
Then I complement (XOR) the result.
Each bit corresponds to a document ID.
My question is:
Inside Lucene, are the hits represented in some form of a bitset?
Can I get at it directly?  I saw the BitSet class.  (I now use
Java's BitSet class.)
Assuming that hits are internally represented as a bitset, for a
small number of documents, the bitset won't be very big,
and if there are plenty of hits and many, many more documents,
is the bitset still kept entirely
in memory as well?

Thank you

Re: need some advice/help with negative query.

Posted by Paul Elschot <pa...@xs4all.nl>.
On Friday 06 January 2006 20:57, Yonik Seeley wrote:
> Should we detect the case of all negative clauses and throw in
> a MatchAllDocsQuery?
> 
> I guess this would be done in the QueryParser, but one could also make
> a case for doing it in the BooleanQuery.

Overriding getBooleanQuery() from QueryParser makes that easy to do.
Allowing negative-only queries by default will probably cause performance
problems.

Automatically adding the MatchAllDocsQuery in a BooleanQuery
is feasible, and then the case of only negative clauses should be
caught in the default QueryParser to prevent performance problems.
But given the age of BooleanQuery, such a change is probably not
worthwhile.

Regards,
Paul Elschot



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: need some advice/help with negative query.

Posted by Beady Geraghty <be...@gmail.com>.
I thought along the lines of dummy:all as well, but for some reason,
I chose the bitset.  I am not sure if it matters which route I take.

My situation is that for now, I have, say, 10000 documents,
but I may increase that by 100 times or more.  I would guess that
half of the documents qualify and half don't.

It appears that everyone suggests that I take MatchAllDocsQuery
from the trunk. Is this the choice regardless of the number of
documents I have?

Thanks,







Re: need some advice/help with negative query.

Posted by Beady Geraghty <be...@gmail.com>.
Thank you all for your answer.


Re: need some advice/help with negative query.

Posted by Yonik Seeley <ys...@gmail.com>.
+1 from me.
-Yonik

On 1/7/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> +1 to Hoss's suggested enhancement to QueryParser.
>
> I'll volunteer to implement this barring any objections in the next
> day or so.
>
>         Erik



Re: need some advice/help with negative query.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
+1 to Hoss's suggested enhancement to QueryParser.

I'll volunteer to implement this barring any objections in the next  
day or so.

	Erik




Re: need some advice/help with negative query.

Posted by Chris Hostetter <ho...@fucit.org>.
: > Should we detect the case of all negative clauses and throw in
: > a MatchAllDocsQuery?
: >
: > I guess this would be done in the QueryParser, but one could also make
: > a case for doing it in the BooleanQuery.

If it were going to be done, I would add it to the QueryParser, and not to
BooleanQuery ... BooleanQuery serves as a good "container" for other
queries along with their required/optional/prohibited status that can be
passed around -- if someone is building up a BooleanQuery and
incrementally adding clauses to it, the only place it could safely assume
it should add a MatchAllDocsQuery if all of its clauses are negated is
in the createWeight method -- and that would be a very weird place
for the query to modify itself.

Putting logic like this in the QueryParser seems to make the
most sense to me, as long as there was an option to toggle the behavior
on/off for people like me who use QueryParser to generate BooleanQueries
based on one set of input, and then add required/optional clauses to them
based on other sets of input.

I'd even support defaulting the behavior to "on" just because it seems
like it would be more DWIM for casual users (who might not notice an
option for turning it on as easily as experienced users would find an
option for turning it off).



-Hoss




Re: need some advice/help with negative query.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 6, 2006, at 2:57 PM, Yonik Seeley wrote:
> Should we detect the case of all negative clauses and throw in
> a MatchAllDocsQuery?
>
> I guess this would be done in the QueryParser, but one could also make
> a case for doing it in the BooleanQuery.

In a custom (non-generalizable) query parser that I've built, I
detect a negative-only query and then nest it within a BooleanQuery
with a MatchAllDocsQuery.
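
For readers who want to see the shape of that check, here is a minimal
sketch in plain Java. It is a toy model and does not use Lucene's classes:
NegativeOnlyRewrite, Clause, Occur and the MATCH_ALL placeholder are all
names invented for illustration. The "enabled" flag stands in for the
on/off toggle suggested elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are Lucene classes.
final class NegativeOnlyRewrite {

    enum Occur { MUST, SHOULD, MUST_NOT }

    // Placeholder standing in for a "match every document" clause.
    static final String MATCH_ALL = "<match-all>";

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // True when there is at least one clause and every clause is prohibited.
    static boolean isNegativeOnly(List<Clause> clauses) {
        if (clauses.isEmpty()) {
            return false;
        }
        for (Clause c : clauses) {
            if (c.occur != Occur.MUST_NOT) {
                return false;
            }
        }
        return true;
    }

    // Returns a copy of the clauses; if the rewrite is enabled and the query
    // is negative-only, a required match-all clause is put in front.
    static List<Clause> rewrite(List<Clause> clauses, boolean enabled) {
        List<Clause> out = new ArrayList<Clause>(clauses);
        if (enabled && isNegativeOnly(clauses)) {
            out.add(0, new Clause(MATCH_ALL, Occur.MUST));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Clause> query = new ArrayList<Clause>();
        query.add(new Clause("apple", Occur.MUST_NOT));
        // With the toggle on, the single "-apple" clause gains a match-all partner.
        System.out.println(rewrite(query, true).size());
    }
}
```

In a real parser the same test would run over the parsed clause list before
the BooleanQuery is built. The exact hook depends on the Lucene version in
use.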

	Erik




Re: need some advice/help with negative query.

Posted by Yonik Seeley <ys...@gmail.com>.
Should we detect the case of all negative clauses and throw in
a MatchAllDocsQuery?

I guess this would be done in the QueryParser, but one could also make
a case for doing it in the BooleanQuery.

-Yonik



Re: need some advice/help with negative query.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
With Lucene's trunk, there is a MatchAllDocsQuery.  You could use
this in a BooleanQuery with your negative-only query.

Another option, if you're on Lucene 1.4.3, is to index the same value
for a dummy field for every document (say, "dummy:all") and use a
TermQuery on it in a BooleanQuery with the negative-only query.

As for BitSets, if you need to go that route, a QueryFilter would
give you the BitSet back that you could easily complement, but that
might be a bit overkill for what you need given the option above.
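
If the BitSet route is taken, the complement step itself needs only
java.util.BitSet. The sketch below is illustrative and assumes the set of
"apple" matches has already been obtained somehow; BitSetComplement,
complement() and approxBytes() are made-up names, not Lucene API. One
caveat: a plain complement also turns on the bits of any document numbers
that belong to deleted documents, so those may need to be masked out
separately.

```java
import java.util.BitSet;

// Illustrative helper, not part of Lucene.
final class BitSetComplement {

    // Returns the documents in [0, maxDoc) that are NOT set in matches.
    // The input BitSet is left unmodified.
    static BitSet complement(BitSet matches, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);    // start with every document number
        result.andNot(matches);   // clear the ones that matched
        return result;
    }

    // Rough size of one bit per document, in bytes (ignores object overhead).
    static long approxBytes(int maxDoc) {
        return (maxDoc + 7L) / 8L;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet apple = new BitSet(maxDoc);
        apple.set(1);
        apple.set(4);
        apple.set(7);

        System.out.println(complement(apple, maxDoc)); // {0, 2, 3, 5, 6, 8, 9}
        System.out.println(approxBytes(1000000));      // 125000
    }
}
```

At one bit per document, 10000 documents need roughly 1.25 KB and a million
documents roughly 125 KB, so keeping such a set entirely in memory is cheap
at the sizes mentioned in the question.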

	Erik




Re: need some advice/help with negative query.

Posted by Daniel Naber <lu...@danielnaber.de>.
On Friday 06 January 2006 18:04, Beady Geraghty wrote:

> For now, I have a limited set of documents (say, 10000) to index.
> I can create a bitset that represents the search result of hits on
> "apple".

The development version of Lucene contains a MatchAllDocsQuery so you can 
create queries (programmatically) like:

MatchAllDocsQuery -apple

I guess this is memory-efficient. MatchAllDocsQuery can be backported
easily.

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: need some advice/help with negative query.

Posted by Beady Geraghty <be...@gmail.com>.
Thanks.


Re: need some advice/help with negative query.

Posted by Paul Elschot <pa...@xs4all.nl>.
On Friday 06 January 2006 18:04, Beady Geraghty wrote:
> I would like to do queries that are negative. I mean a query with
> only negative terms and phrases.  For example, retrieve all
> documents that do not contain the term "apple".
> 
> For now, I have a limited set of documents (say, 10000) to index.
> I can create a bitset that represents the search result of hits on "apple".
> Then I complement (XOR) the result.
> Each bit corresponds to a document ID.
> My question is:
> Inside Lucene, are the hits represented in some form of a bitset?
> Can I get at it directly?  I saw the BitSet class.  (I now use
> Java's BitSet class.)
> Assuming that hits are internally represented as a bitset, for a
> small number of documents, the bitset won't be very big,
> and if there are plenty of hits and many, many more documents,
> is the bitset still kept entirely
> in memory as well?

A Hits object is implemented by caching some of the highest scoring
documents; when more documents are needed, the search is
repeated to collect more documents.

The problem with negative queries is that the scores of the results
do not vary, so it is not useful to keep only the highest scoring docs.
This also means that all results will have to be processed further
in some other way.
The easiest way to do that is to use the MatchAllDocsQuery
as indicated earlier, and then use the low level search API
with your own HitCollector.
You can then use any data structure in your HitCollector.
A simple and fast collect() implementation just counts the results, and
that can already be quite informative. Setting up a BitSet
for the matching document numbers is also possible.
It's best to avoid accessing the index via the IndexReader inside
the collect() implementation of the HitCollector.
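
The two collector styles described above can be sketched in plain Java as
follows. This is only an illustration: DocCollector is a hypothetical
stand-in that mirrors the collect(doc, score) callback shape, and feed()
plays the role of the searcher. Neither is a Lucene type, and a real
collector would extend whatever base class the Lucene version in use
provides.

```java
import java.util.BitSet;

// Hypothetical stand-in for a collector callback. It mirrors the
// collect(doc, score) shape but is NOT a Lucene type.
interface DocCollector {
    void collect(int doc, float score);
}

// Counts matches without storing them.
final class CountingCollector implements DocCollector {
    private int count;

    public void collect(int doc, float score) {
        count++;
    }

    int count() {
        return count;
    }
}

// Records each matching document number as one bit.
final class BitSetCollector implements DocCollector {
    private final BitSet bits;

    BitSetCollector(int maxDoc) {
        bits = new BitSet(maxDoc);
    }

    public void collect(int doc, float score) {
        bits.set(doc);
    }

    BitSet bits() {
        return bits;
    }
}

final class CollectorDemo {
    // Stand-in for the searcher: reports every matching doc once.
    // All scores are equal, as they would be for a negative-only query.
    static void feed(int[] matchingDocs, DocCollector collector) {
        for (int doc : matchingDocs) {
            collector.collect(doc, 1.0f);
        }
    }

    public static void main(String[] args) {
        int[] matches = {0, 2, 3, 5};
        CountingCollector counter = new CountingCollector();
        BitSetCollector collected = new BitSetCollector(6);
        feed(matches, counter);
        feed(matches, collected);
        System.out.println(counter.count() + " matches: " + collected.bits());
    }
}
```

Both collectors do constant work per document and touch no index data,
which is in line with the advice to keep collect() cheap.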

Regards,
Paul Elschot
