You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/12/15 01:04:50 UTC

[lucy-dev] Re: Stopwords and AND queries

cc to lucy-dev...

On Mon, Dec 13, 2010 at 01:17:52PM -0800, Marvin Humphrey wrote:
> On Mon, Dec 13, 2010 at 06:45:08PM +0100, Nick Wellnhofer wrote:
> > 
> > If I run the attached script, I don't get any results. It seems that 
> > stopwords aren't removed from the AND query.
> 
> Thanks for the report.  I've distilled the test down to the sample below, which
> produces an ANDQuery where one clause is a NoMatchQuery.
> 
> The flaw resides somewhere in the interplay of QParser_Expand() and
> QParser_Expand_Leaf().  The stopwords *are* being removed, so that expanding
> the "foo" LeafQuery produces an ANDQuery with no clauses.  That's being
> translated to a NoMatchQuery here, within QParser_Expand_Leaf():
> 
>     if (VA_Get_Size(queries) == 0) {
>         retval = (Query*)NoMatchQuery_new();
>     }
> 
> Somehow, we have to get Expand_Leaf() to differentiate between a clause that
> can't match and a clause should be removed, and to communicate that
> information back to Expand() successfully.  Probably we want to use NULL to
> symbolize a clause that should be removed, but that causes test failures.
> 
> More as I keep digging...

We definitely need Expand() and Expand_Leaf() to distinguish between these two
types of results:

    * A query which will fail to match.
    * A query which neither matches nor fails to match.

Use Case 1: say that we have a QueryParser with no default fields, and with
"heed_colons" enabled.  This query should fail to match...

    field1:foo AND bar

... because the required 'bar' term will have no fields.

Use Case 2: say that we have a QueryParser with a single "content" field,
which has an Analyzer which includes an English stoplist.  At present, the
following query string will fail to match, because of the bug that Nick has
uncovered:

    the AND smiths

The problem is that the term 'the' is a required term thanks to the AND
operator, but since no fields have an Analyzer that produces tokens when fed
the string 'the', the current implementation of Expand_Leaf() returns a
NoMatchQuery.  We need to change that logic so that the return value from
Expand_Leaf() communicates to Expand() "this clause neither matches nor fails
to match."

FWIW, Nick's bug in Use Case 2 arose because of a change which fixed Use 
Case 1. :|

    ------------------------------------------------------------------------
    r6253 | creamyg | 2010-07-31 22:06:50 -0700 (Sat, 31 Jul 2010) | 15 lines

    Fix a bug in QParser_Expand_Leaf().  It's not allowed to return NULL, but
    it was under certain circumstances.  First, if there are no default fields
    (either because an empty array was supplied as the "fields" param to
    QParser_new, or because the Schema has no indexed fields).  Second, if the
    Analyzer returns no tokens after parsing the LeafQuery's text -- e.g. if a
    Stopalizer removes all stopwords, leaving no tokens.  

    This change may reveal latent bugs in people's AND-joined queries.  Where
    a search term was being ignored before, now it will be included as a
    NoMatchQuery, causing AND-ed clauses to return no documents (correctly)
    instead of returning documents which should have been excluded by the
    improperly-ignored required term.

    (Replay of r6089.)

It seems to me that the best way to have Expand_Leaf() communicate to Expand()
that a clause is optional is to return NULL.  That's a little confusing,
because when compiling a Query to a Matcher, a NULL Matcher is synonymous with
"fails to match".  But the only other way I can think of to deal with it would
be to introduce a new class, e.g. "VoidQuery", to carry that information.

Expand() will also have to change so that it can return NULL.  Expand() is
recursive, calling Expand() on child queries.  Because it can't know whether
it is operating on a top-level query or a subquery, it can't know whether it
should translate a NULL returned by Expand_Leaf() into a NoMatchQuery or not.

QParser_Parse(), on the other hand, *does* know that it's operating on a
top-level query.  It can check the return value of Expand() and translate
NULLs to NoMatchQueries.  Therefore, the changes under consideration only
affect the expert APIs (Expand and Expand_Leaf) and not the primary API,
Parse().

The last tricky bit is to changing the internals of Expand_Leaf() so that it
returns NULL and NoMatchQuery as appropriate.  I think we need to use a
boolean to track the special case of Analyzers which didn't output any tokens.
If that boolean is true and there are no concrete child Queries, we return
NULL.

Marvin Humphrey


Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Dec 16, 2010 at 12:42:53PM -0500, Robert Muir wrote:
> FWIW (definitely not trying to imply its the best!), NULL is what
> lucene-java does if the Analyzer returns zero tokens.  This means it has to
> be careful too in all the other processing, for example the code that
> applies boost has to handle the query == null case

I had a discussion with my colleagues here at Eventful about what should a
query like this ought to return:

    foo AND ()

We concluded that if you built such a query programmatically (i.e. using the
OO constructors), then it should return nothing.  That's how it works now,
because empty ANDQuery and empty ORQuery both compile down to NULL Matchers --
those empty parens fail to match, so the parent ANDQuery also fails.

In contrast, we agreed that how a tolerant, user-facing query parser such as
Lucy::Search::QueryParser ought to handle such a query string was ambiguous.

It seems reasonable to pursue a resolution to the current bug by with a minor
mod limited to the query parsing stage.  I don't think we ought to touch the
Query classes themselves.

Put another way... Analysis, including application of stoplists, belongs to
the query-parsing stage of compilation, and not to the lower level of
compiling a Query object down to a Matcher.  There's no way in Lucy to produce
a PolyQuery (the parent class for
ANDQuery/ORQuery/NOTQuery/RequiredOptionalQuery) which has one or more NULL
child queries.  We don't have semantics for what a NULL child query would
mean, and I don't think we should add such semantics.  We should avoid
following Lucene's example in this case.

Marvin Humphrey


Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Dec 16, 2010 at 12:04:30PM -0800, Nathan Kurz wrote:
> I've only glanced at this, but neither NULL nor a VoidQuery really
> seems like the actual solution here.  

Thanks for the high-level perspective, Nate.  In light of your comments, I
definitely feel that this merits a targeted, non-public bugfix and that adding
a new VoidQuery API would be inappropriate.

> If I search for "The Smiths", it means I'm searching for that term.
> If "The" is stop listed, there is simply no way to answer a query that
> uses it with anything but an error message.  

True, but we'll have to give them broken results anyway. :(  It's better to
provide the results for "Smiths" than no results at all.

> It seems like there also needs to be treatment at the QueryParser level to
> reject or modify queries that attempt to use stop terms.  

Point taken.  I think that would be a useful feature for a query parser.

However, the current QueryParser's design does not allow for it.  QueryParser
is not aware of stop words -- it only knows about Analyzers.  Processing the
removal of stop words at the QueryParser level (so that QueryParser could
report which stop words were removed) would require piercing at least three
levels of encapsulation.

It wouldn't be hard to build a custom query parser that processed stop words
early on, adapting the recipe in Lucy::Docs::Cookbook::CustomQueryParser.  I
think that needs to be the answer.

> More generally, it seems like Stop Lists themselves should be discouraged as
> a shortcut from earlier times when disk storage was at a premium.

We've been discouraging stoplists ever since KinoSearch 0.05.  Unlike Lucene's
"StandardAnalyzer", our PolyAnalyzer does not remove stopwords by default.  I
used '"the smiths"' and '"the who"' to illustrate why stoplists suck in my
2006 OSCON presentation.

Nevertheless, index size is still a major concern, and stoplists have their
place.  Disk is cheap, but RAM is expensive -- and if you want to run search
clusters under very heavy load, you need indexes that can fit into the OS
cache.  Stoplists can help with that.

> Which is to say, I think the current behaviour is correct.  If you
> manage to get a query through asking for a stop listed term, the
> answer is that it is not there, whether in a phrase or a AND. Courtesy
> says that you would return an error message or correct the query, but
> this should be handled by the front end and not by the index proper.

I agree that this would be the best behavior.  It can be achieved by
subclassing QueryParser and overriding Expand_Leaf().

> ps.   If you still feel you need to act, I think you need something
> like a static StopTerm and to allow the Boolean query classes to
> decide how they want to treat this.  But I'd recommend against adding
> this complexity unless you're certain it's a real problem that can't
> be handled as interface.

I would oppose adding such complexity elsewhere for the sake of stoplists.
Stoplists are a lousy band aid to begin with.  It would be a poor engineering
compromise to saddle crucial classes like ANDQuery with special-case code when
you're still never going to be able to search for '"the who"'.

Marvin Humphrey


Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries

Posted by Nathan Kurz <na...@verse.com>.
I've only glanced at this, but neither NULL nor a VoidQuery really
seems like the actual solution here.  If a user searches for [foo
test], what they want is a list of documents that contain both.  They
don't want the documents that contain only [test], with no notice that
[foo] is stop listed.

If I search for "The Smiths", it means I'm searching for that term.
If "The" is stop listed, there is simply no way to answer a query that
uses it with anything but an error message.  It seems like there also
needs to be treatment at the QueryParser level to reject or modify
queries that attempt to use stop terms.  More generally, it seems like
Stop Lists themselves should be discouraged as a shortcut from earlier
times when disk storage was at a premium.

Which is to say, I think the current behaviour is correct.  If you
manage to get a query through asking for a stop listed term, the
answer is that it is not there, whether in a phrase or a AND. Courtesy
says that you would return an error message or correct the query, but
this should be handled by the front end and not by the index proper.

--nate

ps.   If you still feel you need to act, I think you need something
like a static StopTerm and to allow the Boolean query classes to
decide how they want to treat this.  But I'd recommend against adding
this complexity unless you're certain it's a real problem that can't
be handled as interface.

On Thu, Dec 16, 2010 at 9:42 AM, Robert Muir <rc...@gmail.com> wrote:
> On Wed, Dec 15, 2010 at 3:08 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> On Wed, Dec 15, 2010 at 04:28:54PM +0100, Nick Wellnhofer wrote:
>>> I only had a cursory glance at the code and it seems that returning NULL
>>> is the easiest approach though it looks a bit hackish. Introducing a new
>>> VoidQuery class is probably the cleanest solution but I guess it
>>> requires a lot more additional code.
>
> FWIW (definitely not trying to imply its the best!), NULL is what
> lucene-java does if the Analyzer returns zero tokens.
> This means it has to be careful too in all the other processing, for
> example the code that applies boost has to handle the query == null
> case
>

Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Dec 15, 2010 at 3:08 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Wed, Dec 15, 2010 at 04:28:54PM +0100, Nick Wellnhofer wrote:
>> I only had a cursory glance at the code and it seems that returning NULL
>> is the easiest approach though it looks a bit hackish. Introducing a new
>> VoidQuery class is probably the cleanest solution but I guess it
>> requires a lot more additional code.

FWIW (definitely not trying to imply its the best!), NULL is what
lucene-java does if the Analyzer returns zero tokens.
This means it has to be careful too in all the other processing, for
example the code that applies boost has to handle the query == null
case

Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Dec 15, 2010 at 04:28:54PM +0100, Nick Wellnhofer wrote:
> I only had a cursory glance at the code and it seems that returning NULL 
> is the easiest approach though it looks a bit hackish. Introducing a new 
> VoidQuery class is probably the cleanest solution but I guess it 
> requires a lot more additional code.

I wouldn't mind adding VoidQuery if I thought it would be useful outside of
this context.  But I don't think it will.

There's another way: we can add a "fails_to_match" attribute to NoMatchQuery.
It would default to true, preserving NoMatchQuery's current behavior.
However, we can have QParser_Expand_Leaf() set "fails_to_match" to false,
allowing QParser_Expand() to know when it can safely drop a clause from an
ANDQuery.

There's still ugliness in that approach, but the ugliness gets buried in
places people don't look very often -- instead of a new top-level class, we
add an obscure attribute to a reasonably obscure class.  Only QParser_Expand()
and QParser_Expand_Leaf() need to know this information.  Even QParser_Parse()
doesn't care, because it makes no difference whether a top-level NoMatchQuery
has "fails to match" or "neither matches nor fails to match" semantics.  A
search for a stopword on its own...

    the

... still returns no documents.

> Another solution might be to prune stop words in a separate pass over 
> the query string.

Unfortunately, I don't think that's feasible.  

QueryParser has two-phase tokenization: the first phase extracts leaves (terms
and phrases) and QueryParser-specific tokens like...

   + - ( ) AND NOT OR 

... but it applies no field-specific processing.  The second tokenization
phase runs once for each field, and uses the Analyzer specific to that field
(which it gets from the Schema).  

We can't use any of the field-specific Analyzers during the first phase
because that tokenization is shared across all fields -- and futhermore, such
an Analyzer would have to be specially tuned to extract QueryParser-specific
tokens properly.  We can't run the Analyzer twice during the second phase
because we don't want to e.g. stem something that's already stemmed.

Marvin Humphrey


[lucy-dev] Re: [KinoSearch] Stopwords and AND queries

Posted by Nick Wellnhofer <we...@aevum.de>.
On 15/12/2010 01:04, Marvin Humphrey wrote:
> It seems to me that the best way to have Expand_Leaf() communicate to Expand()
> that a clause is optional is to return NULL.  That's a little confusing,
> because when compiling a Query to a Matcher, a NULL Matcher is synonymous with
> "fails to match".  But the only other way I can think of to deal with it would
> be to introduce a new class, e.g. "VoidQuery", to carry that information.

I only had a cursory glance at the code and it seems that returning NULL 
is the easiest approach though it looks a bit hackish. Introducing a new 
VoidQuery class is probably the cleanest solution but I guess it 
requires a lot more additional code.

Another solution might be to prune stop words in a separate pass over 
the query string.

Nick