You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by "J. Delgado" <jo...@gmail.com> on 2007/04/09 09:46:40 UTC

Progressive Query Relaxation

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.

Re: Progressive Query Relaxation

Posted by Yonik Seeley <yo...@apache.org>.
On 4/9/07, J. Delgado <jo...@gmail.com> wrote:
> Has anyone within the Lucene or Solr community attempted to code a
> progressive query relaxation technique similar to the one described
> here for Oracle Text?
> http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

I've thought about it in the past as an alternative to dismax.
Instead of getting the query terms and constructing all the different
clauses (disjunctionMax, pharseQuery, sloppyPhraseQuery, etc), one
could instead run a series of queries, ordered from most to least
restrictive.  Nothing has been implemented along these lines though.

-Yonik

Re: Progressive Query Relaxation

Posted by "J. Delgado" <jd...@lendingclub.com>.
Hoss,

I never got to acknowledge your analisis. Well done. I do want to hear your
opinion about the following posting I sent to the list, which aims and
looking at the anolalogy between search engines and relational/XML databases
as the progress to evolve into a single type of retrieval system:

The ever growing presence of mingled structured and unstructured data is a
fact of life and modern systems we have to deal with. Clearly, the tendency
is that full-text indexing is moving towards DB functionality, i.e.
<attribute,value> fields for projection/filtering, sorting, faceted queries,
transactional CRUD operations etc. Though set manipulation is not Lucene's
or Solr's forte, the document-object model maps very well to rows of
relational sets or tables, evermore when CLOBs and TEXT fields where
introduced.

On the other hand, relational databases with XML and OO extensions and
native XML repositories still have to deal with the problem of RANKING
unstructured text and combination of text fragments and structured
conditions, thus  dealing no longer just with a set/relational model  that
yields binary answers but extending their query languages to handled the
concept of fuzziness, relevance, etc. ( e.g. SQL/MM, XQuery-FullText).

I would like once again to open this can of worms, and perhaps think out of
the box, without classifying DB and Full-Text as simply different, as we
analyze concepts to further understand the real path for evolution of
Lucene/Sorl

Here is a very interesting attempt to create a special type of "index"
called Domain Index to query unstructured data within Oracle by Marcelo
Ochoa:
https://issues.apache.org/jira/browse/LUCENE-724

Other interesting articles:

XQuery 1.0 - Full-Text:
http://www.w3.org/TR/xquery-full-text/
SQL/MM Full-Text
http://www.wiscorp.com/2CD1R1-02-fulltext-2001-12.pdf

Discussions on *XML data model vs. relational model*
http://www.xml.com/cs/user/view/cs_msg/2645

http://www.w3.org/TR/xpath-datamodel/
http://en.wikipedia.org/wiki/Relational_model


-- J.D.
2007/4/10, Chris Hostetter <ho...@fucit.org>:
>
>
> : Agreed, but best match is not ONLY about keywords. Here is where the
> : system developer can provide extra intelligence by doing query
> : re-writing.
>
> I finally got a chance to read through the URL (disclaimer: i do not have
> "a basic working knowledge of Oracle Text, such as the operators used in
> query expressions.")
>
> At it's core what is being described here can easily be done with a custom
> request handler that takes in a multivalue "q" param, and executes them in
> order until it finds some matches ... careful math when dealing start/rows
> and the number of results from each query make it easy to ensure that you
> can seemlessly return results from any/all queries in the order described
> (allthough you'd have to do something funky with the raw score values if
> you actually wanted to return them to the client)
>
> In general though, I agree with Walter ... this seems like a very naive
> approach.  At a very low conceptually level, The DisMaxRequestHandler does
> what the early counter example in the link talks about...
>
> >>  select book_id from books
> >>      where contains (author, '(michel crichton) OR (?michel ?crichton)
> >>      OR (michel OR crichton) OR (?michel OR ?crichton)
>
> the problem is that the two critisism of this appraoch (which may be valid
> in Oracle text matching) don't really apply in Solr/Lucene...
>
> >>   1.  From the user's point of view, hits which are a poor match will
> be
> >> mixed in with hits which are a good match. The user wants to see good
> >> matches displayed first.
>
> "poor" hits won't score as high as "good" hits -- boost
> values can be assigned for hte various pieces of the DisMax query so that
> exact phrase matches can be weighted better then individual word matches,
> coordFactors will ensure that docs only matching a few words don't score
> as well as docs matching all of the words, etc...
>
> >>   2. From the system's point of view, the search is inefficient. Even
> if
> >> there were plenty of hits for exactly "Michel Crichton", it would still
> >> have to do all the work of the fuzzy expansions and fetch data for all
> the
> >> rows which satisfy the query.
>
> My problem with this claim is the assumption that once you find lots of
> hits for "Michel Crichton" you don't need to keep looking for "Michel" or
> "Crichton" ... by this logic, many docs that contain the exact phrase
> "Michel Crichton" (and are roughly the same length) will get the same
> score, and the query will stop there ... the benefit of looking for
> 8everything* as a single query, is that the scores can become more fine
> grained -- docs with 1 exact match that *also* contain things like "Mr
> Crichton" several dozen times will score higher then docs with just that
> one exact match (cosider an article about "Michel Crichton" in which his
> full name appears only once vs an article listing popular authors, in
> which "Michel Crichton" appears exactly once)
>
> : Why do you say this? The rank is still provided by the search engine
> : BASED ON THE QUERY submitted and it does consider natural language
> : text. It's just leaving the order of execution in the hands of the
> : developer who knows better what the system should return for some
> : specific cases.
>
> evaluating each of the query parts in isolation and then aggregating the
> results doesn't take into account the *cumulative* value of the parts ...
> it's like averagine the ages of people in each city, then averaging those
> averages for each state and calling that the average age per state -- it's
> a much less accurate representation of reality then averaging the ages of
> everyone in the state all at once.
>
>
>
> -Hoss
>
>

Re: Progressive Query Relaxation

Posted by Chris Hostetter <ho...@fucit.org>.
: Agreed, but best match is not ONLY about keywords. Here is where the
: system developer can provide extra intelligence by doing query
: re-writing.

I finally got a chance to read through the URL (disclaimer: i do not have
"a basic working knowledge of Oracle Text, such as the operators used in
query expressions.")

At it's core what is being described here can easily be done with a custom
request handler that takes in a multivalue "q" param, and executes them in
order until it finds some matches ... careful math when dealing start/rows
and the number of results from each query make it easy to ensure that you
can seemlessly return results from any/all queries in the order described
(allthough you'd have to do something funky with the raw score values if
you actually wanted to return them to the client)

In general though, I agree with Walter ... this seems like a very naive
approach.  At a very low conceptually level, The DisMaxRequestHandler does
what the early counter example in the link talks about...

>>  select book_id from books
>>      where contains (author, '(michel crichton) OR (?michel ?crichton)
>>      OR (michel OR crichton) OR (?michel OR ?crichton)

the problem is that the two critisism of this appraoch (which may be valid
in Oracle text matching) don't really apply in Solr/Lucene...

>>   1.  From the user's point of view, hits which are a poor match will be
>> mixed in with hits which are a good match. The user wants to see good
>> matches displayed first.

"poor" hits won't score as high as "good" hits -- boost
values can be assigned for hte various pieces of the DisMax query so that
exact phrase matches can be weighted better then individual word matches,
coordFactors will ensure that docs only matching a few words don't score
as well as docs matching all of the words, etc...

>>   2. From the system's point of view, the search is inefficient. Even if
>> there were plenty of hits for exactly "Michel Crichton", it would still
>> have to do all the work of the fuzzy expansions and fetch data for all the
>> rows which satisfy the query.

My problem with this claim is the assumption that once you find lots of
hits for "Michel Crichton" you don't need to keep looking for "Michel" or
"Crichton" ... by this logic, many docs that contain the exact phrase
"Michel Crichton" (and are roughly the same length) will get the same
score, and the query will stop there ... the benefit of looking for
8everything* as a single query, is that the scores can become more fine
grained -- docs with 1 exact match that *also* contain things like "Mr
Crichton" several dozen times will score higher then docs with just that
one exact match (cosider an article about "Michel Crichton" in which his
full name appears only once vs an article listing popular authors, in
which "Michel Crichton" appears exactly once)

: Why do you say this? The rank is still provided by the search engine
: BASED ON THE QUERY submitted and it does consider natural language
: text. It's just leaving the order of execution in the hands of the
: developer who knows better what the system should return for some
: specific cases.

evaluating each of the query parts in isolation and then aggregating the
results doesn't take into account the *cumulative* value of the parts ...
it's like averagine the ages of people in each city, then averaging those
averages for each state and calling that the average age per state -- it's
a much less accurate representation of reality then averaging the ages of
everyone in the state all at once.



-Hoss


Re: Progressive Query Relaxation

Posted by Walter Underwood <wu...@netflix.com>.
On 4/10/07 10:38 AM, "J. Delgado" <jd...@lendingclub.com> wrote:

> I think you have something personal against Oracle... Hey I have no
> interest in defending Oracle, but this no hack.

It's true, I don't have much respect for Oracle's text search.
When I was working on enterprise search, we never really worried
about them because their quality and speed just wasn't competitive.
I do not look to them as a reliable source of good ideas for search.

Oracle's problem statement has a plausible strawman, but there are
lots of better ways to deal with misspellings. Heck, my dev instance
of Solr gives Michael Crichton as the first hit for "Michel Crichton".
It is not true that "hits which are a poor match will be mixed in
with hits which are a good match."

Hmmm, "Crichton" is much more likely to be misspelled than "Michael",
so maybe their strawman isn't very good.

wunder


Re: Progressive Query Relaxation

Posted by "J. Delgado" <jd...@lendingclub.com>.
See my comments below.

2007/4/10, Walter Underwood <wu...@netflix.com>:
> On 4/10/07 10:06 AM, "J. Delgado" <jd...@lendingclub.com> wrote:
>
> > Progressive relaxation, at least as Oracle has defined it, is a
> > flexible, developer defined series of queries that are efficiently
> > executed in progression and in one trip to the engine, until minimum
> > of hits required is satisfied. It is not a self adapting precision
> > scheme nor it tries to guess what is the best match.
>
> Correct. Search engines are all about the best match. Why would
> you show anything else?

Agreed, but best match is not ONLY about keywords. Here is where the
system developer can provide extra intelligence by doing query
re-writing.

>
> This is an RDBMS flavored approach, not an approach that considers
> natural language text.

Why do you say this? The rank is still provided by the search engine
BASED ON THE QUERY submitted and it does consider natural language
text. It's just leaving the order of execution in the hands of the
developer who knows better what the system should return for some
specific cases.

> Sets of matches, not a ranked list. It fails
> as soon as one of the sets gets too big, like when someone searches
> for "laserjet" at HP.com. That happens a lot.

Nope...we are talking about the same thing: a ranked list, and all the
other cool stuff regarding automatic query expansion, hit list
clustering/faceted search, etc have solve the "laserjet" problem you
mentioned above.

>
> It assumes that all keywords are the same, something that Gerry
> Salton figured out was false thirty years ago. That is why we
> use tf.idf instead of sets of matches.

I'm totally with you. Oracle Text uses TF.IDF as well :-)

>
> I see a lot of design without any talk about what problem they are
> solving. What queries don't work? How do we make those better?
> Let's work from real logs and real data. Oracle's hack doesn't
> solve any problem I've see in real query logs.
>

I think you have something personal against Oracle... Hey I have no
interest in defending Oracle, but this no hack. It has its place for
certain applications. I'm not in favor on using Oracle Text, all I
asked was if this feature was available in Solr/Lucene because I think
it would be useful.

> I'm doing e-commerce search, and our current engine does pretty
> much what Oracle is offering. The results are not good, and we
> are replacing it with Solr and DisMax. My off-line relevance testing
> shows a big improvement.

Yep. One thing we agree on (that Netflix's engine's result is not
good). In any case, I think moving to Sorl and DisMax is a great idea
and should improve relvance. I also think that in some cases having
control of the queries that are expanded and executing them
progressively is the right way to go. For example , Nutch implements a
pretty sofisticated query rewrite in hopes of improving the relevance
ranking for their users. I think the results can be computed more
efficently if they whole query does not need to be evaluated, but just
enough of it that will return the required number of results.

Joaquin Delgado, PhD

>
> wunder
> --
> Search Guru, Netflix
>
>
>

Re: Progressive Query Relaxation

Posted by Walter Underwood <wu...@netflix.com>.
On 4/10/07 10:06 AM, "J. Delgado" <jd...@lendingclub.com> wrote:

> Progressive relaxation, at least as Oracle has defined it, is a
> flexible, developer defined series of queries that are efficiently
> executed in progression and in one trip to the engine, until minimum
> of hits required is satisfied. It is not a self adapting precision
> scheme nor it tries to guess what is the best match.

Correct. Search engines are all about the best match. Why would
you show anything else?

This is an RDBMS flavored approach, not an approach that considers
natural language text. Sets of matches, not a ranked list. It fails
as soon as one of the sets gets too big, like when someone searches
for "laserjet" at HP.com. That happens a lot.

It assumes that all keywords are the same, something that Gerry
Salton figured out was false thirty years ago. That is why we
use tf.idf instead of sets of matches.

I see a lot of design without any talk about what problem they are
solving. What queries don't work? How do we make those better?
Let's work from real logs and real data. Oracle's hack doesn't
solve any problem I've see in real query logs.

I'm doing e-commerce search, and our current engine does pretty
much what Oracle is offering. The results are not good, and we
are replacing it with Solr and DisMax. My off-line relevance testing
shows a big improvement.

wunder
--
Search Guru, Netflix



Re: Progressive Query Relaxation

Posted by "J. Delgado" <jd...@lendingclub.com>.
It looks only a handful of people actually looked at the link
provided. Indeed, it is hard to let the engine "come up" with a series
of queries that range from the most restrictive to the less
restrictive and still provides the best relevance. The problem is even
worse if what is relevant for one use case is irrelevant for others,
specially if you are dealing with "mixed queries" that combine text
with structured fields (i.e. range queries and sorting in the Lucene
case).

Let's suppose that you are dealing with the e-commerce/catalog search
case where you would like to show hits that match the query submitted
by the end user first as exact phrase, then with the terms near each
other , then AND'd then fuzzy or stemming and finally OR'd on certain
fields (i.e. title, description ,etc) and additionally you want to
always try to match  hits from those that were within the price range
specified and if not enough relax second best (i.e. just above the
price range) until you get a max of 500 hits. All this defines exactly
in what order you want to show/rank results! How many queries do you
have to build and send to Solr/Lucene to achieve this?

Progressive relaxation, at least as Oracle has defined it, is a
flexible, developer defined series of queries that are efficiently
executed in progression and in one trip to the engine, until minimum
of hits required is satisfied. It is not a self adapting precision
scheme nor it tries to guess what is the best match. This approach is
however very powerful (as powerful as the queries that are submitted)
and leaves the developer with the choice of controlling what queries
to execute and in which order. I don't think  DisMax does this.

-- Joaquin


2007/4/10, Walter Underwood <wu...@netflix.com>:
> From the name, I thought this was an adaptive precision scheme,
> where the engine automatically tries broader matching if there
> are no matches or just a few. We talked about doing that with
> Ultraseek, but it is pretty tricky. Deciding when to adjust it is
> harder than making it variable.
>
> Instead, this is an old idea that search amateurs seem to like.
> Show all exact matches, then near matches, etc. This is the
> kind of thing people suggest when they don't understand that
> a ranking algorithm combines that evidence in a much more
> powerful way. I talked customers out of this once or twice
> each year at Ultraseek.
>
> This approach fails for:
>
> * common words
> * misspellings
>
> Since both of those happen a lot, this idea fails for a lot
> of queries.
>
> I presume that Oracle implemented this to shut up some big customer,
> since it isn't a useful feature unless it closes a sale.
>
> DisMax gives you something somewhat similar to this, by
> selecting the best matching field. That is much more powerful
> and gives much better results.
>
> wunder
>
> On 4/9/07 12:46 AM, "J. Delgado" <jo...@gmail.com> wrote:
>
> > Has anyone within the Lucene or Solr community attempted to code a
> > progressive query relaxation technique similar to the one described
> > here for Oracle Text?
> > http://www.oracle.com/technology/products/text/htdocs/prog_relax.html
> >
> > Thanks,
> >
> > -- J.D.
>
>

Re: Progressive Query Relaxation

Posted by Walter Underwood <wu...@netflix.com>.
>From the name, I thought this was an adaptive precision scheme,
where the engine automatically tries broader matching if there
are no matches or just a few. We talked about doing that with
Ultraseek, but it is pretty tricky. Deciding when to adjust it is
harder than making it variable.

Instead, this is an old idea that search amateurs seem to like.
Show all exact matches, then near matches, etc. This is the
kind of thing people suggest when they don't understand that
a ranking algorithm combines that evidence in a much more
powerful way. I talked customers out of this once or twice
each year at Ultraseek.

This approach fails for:

* common words
* misspellings

Since both of those happen a lot, this idea fails for a lot
of queries.

I presume that Oracle implemented this to shut up some big customer,
since it isn't a useful feature unless it closes a sale.

DisMax gives you something somewhat similar to this, by
selecting the best matching field. That is much more powerful
and gives much better results.

wunder

On 4/9/07 12:46 AM, "J. Delgado" <jo...@gmail.com> wrote:

> Has anyone within the Lucene or Solr community attempted to code a
> progressive query relaxation technique similar to the one described
> here for Oracle Text?
> http://www.oracle.com/technology/products/text/htdocs/prog_relax.html
> 
> Thanks,
> 
> -- J.D.