Posted to dev@lucy.apache.org by Moritz Lenz <mo...@faui2k3.org> on 2011/08/21 21:34:58 UTC

[lucy-dev] Automagic phrase search

Hi,

in the past I've often wondered why "home made" search engines on
websites (often even on really big sites) feel so inferior to the big
names in the search business like Google, Yahoo, Bing and DuckDuckGo.

One of my conclusions is that Google & co. usually treat the order of
search terms as an important indicator for relevance, while most other
search engines don't.

For example, if you enter the words 'search engine' (without the quotes),
you'll mostly get exact matches first, as if you had really
searched for '"search engine"'.
Google goes even further: if you search for a number of words in a row,
it ranks documents higher that have most but not necessarily all of the
search words in the right order next to each other, even if there are
one or two other words in between.

Long rambling, short message: I'd love to have a mechanism in lucy to
provide an automagic phrase search as above, which honors the order of
search words even outside an explicit phrase search, and less
restrictively than an explicit phrase search.

Is something like that already implemented, and if not, is it on any agenda?
I know far too little about Lucy's internal workings to know if that's
easy or even possible, but for me it would be a real killer feature.
If somebody points me in the right direction on where to start I might
even give it a try, though my C fu is mediocre at best.

Cheers,
Moritz

Re: [lucy-dev] Automagic phrase search

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Aug 21, 2011 at 07:02:18PM -0700, Nathan Kurz wrote:
> If you're willing to do some spelunking, this thread where Marvin
> quoted the same excerpt might be relevant:
> http://www.rectangular.com/pipermail/kinosearch/2006-May/005012.html

Haha, I even used the word "seminal". X-D

A lot has changed since 2006 -- we now have ProximityQuery, a subclassable
QueryParser, and expanded tutorial and cookbook documentation.  I'm glad I was
able to give Moritz a better answer today than I was able to give you back
then.

> Personally, while proximity search is one approach, I don't see it as
> essential.  I think that weighting of full and partial phrase searches
> is the more important part, with proximity being just one of these
> weighted variables.

There are many approaches and many techniques... but since Peter contributed
ProximityQuery, I think that's become the easiest one to explore -- especially
since Moritz will be able to rapid-prototype a custom QueryParser in the
language where he is most comfortable.

Marvin


Re: [lucy-dev] Automagic phrase search

Posted by Nathan Kurz <na...@verse.com>.
On Sun, Aug 21, 2011 at 5:29 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sun, Aug 21, 2011 at 09:34:58PM +0200, Moritz Lenz wrote:
>> One of my conclusions is that Google & co. usually treat the order of
>> search terms as an important indicator for relevance, while most other
>> search engines don't.
>
> Yep.  A technique was described in section 4.5.1 of the seminal Brin/Page 1998
> paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine":

If you're willing to do some spelunking, this thread where Marvin
quoted the same excerpt might be relevant:
http://www.rectangular.com/pipermail/kinosearch/2006-May/005012.html

Personally, while proximity search is one approach, I don't see it as
essential.  I think that weighting of full and partial phrase searches
is the more important part, with proximity being just one of these
weighted variables.  For example, I would want a search for "one two
three four" to rank "one two [unrelated text] three four" much higher
than "four three two one [unrelated text]", even if the unrelated text
is quite long.
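
The ranking preference in that example can be sketched in plain Perl. This is only an illustration, not anything Lucy implements: the function name order_score, the two weights, and the whitespace tokenization are all assumptions made for the sketch. It gives full credit to each pair of consecutive query terms that appears adjacent and in order, partial credit when the pair appears in order with other words in between, and nothing when the order is reversed.

```perl
use strict;
use warnings;

# Hypothetical weights: full credit for an adjacent in-order pair,
# partial credit for an in-order pair separated by other words.
my $ADJACENT_WEIGHT = 1.0;
my $ORDERED_WEIGHT  = 0.5;

sub order_score {
    my ( $query, $document ) = @_;

    # Naive whitespace tokenization; a real analyzer would do more.
    my @terms  = split ' ', lc $query;
    my @tokens = split ' ', lc $document;

    # Record every position at which each token occurs.
    my %positions;
    push @{ $positions{ $tokens[$_] } }, $_ for 0 .. $#tokens;

    my $score = 0;
    for my $i ( 0 .. $#terms - 1 ) {
        my $first  = $positions{ $terms[$i] }       or next;
        my $second = $positions{ $terms[ $i + 1 ] } or next;

        # Smallest positive gap from an occurrence of the first term
        # to a later occurrence of the second term.
        my $best_gap;
        for my $p (@$first) {
            for my $q (@$second) {
                my $gap = $q - $p;
                next if $gap <= 0;
                $best_gap = $gap
                    if !defined $best_gap || $gap < $best_gap;
            }
        }
        next unless defined $best_gap;
        $score += $best_gap == 1 ? $ADJACENT_WEIGHT : $ORDERED_WEIGHT;
    }
    return $score;
}

my $query = 'one two three four';
print order_score( $query, 'one two lorem ipsum dolor three four' ), "\n";
print order_score( $query, 'four three two one lorem ipsum dolor' ), "\n";
```

Under these assumed weights the first document scores 2.5 and the reversed one scores 0, however long the unrelated text is.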

My thoughts are still close to these:
http://www.rectangular.com/pipermail/kinosearch/2007-June/004204.html

Nathan Kurz
nate@verse.com

Re: [lucy-dev] Automagic phrase search

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 8/21/11 7:29 PM:

> 
> What would be really useful is a query parser which automatically generates
> query structures like the one built up manually above.  
> 
> If that interests you, I suggest going through the Lucy tutorial if you
> haven't already, as the Lucy::Docs::Tutorial::QueryObjects chapter contains
> relevant material, then checking out Lucy::Docs::Cookbook::CustomQueryParser
> and Lucy::Docs::Cookbook::CustomQuery.
> 

Also look at Search::Query::Dialect::Lucy on CPAN, as that already supports
Proximity natively and includes an ignore_order_in_proximity feature for
toggling the "strictness" of the proximity. It might also be easier to
subclass and extend if you don't already know Parse::RecDescent.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Automagic phrase search

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Aug 21, 2011 at 09:34:58PM +0200, Moritz Lenz wrote:
> One of my conclusions is that Google & co. usually treat the order of
> search terms as an important indicator for relevance, while most other
> search engines don't.

Yep.  A technique was described in section 4.5.1 of the seminal Brin/Page 1998
paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine":

    http://infolab.stanford.edu/~backrub/google.html

    For a multi-word search, the situation is more complicated. Now multiple
    hit lists must be scanned through at once so that hits occurring close
    together in a document are weighted higher than hits occurring far apart.
    The hits from the multiple hit lists are matched up so that nearby hits
    are matched together. For every matched set of hits, a proximity is
    computed. The proximity is based on how far apart the hits are in the
    document (or anchor) but is classified into 10 different value "bins"
    ranging from a phrase match to "not even close".  Counts are computed not
    only for every type of hit but for every type and proximity. Every type
    and proximity pair has a type-prox-weight. The counts are converted into
    count-weights and we take the dot product of the count-weights and the
    type-prox-weights to compute an IR score.

Search query strings entered by humans generally carry positional information
and it is a shame to throw it away.
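
To make the binning idea in the excerpt concrete, here is a minimal plain-Perl sketch. The excerpt does not give bin boundaries or weights, so @BIN_UPPER, @BIN_WEIGHT, and the count cap $MAX_COUNT are invented for illustration. The sketch also handles only two terms and a single hit type, whereas the paper keeps a weight for every type and proximity pair.

```perl
use strict;
use warnings;
use List::Util qw( min sum );

# Hypothetical upper bounds (in token positions) for bins 0..8;
# any larger distance falls into bin 9, "not even close".
my @BIN_UPPER = ( 1, 2, 3, 5, 8, 16, 32, 64, 128 );

# Hypothetical weight per bin, from phrase match down to no credit.
my @BIN_WEIGHT = ( 10, 8, 6, 5, 4, 3, 2, 1, 0.5, 0 );

# Hypothetical cap so that many hits in one bin stop adding score.
my $MAX_COUNT = 5;

sub proximity_bin {
    my ($distance) = @_;
    for my $i ( 0 .. $#BIN_UPPER ) {
        return $i if $distance <= $BIN_UPPER[$i];
    }
    return scalar @BIN_UPPER;
}

# For each position of the first term, find the nearest position of
# the second term and count the resulting distance in its bin.
sub bin_counts {
    my ( $pos_a, $pos_b ) = @_;
    my @counts = (0) x @BIN_WEIGHT;
    return \@counts unless @$pos_b;
    for my $p (@$pos_a) {
        my $nearest = min map { abs( $_ - $p ) } @$pos_b;
        $counts[ proximity_bin($nearest) ]++;
    }
    return \@counts;
}

# Dot product of the capped counts and the per-bin weights.
sub ir_score {
    my ($counts) = @_;
    return sum map { min( $counts->[$_], $MAX_COUNT ) * $BIN_WEIGHT[$_] }
        0 .. $#BIN_WEIGHT;
}

# 'foo' at positions 3 and 40, 'bar' at positions 4 and 300.
my $counts = bin_counts( [ 3, 40 ], [ 4, 300 ] );
print ir_score($counts), "\n";
```

With these made-up numbers, the hit at distance 1 lands in bin 0 and the hit at distance 36 lands in bin 7, giving a score of 11.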

> I'd love to have a mechanism in lucy to provide an automagic phrase search
> as above, which honors the order of search words even outside an explicit
> phrase search, and less restrictively than an explicit phrase search.

You can get something close to that in Lucy by augmenting ordinary searches
with parallel proximity queries. 

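    use Lucy::Search::TermQuery;
    use Lucy::Search::ORQuery;
    use LucyX::Search::ProximityQuery;

    # Ordinary search: match documents containing either term.
    my $foo_query = Lucy::Search::TermQuery->new(
        field => 'content',
        term  => 'foo',
    );
    my $bar_query = Lucy::Search::TermQuery->new(
        field => 'content',
        term  => 'bar',
    );
    my $foo_or_bar_query = Lucy::Search::ORQuery->new(
        children => [ $foo_query, $bar_query ],
    );

    # Parallel proximity clause: also matches when 'foo' and 'bar'
    # occur within 20 positions of each other.
    my $foo_near_bar_query = LucyX::Search::ProximityQuery->new(
        field  => 'content',
        terms  => [qw( foo bar )],
        within => 20,
    );

    # OR the two together so that the proximity clause adds to the score
    # without excluding documents that match only the ordinary search.
    my $top_level_query = Lucy::Search::ORQuery->new(
        children => [ $foo_or_bar_query, $foo_near_bar_query ],
    );

    # $searcher is assumed to be an existing Lucy::Search::IndexSearcher.
    my $hits = $searcher->hits( query => $top_level_query );
    ...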

> Is something like that already implemented, and if not, is it on any agenda?

I tried a while back to work this into Lucy at a low level, optimizing both
index data structures and search time object hierarchies to support it.
Ultimately, I pulled that code out because the path I'd chosen wasn't going to
make automatic proximity support feasible without negatively impacting
ordinary searching.

One remnant of that attempt is Lucy::Plan::Architecture.  Part of the
motivation for allowing arbitrary index structures via Architecture
was to facilitate future experimentation with automatic proximity support.

> I know far too little about Lucy's internal workings to know if that's
> easy or even possible, but for me it would be a real killer feature.
> If somebody points me in the right direction on where to start I might
> even give it a try, though my C fu is mediocre at best.

Can you work with Parse::RecDescent?

What would be really useful is a query parser which automatically generates
query structures like the one built up manually above.  

If that interests you, I suggest going through the Lucy tutorial if you
haven't already, as the Lucy::Docs::Tutorial::QueryObjects chapter contains
relevant material, then checking out Lucy::Docs::Cookbook::CustomQueryParser
and Lucy::Docs::Cookbook::CustomQuery.
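
As a rough starting point for such a parser, the sketch below turns a whitespace-separated query string into a nested description of an "OR of terms, OR'd with a proximity clause" structure. It is plain Perl and does not call Lucy while building the plan. The function names proximity_plan and instantiate and the plan layout are invented for this illustration; the class names and constructor arguments are the ones used in the manual example in this message. A real implementation would more likely subclass Lucy::Search::QueryParser as the cookbook describes.

```perl
use strict;
use warnings;

# Build a plan: each node names a class and its constructor arguments.
sub proximity_plan {
    my (%args) = @_;
    my @terms = split ' ', $args{query_string};

    my @term_nodes = map {
        +{
            class => 'Lucy::Search::TermQuery',
            args  => { field => $args{field}, term => $_ },
        }
    } @terms;
    my $any_term = {
        class => 'Lucy::Search::ORQuery',
        args  => { children => \@term_nodes },
    };

    # A proximity clause only makes sense for two or more terms.
    return $any_term if @terms < 2;

    my $near = {
        class => 'LucyX::Search::ProximityQuery',
        args  => {
            field  => $args{field},
            terms  => [@terms],
            within => $args{within},
        },
    };
    return {
        class => 'Lucy::Search::ORQuery',
        args  => { children => [ $any_term, $near ] },
    };
}

# Turn a plan into objects. Assumes the named classes have already
# been loaded; it is not exercised in this sketch.
sub instantiate {
    my ($node) = @_;
    my %args = %{ $node->{args} };
    $args{children} = [ map { instantiate($_) } @{ $args{children} } ]
        if $args{children};
    return $node->{class}->new(%args);
}

my $plan = proximity_plan(
    query_string => 'foo bar',
    field        => 'content',
    within       => 20,
);
print $plan->{args}{children}[1]{class}, "\n";
```

Keeping the plan as plain data makes it easy to inspect and test the generated structure before any index is involved.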

Cheers,

Marvin Humphrey