Posted to user@lucy.apache.org by goran kent <go...@gmail.com> on 2011/11/09 16:38:08 UTC

[lucy-user] Highlighting problem with latest trunk

Heeello,

I'm either going barking mad, or there really is a problem (I've just
had a drinkiepoo for medicinal purposes, so anything is possible).

I started seeing weird highlighting after updating to the latest trunk
(at least, that's when I think it started).

Revision: 1199738:  highlighting has gone mad
Revision: 1167124:  highlighting is sane

Searching for a single term [email], results in words adjacent to
[email] being highlighted, but not always.

What say the overlords?

Re: [lucy-user] Highlighting problem with latest trunk

Posted by goran kent <go...@gmail.com>.
On Wed, Nov 9, 2011 at 6:09 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> https://issues.apache.org/jira/browse/LUCY-182?focusedCommentId=13147127&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13147127

okay, that's a relief.  I thought I'd broken something.

Re: [lucy-user] Highlighting problem with latest trunk

Posted by goran kent <go...@gmail.com>.
On Fri, Nov 11, 2011 at 12:53 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> Correct, that is the most important factor.  There are likely a lot of remote
> doc_freq() calls bouncing around, and those calls are being executed serially
> by PolySearcher.
>
> Highlighter requires a weighted query -- in Lucy-speak, a
> Lucy::Search::Compiler[1] -- in order to determine both which parts of the
> field matched and how much they contributed to the score.  In order to weight
> a query, you need to know how common each term is, so that in a search for
> 'the metamorphosis' the term 'metamorphosis' contributes more to the score
> than the term 'the'.  In the context of Highlighter, we need to know that
> 'metamorphosis' is more important than 'the' so that we can prefer
> selections which contain 'metamorphosis' over selections which contain 'the'
> when choosing an excerpt.
>
> However, having Highlighter weight the query means duplicated work, because
> the main Searcher has to perform exactly the same weighting routine prior to
> asking the remote nodes to score results.
>
> We can eliminate the duplicated effort by performing the weighting manually
> and supplying the weighted query object to both Searcher#hits and
> Highlighter#new.  (You'll need the latest trunk because this sample code
> requires a patch from LUCY-188 I committed earlier today.)
>
>    my $query    = $query_parser->parse($query_string);
>    my $compiler = $query->make_compiler(searcher => $searcher);
>    my $hits     = $searcher->hits(query => $compiler);
>    my $highlighter = Lucy::Highlight::Highlighter->new(
>        query    => $compiler,
>        searcher => $searcher,
>        field    => 'content',
>    );
>
> You may be able to cut down those remote doc_freq() calls further by using a
> QueryParser which has the minimum possible number of fields.  I can go into
> depth on that in another email if you like.
>
> Marvin Humphrey
>
> [1] It's called a "Compiler" because its primary role is to compile a Query
>    to a Matcher.  Nobody likes the name, but we haven't achieved consensus on
>    what to do about it.

Crikey, I'm so accustomed to the oftentimes terse replies from FOSS
devs that receiving such detailed and obviously
time-consuming-to-compile replies leaves me staring at the headlights
dumbly;  and I dribbled a bit there.

ok, back to Lucy - that would explain the herds of mysterious
doc_freq's I was seeing in my remote debug prints.  At the time, it
was like, "what the hell?"  The dim light of comprehension blinks on,
well, dimly.

Let me chew on this stuff a bit and I'll report back with some results.

Re: [lucy-user] Highlighting problem with latest trunk

Posted by goran kent <go...@gmail.com>.
On Fri, Nov 11, 2011 at 12:53 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> We can eliminate the duplicated effort by performing the weighting manually
> and supplying the weighted query object to both Searcher#hits and
> Highlighter#new.  (You'll need the latest trunk because this sample code
> requires a patch from LUCY-188 I committed earlier today.)

Wow, the logs don't lie - from 8s to <1s.

I think a few slaps on the back are in order!

new
2011-11-11 09:49:37: 29208: 133,158,641 hits in 20.96s
2011-11-11 09:49:37: 29208: pre-while-hits-next

old
2011-11-10 14:12:22: 18171: 133,158,641 hits in 20.82s
2011-11-10 14:12:30: 18171: pre-while-hits-next

Thank you so much for that vitally important improvement.
Now don't be too shocked by that 20s search response time...  I keep
reminding myself to always divide by $cluster_nodes, then I feel
better, knowing the solution is nigh :D

Re: [lucy-user] Highlighting problem with latest trunk

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Nov 10, 2011 at 11:31:31AM +0200, goran kent wrote:
> On 11/10/11, goran kent <go...@gmail.com> wrote:
> > Completing each new() requires 4s *each*.  Somehow I don't recall this
> > being the case before :/
> 
> The 4s x 2 penalty is probably related to the remote searching (of
> which highlighting is a part) doing things serially and not
> concurrently, and not this particular bug, no?

Correct, that is the most important factor.  There are likely a lot of remote
doc_freq() calls bouncing around, and those calls are being executed serially
by PolySearcher.

Highlighter requires a weighted query -- in Lucy-speak, a
Lucy::Search::Compiler[1] -- in order to determine both which parts of the
field matched and how much they contributed to the score.  In order to weight
a query, you need to know how common each term is, so that in a search for
'the metamorphosis' the term 'metamorphosis' contributes more to the score
than the term 'the'.  In the context of Highlighter, we need to know that
'metamorphosis' is more important than 'the' so that we can prefer
selections which contain 'metamorphosis' over selections which contain 'the'
when choosing an excerpt.
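
To make the weighting idea concrete, here is a minimal sketch in plain Perl.
It uses a textbook inverse-document-frequency formula, which is an assumption
for illustration and not necessarily what Lucy computes internally. The
collection size and doc_freq figures are invented.

```perl
use strict;
use warnings;

# Hypothetical collection statistics, invented for illustration only.
my $total_docs = 1_000_000;
my %doc_freq   = (
    the           => 950_000,
    metamorphosis => 1_200,
);

# A common inverse-document-frequency weight: the rarer the term, the
# larger the weight.  Assumed formula, not taken from the Lucy source.
sub idf {
    my ( $freq, $total ) = @_;
    return 1 + log( $total / ( 1 + $freq ) );
}

my %weight = map { $_ => idf( $doc_freq{$_}, $total_docs ) } keys %doc_freq;

# Print terms from most to least important.
for my $term ( sort { $weight{$b} <=> $weight{$a} } keys %weight ) {
    printf "%-14s doc_freq=%7d  weight=%.3f\n",
        $term, $doc_freq{$term}, $weight{$term};
}
```

Under these made-up numbers 'metamorphosis' gets a much larger weight than
'the', which is why an excerpt containing it would be preferred. Computing
the weight needs doc_freq for every term, and that is where the remote calls
come from.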

However, having Highlighter weight the query means duplicated work, because
the main Searcher has to perform exactly the same weighting routine prior to
asking the remote nodes to score results.

We can eliminate the duplicated effort by performing the weighting manually
and supplying the weighted query object to both Searcher#hits and
Highlighter#new.  (You'll need the latest trunk because this sample code
requires a patch from LUCY-188 I committed earlier today.)

    my $query    = $query_parser->parse($query_string);
    my $compiler = $query->make_compiler(searcher => $searcher);
    my $hits     = $searcher->hits(query => $compiler);
    my $highlighter = Lucy::Highlight::Highlighter->new(
        query    => $compiler,
        searcher => $searcher,
        field    => 'content',
    );

You may be able to cut down those remote doc_freq() calls further by using a
QueryParser which has the minimum possible number of fields.  I can go into
depth on that in another email if you like.
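
A back-of-envelope sketch of why both suggestions help. It assumes a
simplified cost model in which every weighting pass issues one doc_freq()
request per term, per field, per remote node. That model, the helper
doc_freq_calls(), and all the numbers are hypothetical, not measurements.

```perl
use strict;
use warnings;

# Assumed cost model: passes x terms x fields x nodes remote requests,
# all executed serially.  Hypothetical helper, not part of Lucy.
sub doc_freq_calls {
    my (%args) = @_;
    return $args{passes} * $args{terms} * $args{fields} * $args{nodes};
}

# Invented setup: a two-term query, two searched fields, ten remote nodes.
my %setup = ( terms => 2, fields => 2, nodes => 10 );

# The search and two Highlighters each weight the query on their own.
my $before = doc_freq_calls( %setup, passes => 3 );

# One make_compiler() call shared by the search and both Highlighters.
my $shared = doc_freq_calls( %setup, passes => 1 );

# Shared compiler plus a QueryParser restricted to a single field.
my $one_field = doc_freq_calls( %setup, passes => 1, fields => 1 );

printf "separate weighting: %d calls\n", $before;
printf "shared compiler:    %d calls\n", $shared;
printf "single field:       %d calls\n", $one_field;
```

The absolute counts mean little, but the ratios show the shape of the saving:
sharing one compiler removes the repeated passes, and trimming fields shrinks
each remaining pass.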

Marvin Humphrey

[1] It's called a "Compiler" because its primary role is to compile a Query
    to a Matcher.  Nobody likes the name, but we haven't achieved consensus on
    what to do about it.


Re: [lucy-user] Highlighting problem with latest trunk

Posted by goran kent <go...@gmail.com>.
On 11/10/11, goran kent <go...@gmail.com> wrote:
> Completing each new() requires 4s *each*.  Somehow I don't recall this
> being the case before :/

The 4s x 2 penalty is probably related to the remote searching (of
which highlighting is a part) doing things serially and not
concurrently, and not this particular bug, no?

If so, this will probably go away when that gets addressed.

Re: [lucy-user] Highlighting problem with latest trunk

Posted by goran kent <go...@gmail.com>.
On 11/9/11, Marvin Humphrey <ma...@rectangular.com> wrote:
>> Searching for a single term [email], results in words adjacent to
>> [email] being highlighted, but not always.
>
> https://issues.apache.org/jira/browse/LUCY-182?focusedCommentId=13147127&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13147127

Just documenting an extra bit which might help towards debugging the above.

This bug might also account for the performance problem I've
encountered with setting up the highlighter objects - or is that not
related?

For example, I'm highlighting on title and body:

    my $body_highlighter = Lucy::Highlight::Highlighter->new(
        searcher       => $poly_searcher,
        query          => $query,
        field          => 'body',
        excerpt_length => 190,
    );

    my $title_highlighter = Lucy::Highlight::Highlighter->new(
        searcher       => $poly_searcher,
        query          => $query,
        field          => 'title',
        excerpt_length => 75,
    );

...basic stuff.

Completing each new() requires 4s *each*.  Somehow I don't recall this
being the case before :/

I'm happy to help by trying out test patches if you need - just chuck
'em my way.

cheers

Re: [lucy-user] Highlighting problem with latest trunk

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 09, 2011 at 05:38:08PM +0200, goran kent wrote:
> I started seeing weird highlighting after updating to the latest trunk
> (at least, that's when I think it started).
> 
> Revision: 1199738:  highlighting has gone mad
> Revision: 1167124:  highlighting is sane
> 
> Searching for a single term [email], results in words adjacent to
> [email] being highlighted, but not always.

https://issues.apache.org/jira/browse/LUCY-182?focusedCommentId=13147127&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13147127

I've got multiple top priorities right now, sigh.

Marvin Humphrey


Re: [lucy-user] Highlighting problem with latest trunk

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 09, 2011 at 05:38:08PM +0200, goran kent wrote:
> Searching for a single term [email], results in words adjacent to
> [email] being highlighted, but not always.

I've posted a diagnosis of this problem to lucy-dev:

http://markmail.org/message/3sijzyap6ohayihh

Marvin Humphrey