Posted to issues@lucy.apache.org by "Henry (Created) (JIRA)" <ji...@apache.org> on 2011/12/13 15:47:30 UTC

[lucy-issues] [jira] [Created] (LUCY-199) Highlighting/excerpt on URLs

Highlighting/excerpt on URLs 
-----------------------------

                 Key: LUCY-199
                 URL: https://issues.apache.org/jira/browse/LUCY-199
             Project: Lucy
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.2.2 (incubating)
         Environment: Linux
            Reporter: Henry


If I explicitly specify excerpt_length:

my $hl             = Lucy::Highlight::Highlighter->new(
   searcher       => $searcher,
   query          => $query_compiler,
   field          => 'site',
   excerpt_length => 60,
);

...and the field content is longer than 60, then

$hl->create_excerpt($hit);

returns '...'.

Content which is shorter than 60 returns the highlighted excerpt as expected.

If I comment out "excerpt_length => 60," above, then it returns the full
non-truncated excerpt with highlighting as expected.

Some >60char samples which return &#8230;/"...", searching for [iol.co.za] or
[news24.com] (brackets are mine):

[www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220]
[http://www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSiteHome/0,,,00.html]
[www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSite_TravelClub/0,,,00.html]

The following return double-ellipses ("......" - &#8230;&#8230;), searching
for [adsl mweb.com]:

[http://www.mweb.co.za/helpcentre/ADSL/ADSLGeneralIdisagreewithyourusagereport.aspx]
[http://www.mweb.co.za/helpcentre/FrequentlyAskedQuestions/MWEBHelpCentreFAQsHowdoI/FAQHowdoIHowdoImigratemyADSL/tabid/661/Default.aspx]



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [lucy-dev] Highlighter excerpt boundaries

Posted by Peter Karman <pe...@peknet.com>.

On 1/19/12 6:52 PM, Marvin Humphrey wrote:
>
> It's rare that we need to optimize for performance.  Most of the time we
> should be optimizing for maintainability.

+1

> I suspect that at some point we will want to expose sentence boundary
> detection via a public API, because people who subclass Highlighter may want
> to use it.

+1 here too.

I have been putting some work into sentence boundary detection in 
Search::Tools, and I would love to see some thinking amongst the bright 
people here about how best to do it.

>
> It seems to me that publishing UAX #29 sentence boundary detection via an
> Analyzer is a conservative API extension, since it's so closely related to the
> UAX #29 word boundary detection we expose via StandardTokenizer.
>
> So that explains what I was thinking.  But of course refactoring sentence
> boundary detection into a string utility function also achieves the end of
> cleaning up Highlighter.c just as effectively, and might be more elegant --
> who knows?
>
> Until we actually expose this capability via a public API, either approach
> should work fine.

Agreed here too.



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Highlighter excerpt boundaries

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Jan 19, 2012 at 11:43:59AM +0100, Nick Wellnhofer wrote:
>> Not sure what I'm missing, but I don't understand the "coupling" concern.  It
>> seems to me as though it would be desirable code re-use to wrap our sentence
>> boundary detection mechanism within a battle-tested design like Analyzer,
>> rather than do something ad-hoc.
>
> The analyzers are designed to split a whole string into tokens. In the
> highlighter we only need to find a single boundary near a certain  
> position in a string. So the analyzer interface isn't an ideal fit for  
> the highlighter. The performance hit of running a tokenizer over the  
> whole substring shouldn't be a problem but I'd still like to consider  
> alternatives.

It's rare that we need to optimize for performance.  Most of the time we
should be optimizing for maintainability.

I'm advocating using Analyzer because we have several of them, and because the
parallelism between StandardTokenizer and a StandardSentenceTokenizer based on
UAX #29 would lower the cost of maintaining them.

However, that's only one way to optimize for maintainability, and it's not
necessarily the best available stratagem.  It may be that low level code
leveraging an Analyzer is verbose... or not... we'd just have to try.

>> I'm actually very excited about getting all that sentence boundary detection
>> stuff out of Highlighter.c, which will become much easier to grok and maintain
>> as a result.  Separation of concerns FTW!
>
> We could also move the boundary detection to a string utility class.

I suspect that at some point we will want to expose sentence boundary
detection via a public API, because people who subclass Highlighter may want
to use it.  Father Chrysostomos did when he wrote KSx::Highlight::Summarizer.
(The old KinoSearch Highlighter exposed a find_sentences() method at one
point.  It was a victim of the C rewrite; Highlighter was one of the harder
modules to port.)

It seems to me that publishing UAX #29 sentence boundary detection via an
Analyzer is a conservative API extension, since it's so closely related to the
UAX #29 word boundary detection we expose via StandardTokenizer.

So that explains what I was thinking.  But of course refactoring sentence
boundary detection into a string utility function also achieves the end of
cleaning up Highlighter.c just as effectively, and might be more elegant --
who knows?

Until we actually expose this capability via a public API, either approach
should work fine.

>>> Of course, it would mean implementing a separate Unicode-capable word
>>> breaking algorithm for the highlighter. But this shouldn't be very hard as
>>> we could reuse parts of the StandardTokenizer.
>>
>> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
>> It looks much better if you trim excerpts at sentence boundaries, and
>> word-break algos don't get you those.
>
> I would keep the sentence boundary detection, of course. I'm only  
> talking about the word breaking part.

Groovy, sounds like we're on the same page about that then. :)

Marvin Humphrey

Re: [lucy-dev] Highlighter excerpt boundaries

Posted by Nick Wellnhofer <we...@aevum.de>.

On 19/01/2012 03:28, Marvin Humphrey wrote:
> Phase 3 can be implemented several different ways.  It *could* reuse the
> original tokenization algo on its own, but that would produce sub-standard
> results because Lucy's tokenization algos are generally concerned with words
> rather than sentences, and excerpts chosen on word boundaries alone don't look
> very good.

You're right. I was only talking about Phase 3.

>> Such an approach wouldn't depend on the analyzer at all and it wouldn't
>> introduce additional coupling of Lucy's components.
>
> Not sure what I'm missing, but I don't understand the "coupling" concern.  It
> seems to me as though it would be desirable code re-use to wrap our sentence
> boundary detection mechanism within a battle-tested design like Analyzer,
> rather than do something ad-hoc.

The analyzers are designed to split a whole string into tokens. In the
highlighter we only need to find a single boundary near a certain 
position in a string. So the analyzer interface isn't an ideal fit for 
the highlighter. The performance hit of running a tokenizer over the 
whole substring shouldn't be a problem but I'd still like to consider 
alternatives.
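
A minimal sketch of the kind of local scan described above, written in Python for brevity rather than taken from Lucy. The function name `boundary_before` and the `max_scan` parameter are hypothetical, and "boundary" is simplified here to any transition between alphanumeric and non-alphanumeric code points:

```python
def boundary_before(text, pos, max_scan=None):
    """Scan backwards from pos for the nearest word boundary.

    A boundary is the end of the string or any index where an
    alphanumeric run starts or ends. Returns the boundary index, or
    None if nothing is found within max_scan code points.
    Only the neighbourhood of pos is inspected, not the whole string.
    """
    pos = min(pos, len(text))
    lowest = 0 if max_scan is None else max(0, pos - max_scan)
    for i in range(pos, lowest, -1):
        if i == len(text) or text[i - 1].isalnum() != text[i].isalnum():
            return i
    return None


url = "www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220"
cut = boundary_before(url, 39)
```

In this sketch the scan stops as soon as it sees the "-" before "dickens", so the cost depends on the distance to the nearest boundary rather than on the length of the field.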

> I'm actually very excited about getting all that sentence boundary detection
> stuff out of Highlighter.c, which will become much easier to grok and maintain
> as a result.  Separation of concerns FTW!

We could also move the boundary detection to a string utility class.

>> Of course, it would mean implementing a separate Unicode-capable word
>> breaking algorithm for the highlighter. But this shouldn't be very hard as
>> we could reuse parts of the StandardTokenizer.
>
> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
> It looks much better if you trim excerpts at sentence boundaries, and
> word-break algos don't get you those.

I would keep the sentence boundary detection, of course. I'm only 
talking about the word breaking part.

Nick

[lucy-dev] Highlighter excerpt boundaries

Posted by Marvin Humphrey <ma...@rectangular.com>.

(Moving this thread from the issue tracker to the dev list because it's now
about an approach rather than a specific patch...)

On Wed, Jan 18, 2012 at 10:06:41PM +0000, Nick Wellnhofer (Commented) (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188734#comment-13188734 ]

> Thinking more about a better fix for this problem, it's important to note
> that choosing a good excerpt is an operation that can be done without
> knowledge of the actual tokenization algorithm used in the indexing process.

There are multiple phases involved:

  1. Identify sections of text that contain relevant material -- i.e. that
     contributed to the search-time score of the document.
  2. Pick one contiguous chunk of text which seems to contain a lot of
     relevant material.
  3. Choose precise start and end points for the excerpt.

Phase 1 actually *does* require knowledge of the tokenization algorithm.  We
delegate creation of the HeatMap to our Query classes (technically, our
"Compiler" weighted Query classes).  They only handle granularity down to the
level of a token, so we need to provide them with a mapping of token-number =>
[start-offset,end-offset] in order to generate a HeatMap containing Spans
measured in code-point offsets; these code-point offsets are later used when
inserting highlight tags.
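
To make the token-number => [start-offset,end-offset] mapping concrete, here is a rough Python sketch. It is not Lucy's code: the names `tokenize_with_offsets` and `spans_for_matches` are hypothetical, a plain `\w+` regex stands in for the field's Analyzer, and the (offset, length, weight) tuple is only an assumed shape for a span:

```python
import re


def tokenize_with_offsets(text):
    """Return one (start, end) code-point offset pair per token.

    A \\w+ regex is used as a stand-in for the real Analyzer.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def spans_for_matches(offsets, matching_token_numbers, weight=1.0):
    """Translate matching token numbers into (offset, length, weight) spans."""
    spans = []
    for token_number in matching_token_numbers:
        start, end = offsets[token_number]
        spans.append((start, end - start, weight))
    return spans


text = "www.iol.co.za/tonight/books"
offsets = tokenize_with_offsets(text)
# Pretend the query matched tokens 1, 2 and 3 ("iol", "co", "za").
spans = spans_for_matches(offsets, [1, 2, 3])
```

The query side only needs to say "tokens 1, 2 and 3 matched"; the offset table turns that into positions where highlight tags can be inserted.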

In our present implementation, however, offset information is captured at
index-time (via HighlightWriter), so our Highlighter objects don't technically
need to know about the tokenization algo (as encapsulated in the highlight
field's Analyzer).

Phase 2 does not require knowledge of the tokenization algo.
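
As an illustration of why Phase 2 needs only the spans and not the tokenizer, here is a simplified Python sketch. The function name `best_window_start` and the (offset, length, weight) span shape are assumptions made for this example, and the scoring is deliberately cruder than anything a real highlighter would use:

```python
def best_window_start(spans, text_length, window):
    """Return the start offset of the window covering the most span weight.

    spans is a list of (offset, length, weight) tuples. Only window
    starts that coincide with a span start (or 0) are tried, which keeps
    the sketch short at the cost of optimality.
    """
    candidates = {0} | {offset for offset, _, _ in spans}
    best_start, best_score = 0, -1.0
    for start in sorted(candidates):
        # Keep the window inside the text.
        start = max(0, min(start, text_length - window))
        end = start + window
        score = sum(w for off, length, w in spans
                    if off >= start and off + length <= end)
        if score > best_score:
            best_start, best_score = start, score
    return best_start


spans = [(5, 3, 1.0), (50, 4, 1.0), (56, 4, 1.0)]
start = best_window_start(spans, text_length=100, window=20)
```

Here the window starting at offset 50 wins because it covers two spans, while the window at the start of the text covers only one.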

Phase 3 can be implemented several different ways.  It *could* reuse the
original tokenization algo on its own, but that would produce sub-standard
results because Lucy's tokenization algos are generally concerned with words
rather than sentences, and excerpts chosen on word boundaries alone don't look
very good.

The present implementation uses improvised sentence boundary detection then
falls back to whitespace -- and then, after your recent patch, to truncation.
IMO, it would be nice to clean up the sentence boundary detection to use the
algo described in UAX #29 instead of the current naive hack.

The remaining question is what to do when sentence boundary detection fails.
We can continue to fall back to whitespace, which works for plain text but
doesn't work well for e.g. URLs.  I think it might make sense to fall back to
the field's tokenization algorithm; we might also consider falling back to a
fixed choice of StandardTokenizer.  Both techniques will work well most of the
time but not all of the time.
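
The fallback order discussed above can be sketched as a chain of boundary finders. This is a Python illustration under stated assumptions, not the Highlighter.c logic: all function names are hypothetical, `sentence_cut` is a naive rule rather than UAX #29, and a `\w+` regex stands in for "the field's tokenization algorithm":

```python
import re


def sentence_cut(text, limit):
    """End of the last '.', '!' or '?' that is followed by whitespace."""
    cuts = [m.end() for m in re.finditer(r"[.!?](?=\s)", text[:limit + 1])]
    return cuts[-1] if cuts else None


def whitespace_cut(text, limit):
    """Start of the last whitespace run at or before limit."""
    cuts = [m.start() for m in re.finditer(r"\s+", text[:limit + 1])]
    cuts = [c for c in cuts if c > 0]
    return cuts[-1] if cuts else None


def token_cut(text, limit):
    """End of the last \\w+ token that fits (stand-in for a tokenizer)."""
    cuts = [m.end() for m in re.finditer(r"\w+", text) if m.end() <= limit]
    return cuts[-1] if cuts else None


def choose_end(text, limit):
    """Try each finder in order; fall back to plain truncation."""
    if len(text) <= limit:
        return len(text)
    for finder in (sentence_cut, whitespace_cut, token_cut):
        cut = finder(text, limit)
        if cut:
            return cut
    return limit


url = "www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220"
url_end = choose_end(url, 40)
```

For the URL, the sentence and whitespace finders both come up empty, so the token-based fallback decides the cut; for ordinary prose the earlier finders win.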

> Such an approach wouldn't depend on the analyzer at all and it wouldn't
> introduce additional coupling of Lucy's components. 

Not sure what I'm missing, but I don't understand the "coupling" concern.  It
seems to me as though it would be desirable code re-use to wrap our sentence
boundary detection mechanism within a battle-tested design like Analyzer,
rather than do something ad-hoc.

I'm actually very excited about getting all that sentence boundary detection
stuff out of Highlighter.c, which will become much easier to grok and maintain
as a result.  Separation of concerns FTW!

> Of course, it would mean implementing a separate Unicode-capable word
> breaking algorithm for the highlighter. But this shouldn't be very hard as
> we could reuse parts of the StandardTokenizer.

IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
It looks much better if you trim excerpts at sentence boundaries, and
word-break algos don't get you those.

Marvin Humphrey

[lucy-issues] [jira] [Commented] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Marvin Humphrey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178535#comment-13178535 ] 

Marvin Humphrey commented on LUCY-199:
--------------------------------------

> We recompute the word boundaries during highlighting like we do now. We
> could use the analyzer of the current schema for that. But we could also use
> any other kind of algorithm that is better than the current one. This might be
> very cheap since we only work on a small subset of the document.

That sounds like a really good idea for a lot of reasons! :)  I don't quite
understand how it will solve this bug, but that's partly because the boundary
detection code in Highlighter is complex and messy -- and using an Analyzer
would help to clean it up.

One thing to bear in mind is that Highlighter is not only concerned with word
boundaries, but sentence boundaries.  Take a look at the excerpts on the SERPs
for Google or any other major web search engine -- they tend to prefer
complete sentences.  Lucy's own highlighter favors sentences just because I
had a gut feeling that it was superior to the random word boundaries chosen by
the Lucene highlighter, but I'm sure there are academic papers by now which
explain why it's desirable.

I note that UAX #29 describes an algorithm for sentence boundary detection.
Our StandardTokenizer implements UAX #29 word boundary tokenization; we could
implement a new Analyzer for sentence boundary detection using the same
techniques.  (Lucy::Analysis::StandardSentenceTokenizer?)  Then we could
leverage Lucy's analysis apparatus for *both* boundary detection phases within
Highlighter, while still utilizing the existing highlighting data generated at
index-time for generating heat maps and scoring excerpt candidates.  That
would get a lot of ugly code out of Highlighter and make it much easier to
work on.
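
To show the shape of output a sentence tokenizer would hand to the Highlighter, here is a small Python sketch. It is an assumption-laden stand-in: the name `sentence_spans` is hypothetical, and the rule used (a sentence ends at '.', '!' or '?' followed by whitespace or end of text) is far simpler than the UAX #29 sentence boundary rules:

```python
import re

# Naive rule, NOT UAX #29: a sentence runs from a non-space character
# to the next '.', '!' or '?' that is followed by whitespace or the
# end of the text, or to the end of the text if there is none.
_SENTENCE = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", re.DOTALL)


def sentence_spans(text):
    """Return (start, end) code-point offsets, one pair per sentence."""
    return [(m.start(), m.end()) for m in _SENTENCE.finditer(text)]


prose = sentence_spans("One. Two! Three")
url = sentence_spans("www.iol.co.za/x")
```

Note that the URL comes back as a single "sentence", which is why a second, finer-grained boundary phase is still needed for fields like the one in this issue.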

If we want a quick fix for this bug, though, I think we could also just wrap
an "if" test around the code which deals with the closing ellipsis and if we
eat the whole string looking for a boundary, fall back to swapping out the
last character for an ellipsis.
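
The quick fix described above might look roughly like this in Python. This is only a sketch of the idea, not the C code in Highlighter.c or the eventual patch; the function name `trim_end` is hypothetical, and only the closing ellipsis is handled:

```python
ELLIPSIS = "\u2026"


def trim_end(text, limit):
    """Trim text to at most `limit` code points, ending in an ellipsis.

    Prefer the last whitespace before the limit. If scanning back for a
    boundary would eat the whole string, fall back to hard truncation,
    with the ellipsis taking the place of the last character.
    """
    if len(text) <= limit:
        return text
    candidate = text[:limit - 1]
    cut = max(candidate.rfind(" "), candidate.rfind("\t"), candidate.rfind("\n"))
    if cut > 0:
        return candidate[:cut].rstrip() + ELLIPSIS
    # No usable boundary: without this branch only the ellipsis is left,
    # which is the symptom reported in this issue.
    return candidate + ELLIPSIS


prose = trim_end("the quick brown fox jumps", 12)
url = trim_end("www.iol.co.za/tonight/books/what-the-dickens", 20)
```

Plain text is still trimmed at whitespace, while a string with no whitespace at all keeps its leading characters instead of collapsing to a bare ellipsis.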


                

[lucy-issues] [jira] [Issue Comment Edited] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Henry (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178019#comment-13178019 ] 

Henry edited comment on LUCY-199 at 12/31/11 2:18 PM:
------------------------------------------------------

I have attached a test case for this issue as requested by Nick.

To use:
mkdir test
cd test
tar zxf hltest1.tgz

./runme   # to create the test index

To show incorrect highlighting (i.e., no highlighting at all) when using a 60-character excerpt length (two hits):
% ./search mweb
&#8230;
&#8230;&#8230;


Another:
./search adsl
&#8230;

More:
./search  news24.com
&#8230;
&#8230;

Finally:
./search  iol.co.za
&#8230;


If you comment out the length in ./search:

- excerpt_length => 60,
+ #excerpt_length => 60,

...and rerun the ./search tests, you'll get the expected highlighting, but obviously with an undesirable length.
                
                  

[lucy-issues] [jira] [Commented] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Marvin Humphrey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184321#comment-13184321 ] 

Marvin Humphrey commented on LUCY-199:
--------------------------------------

+1 to commit to trunk.

+1 to merge to 0.3.

The patch looks right to me, and all tests pass, both on trunk
and the 0.3 branch, including while running under Valgrind.
                

[lucy-issues] [jira] [Updated] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Henry (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henry updated LUCY-199:
-----------------------

    Attachment: hltest1.tgz

test case for highlighter issue
                

[lucy-issues] [jira] [Commented] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Henry (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178019#comment-13178019 ] 

Henry commented on LUCY-199:
----------------------------

I have attached a test case for this issue as requested by Nick.

To use:
mkdir test
cd test
tar zxf hltest1.tgz

./runme   # to create the test index

To show incorrect highlighting (i.e., no highlighting at all) when using a 60-character excerpt length (two hits):
% ./search mweb
&#8230;
&#8230;&#8230;


Another:
./search adsl
&#8230;

More:
./search  news24.com
&#8230;
&#8230;

Finally:
./search  iol.co.za
&#8230;


If you comment out the length in ./search:

- excerpt_length => 60,
+ #excerpt_length => 60,

...and rerun the ./search tests, you'll get the expected highlighting, but obviously with an undesirable length.
                

[lucy-issues] [jira] [Updated] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Nick Wellnhofer (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Wellnhofer updated LUCY-199:
---------------------------------

    Attachment: LUCY-199-quickfix.patch

Here is the simplest partial fix I could come up with. I'm not really happy with it because it can produce excerpts that don't contain the search term. For example, if you set excerpt_length to 40 in Henry's test case, you get the following result:

$ ./search statue
www.iol.co.za/tonight/books/what-the-di&#8230;

The full URL is www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220 and it gets truncated before the term "statue". But the patch is certainly an improvement.
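
The arithmetic behind that example can be checked with a few lines of Python. This assumes, for illustration only, that truncation keeps excerpt_length - 1 code points and appends a single ellipsis; the patch itself is not reproduced in this thread, and `hard_truncate` is a hypothetical name:

```python
ELLIPSIS = "\u2026"


def hard_truncate(text, excerpt_length):
    """Keep excerpt_length - 1 code points and append an ellipsis."""
    if len(text) <= excerpt_length:
        return text
    return text[:excerpt_length - 1] + ELLIPSIS


url = "www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220"
excerpt = hard_truncate(url, 40)
term_offset = url.find("statue")    # where the search term starts
term_survives = "statue" in excerpt
```

The term starts at offset 52, well past the cut at 39, so it cannot appear in a 40-code-point excerpt that always begins at the start of the field.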


                

[lucy-issues] [jira] [Commented] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Nick Wellnhofer (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188734#comment-13188734 ] 

Nick Wellnhofer commented on LUCY-199:
--------------------------------------

Thinking more about a better fix for this problem, it's important to note that choosing a good excerpt is an operation that can be done without knowledge of the actual tokenization algorithm used in the indexing process. I think it's enough to

* find boundaries that are more or less correct in a semantic and visual sense, and
* be tolerant enough to find boundaries in long substrings without whitespace that might exceed excerpt_length (considering that whitespace is the obvious place to break words like in the current implementation).

If the highlighter finds additional word breaks, it shouldn't be a problem as long as the result is visually correct.

Such an approach wouldn't depend on the analyzer at all, and it wouldn't introduce additional coupling between Lucy's components. Of course, it would mean implementing a separate Unicode-capable word-breaking algorithm for the highlighter. But this shouldn't be very hard, as we could reuse parts of the StandardTokenizer.
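
A minimal sketch of such analyzer-independent boundary detection, in Perl. The helper name trim_at_boundary is hypothetical and not part of Lucy, and Perl's \w and \W classes are only a stand-in for a real Unicode word-breaking algorithm. It cuts at the last position within the limit where a word character is followed by a non-word character (whitespace or punctuation such as "/", "." or "-"), and falls back to a hard cut when no such position exists:

```perl
use strict;
use warnings;

# Hypothetical helper, not Lucy's implementation: shorten $text to at most
# $max_len characters plus a trailing ellipsis.
sub trim_at_boundary {
    my ($text, $max_len) = @_;
    return $text if length($text) <= $max_len;

    # Longest prefix of at most $max_len characters that ends on a word
    # character and is immediately followed by a non-word character.
    if ($text =~ /\A(.{1,$max_len})(?<=\w)(?=\W)/s) {
        return $1 . "\x{2026}";
    }

    # One unbroken run of word characters: no boundary to honor, so cut hard
    # rather than return an empty excerpt.
    return substr($text, 0, $max_len) . "\x{2026}";
}

# One of the sample URLs from the report.
my $url = 'www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220';

binmode(STDOUT, ':encoding(UTF-8)');
print trim_at_boundary($url, 60), "\n";
```

Because every punctuation mark counts as a possible break, this finds more boundaries than a tokenizer would, which is acceptable here as long as the result looks right.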
                
> Highlighting/excerpt on URLs 
> -----------------------------
>
>                 Key: LUCY-199
>                 URL: https://issues.apache.org/jira/browse/LUCY-199
>             Project: Lucy
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.2.2 (incubating)
>         Environment: Linux
>            Reporter: Henry
>         Attachments: LUCY-199-quickfix.patch, hltest1.tgz

[lucy-issues] [jira] [Commented] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Nick Wellnhofer (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178438#comment-13178438 ] 

Nick Wellnhofer commented on LUCY-199:
--------------------------------------

Thanks for the excellent test case.

The whole thing is a bug in Highlighter_raw_excerpt. When prepending or appending an ellipsis, the code tries to make sure this happens on a word boundary, so it chops off words to make room for the ellipsis. Unfortunately, it simply looks for whitespace to determine a word boundary. In the case of URLs, it doesn't find any whitespace and deletes the whole URL from raw_excerpt.
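
The failure mode can be reproduced outside Lucy with a small Perl sketch. trim_to_whitespace is a hypothetical stand-in for the trimming step, not the actual Highlighter_raw_excerpt code; it only mimics the whitespace-only boundary search described above:

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT Lucy's code: drop the trailing partial word so
# that the ellipsis lands on a boundary, where "boundary" means whitespace only.
sub trim_to_whitespace {
    my ($text, $max_len) = @_;
    return $text if length($text) <= $max_len;

    my $window = substr($text, 0, $max_len);
    if ($window =~ /\s/) {
        # Remove the last whitespace run and the partial word after it.
        $window =~ s/\s+\S*\z//;
    }
    else {
        # No whitespace at all: the whole window counts as one partial
        # word and is thrown away.
        $window = '';
    }
    return $window . "\x{2026}";
}

my $prose = 'The quick brown fox jumps over the lazy dog and keeps running far away';
my $url   = 'www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220';

binmode(STDOUT, ':encoding(UTF-8)');
print trim_to_whitespace($prose, 60), "\n";    # keeps most of the sentence
print trim_to_whitespace($url,   60), "\n";    # only the ellipsis survives
```

Ordinary prose loses at most one word, while a URL longer than the limit is reduced to a bare ellipsis, which matches the reported output.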

I see two approaches to fix this:

* Since the word boundaries are already computed during analysis, we could try to reuse this data. AFAICS this would mean looping through all the terms of the document, extracting all start and end offsets, and finally sorting them. I'm not sure how expensive this would be.
* We recompute the word boundaries during highlighting like we do now. We could use the analyzer of the current schema for that, but we could also use any other kind of algorithm that is better than the current one. This might be very cheap since we only work on a small subset of the document.
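
To make the first approach concrete, here is a rough Perl sketch. The offsets are typed in by hand as an assumption about what the analysis data might look like; nothing here reads from a Lucy index, and last_end_within is a hypothetical helper. It sorts the token end offsets and picks the last one that fits within a length limit:

```perl
use strict;
use warnings;

my $text = 'www.iol.co.za/tonight/books';

# Assumed [start, end) character offsets of the tokens, in no particular
# order, the way they might come back when iterating over a document's terms.
my @spans = ([14, 21], [0, 3], [22, 27], [8, 10], [4, 7], [11, 13]);

# Collect and sort the end offsets; each one marks a position where a token stops.
my @ends = sort { $a <=> $b } map { $_->[1] } @spans;

# Largest token end that still fits within $limit characters, or 0 if none does.
sub last_end_within {
    my ($ends, $limit) = @_;
    my $best = 0;
    for my $end (@$ends) {
        last if $end > $limit;
        $best = $end;
    }
    return $best;
}

my $cut = last_end_within(\@ends, 20);
print substr($text, 0, $cut), "\n";
```

The start offsets could be sorted the same way to pick a boundary for a leading ellipsis. The sort over every token of the document is the cost that the first approach would have to pay.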

                

[lucy-issues] [jira] [Issue Comment Edited] (LUCY-199) Highlighting/excerpt on URLs

Posted by "Henry (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178019#comment-13178019 ] 

Henry edited comment on LUCY-199 at 12/31/11 2:17 PM:
------------------------------------------------------

I have attached a test case for this issue as requested by Nick.

To use:
<pre>
mkdir test
cd test
tar zxf hltest1.tgz

./runme   # to create the test index
</pre>

To show the incorrect highlighting (i.e., no highlighting at all) when using an excerpt length of 60 characters (two hits):
% ./search mweb
&#8230;
&#8230;&#8230;


Another:
./search adsl
&#8230;

More:
./search  news24.com
&#8230;
&#8230;

Finally:
./search  iol.co.za
&#8230;


If you comment out the length in ./search:

- excerpt_length => 60,
+ #excerpt_length => 60,

...and rerun the ./search tests, you'll get the expected highlighting, but obviously with an undesirable length.
                
                  