Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2012/01/19 03:28:00 UTC

[lucy-dev] Highlighter excerpt boundaries

(Moving this thread from the issue tracker to the dev list because it's now
about an approach rather than a specific patch...)

On Wed, Jan 18, 2012 at 10:06:41PM +0000, Nick Wellnhofer (Commented) (JIRA) wrote:
https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188734#comment-13188734

> Thinking more about a better fix for this problem, it's important to note
> that choosing a good excerpt is an operation that can be done without
> knowledge of the actual tokenization algorithm used in the indexing process.

There are multiple phases involved:

  1. Identify sections of text that contain relevant material -- i.e. that
     contributed to the search-time score of the document.
  2. Pick one contiguous chunk of text which seems to contain a lot of
     relevant material.
  3. Choose precise start and end points for the excerpt.

Phase 1 actually *does* require knowledge of the tokenization algorithm.  We
delegate creation of the HeatMap to our Query classes (technically, our
"Compiler" weighted Query classes).  They only handle granularity down to the
level of a token, so we need to provide them with a mapping of token-number =>
[start-offset,end-offset] in order to generate a HeatMap containing Spans
measured in code-point offsets; these code-point offsets are later used when
inserting highlight tags.

In our present implementation, however, offset information is captured at
index-time (via HighlightWriter), so our Highlighter objects don't technically
need to know about the tokenization algo (as encapsulated in the highlight
field's Analyzer).

Phase 2 does not require knowledge of the tokenization algo.

Phase 3 can be implemented several different ways.  It *could* reuse the
original tokenization algo on its own, but that would produce sub-standard
results because Lucy's tokenization algos are generally concerned with words
rather than sentences, and excerpts chosen on word boundaries alone don't look
very good.

The present implementation uses improvised sentence boundary detection, then
falls back to whitespace -- and then, after your recent patch, to truncation.
IMO, it would be nice to clean up the sentence boundary detection to use the
algo described in UAX #29 instead of the current naive hack.

The remaining question is what to do when sentence boundary detection fails.
We can continue to fall back to whitespace, which works for plain text but
doesn't work well for e.g. URLs.  I think it might make sense to fall back to
the field's tokenization algorithm; we might also consider falling back to a
fixed choice of StandardTokenizer.  Both techniques will work well most of the
time but not all of the time.

> Such an approach wouldn't depend on the analyzer at all and it wouldn't
> introduce additional coupling of Lucy's components. 

Not sure what I'm missing, but I don't understand the "coupling" concern.  It
seems to me as though it would be desirable code re-use to wrap our sentence
boundary detection mechanism within a battle-tested design like Analyzer,
rather than do something ad-hoc.

I'm actually very excited about getting all that sentence boundary detection
stuff out of Highlighter.c, which will become much easier to grok and maintain
as a result.  Separation of concerns FTW!

> Of course, it would mean implementing a separate Unicode-capable word
> breaking algorithm for the highlighter. But this shouldn't be very hard as
> we could reuse parts of the StandardTokenizer.

IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
It looks much better if you trim excerpts at sentence boundaries, and
word-break algos don't get you those.

Marvin Humphrey


Re: [lucy-dev] Highlighter excerpt boundaries

Posted by Peter Karman <pe...@peknet.com>.
On 1/19/12 6:52 PM, Marvin Humphrey wrote:
>
> It's rare that we need to optimize for performance.  Most of the time we
> should be optimizing for maintainability.

+1

> I suspect that at some point we will want to expose sentence boundary
> detection via a public API, because people who subclass Highlighter may want
> to use it.

+1 here too.

I have been putting some work into sentence boundary detection in 
Search::Tools, and I would love to see some thinking amongst the bright 
people here about how best to do it.

>
> It seems to me that publishing UAX #29 sentence boundary detection via an
> Analyzer is a conservative API extension, since it's so closely related to the
> UAX #29 word boundary detection we expose via StandardTokenizer.
>
> So that explains what I was thinking.  But of course refactoring sentence
> boundary detection into a string utility function also achieves the end of
> cleaning up Highlighter.c just as effectively, and might be more elegant --
> who knows?
>
> Until we actually expose this capability via a public API, either approach
> should work fine.

Agreed here too.



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Highlighter excerpt boundaries

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Jan 19, 2012 at 11:43:59AM +0100, Nick Wellnhofer wrote:
>> Not sure what I'm missing, but I don't understand the "coupling" concern.  It
>> seems to me as though it would be desirable code re-use to wrap our sentence
>> boundary detection mechanism within a battle-tested design like Analyzer,
>> rather than do something ad-hoc.
>
> The analyzers are designed to split a whole string into tokens. In the
> highlighter we only need to find a single boundary near a certain  
> position in a string. So the analyzer interface isn't an ideal fit for  
> the highlighter. The performance hit of running a tokenizer over the  
> whole substring shouldn't be a problem but I'd still like to consider  
> alternatives.

It's rare that we need to optimize for performance.  Most of the time we
should be optimizing for maintainability.

I'm advocating using Analyzer because we have several of them, and because the
parallelism between StandardTokenizer and a StandardSentenceTokenizer based on
UAX #29 would lower the cost of maintaining them.

However, that's only one way to optimize for maintainability, and it's not
necessarily the best available stratagem.  It may be that low level code
leveraging an Analyzer is verbose... or not... we'd just have to try.

>> I'm actually very excited about getting all that sentence boundary detection
>> stuff out of Highlighter.c, which will become much easier to grok and maintain
>> as a result.  Separation of concerns FTW!
>
> We could also move the boundary detection to a string utility class.

I suspect that at some point we will want to expose sentence boundary
detection via a public API, because people who subclass Highlighter may want
to use it.  Father Chrysostomos did when he wrote KSx::Highlight::Summarizer.
(The old KinoSearch Highlighter exposed a find_sentences() method at one
point.  It was a victim of the C rewrite; Highlighter was one of the harder
modules to port.)

It seems to me that publishing UAX #29 sentence boundary detection via an
Analyzer is a conservative API extension, since it's so closely related to the
UAX #29 word boundary detection we expose via StandardTokenizer.

So that explains what I was thinking.  But of course refactoring sentence
boundary detection into a string utility function also achieves the end of
cleaning up Highlighter.c just as effectively, and might be more elegant --
who knows?

Until we actually expose this capability via a public API, either approach
should work fine.

>>> Of course, it would mean implementing a separate Unicode-capable word
>>> breaking algorithm for the highlighter. But this shouldn't be very hard as
>>> we could reuse parts of the StandardTokenizer.
>>
>> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
>> It looks much better if you trim excerpts at sentence boundaries, and
>> word-break algos don't get you those.
>
> I would keep the sentence boundary detection, of course. I'm only  
> talking about the word breaking part.

Groovy, sounds like we're on the same page about that then. :)

Marvin Humphrey


Re: [lucy-dev] Highlighter excerpt boundaries

Posted by Nick Wellnhofer <we...@aevum.de>.
On 19/01/2012 03:28, Marvin Humphrey wrote:
> Phase 3 can be implemented several different ways.  It *could* reuse the
> original tokenization algo on its own, but that would produce sub-standard
> results because Lucy's tokenization algos are generally concerned with words
> rather than sentences, and excerpts chosen on word boundaries alone don't look
> very good.

You're right. I was only talking about Phase 3.

>> Such an approach wouldn't depend on the analyzer at all and it wouldn't
>> introduce additional coupling of Lucy's components.
>
> Not sure what I'm missing, but I don't understand the "coupling" concern.  It
> seems to me as though it would be desirable code re-use to wrap our sentence
> boundary detection mechanism within a battle-tested design like Analyzer,
> rather than do something ad-hoc.

The analyzers are designed to split a whole string into tokens. In the 
highlighter we only need to find a single boundary near a certain 
position in a string. So the analyzer interface isn't an ideal fit for 
the highlighter. The performance hit of running a tokenizer over the 
whole substring shouldn't be a problem but I'd still like to consider 
alternatives.

> I'm actually very excited about getting all that sentence boundary detection
> stuff out of Highlighter.c, which will become much easier to grok and maintain
> as a result.  Separation of concerns FTW!

We could also move the boundary detection to a string utility class.

>> Of course, it would mean implementing a separate Unicode-capable word
>> breaking algorithm for the highlighter. But this shouldn't be very hard as
>> we could reuse parts of the StandardTokenizer.
>
> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
> It looks much better if you trim excerpts at sentence boundaries, and
> word-break algos don't get you those.

I would keep the sentence boundary detection, of course. I'm only 
talking about the word breaking part.

Nick