Posted to java-user@lucene.apache.org by David Smiley <da...@gmail.com> on 2017/05/26 13:57:02 UTC

Highlighting and delineating Passages (fragmenting)

I was recently asked if/how the UnifiedHighlighter can return a Passage
centered around the highlighted words.  I'm responding to a wider audience
(java-user list, ...).

Each highlighter implementation fragments the content into passages (with
highlights) using a different algorithm.

The UnifiedHighlighter (and the now-defunct PostingsHighlighter from which
it derives) fragments the content into passages based entirely on a
java.text.BreakIterator.  A BreakIterator only knows about the content
(it's initialized with it via setText(String)); it doesn't know where the
highlighted words are.  This is why the default UH BreakIterator impl is a
sentence-based one, and most people will probably leave it that way.  Given
how the UH actually uses the BreakIterator, you can write a custom one that
is designed to work only with this highlighter and makes some assumptions
about how it's called, resulting in fragmentation that isn't so rigidly
based on the content.  The LengthGoalBreakIterator is such a BreakIterator.
But it can only "see" the first highlighted word of a passage, and it makes
its fragmentation decisions based on that alone.
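
To make that limitation concrete, here is a minimal sketch that uses only
the JDK's java.text.BreakIterator and no Lucene classes.  The class and
method names (SentencePassages, split, passageFor) are made up for
illustration, and this is not the UH's actual code.  It only shows that the
passage boundaries are computed from the text alone, so a match offset can
select a passage but cannot move its edges.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentencePassages {

    /** Splits content into [start, end) spans using only the text itself. */
    static List<int[]> split(String content) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(content);
        List<int[]> spans = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            spans.add(new int[] {start, end});
        }
        return spans;
    }

    /** Returns the span containing matchOffset, or null; the match never changes a boundary. */
    static int[] passageFor(List<int[]> spans, int matchOffset) {
        for (int[] span : spans) {
            if (matchOffset >= span[0] && matchOffset < span[1]) {
                return span;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String content = "First sentence here. Second one has the match. Third.";
        int[] p = passageFor(split(content), content.indexOf("match"));
        System.out.println(content.substring(p[0], p[1]));
    }
}
```

However long the sentence around the match is, that whole sentence is the
passage, which is why very long sentences produce very long snippets.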

The other two highlighters (the original Highlighter and I think the
FastVectorHighlighter) are more flexible in this regard; they have their
own abstraction that allows for Passages to be formed sensitive to where
exactly the highlighted words are.  Thus you could fairly easily achieve a
goal of say, 10 words before the first highlighted word, and highlight more
words within 10 words of each other until the next is too far away, then 10
more trailing words with the original Highlighter.  I suspect
FastVectorHighlighter can do this too, but its API confuses me.  The
FastVectorHighlighter also uses a BreakIterator in
BreakIteratorBoundaryScanner, but its use is entirely different from how
the UnifiedHighlighter uses one.
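
The "10 words before, merge matches within 10 words, 10 words after" goal
can be sketched independently of any highlighter.  The code below is plain
Java working on word positions; CenteredPassages and build are hypothetical
names, and wiring this into the original Highlighter's fragmenter
abstraction is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class CenteredPassages {

    /**
     * Groups sorted match positions (word indexes) into passages.
     * A match joins the current passage when it lies within `context` words
     * of the previous match; each passage is padded by `context` words on
     * both sides and clamped to the field. Returns inclusive [first, last] ranges.
     */
    static List<int[]> build(int wordCount, int[] matchPositions, int context) {
        List<int[]> passages = new ArrayList<>();
        int i = 0;
        while (i < matchPositions.length) {
            int first = matchPositions[i];
            int last = first;
            while (i + 1 < matchPositions.length && matchPositions[i + 1] - last <= context) {
                last = matchPositions[++i];
            }
            passages.add(new int[] {
                Math.max(0, first - context),
                Math.min(wordCount - 1, last + context)
            });
            i++;
        }
        return passages;
    }

    public static void main(String[] args) {
        // Matches at words 20 and 25 are merged; the one at 60 is too far away.
        for (int[] p : build(100, new int[] {20, 25, 60}, 10)) {
            System.out.println(p[0] + ".." + p[1]); // prints 10..35 then 50..70
        }
    }
}
```

Note that two passages can still overlap when the gap between matches is
larger than the context but smaller than twice the context; a caller that
cares would merge those afterwards.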

Perhaps the UnifiedHighlighter should be enhanced to make more flexible
fragmentation algorithms possible.  Today you'd need to override
FieldHighlighter.highlightOffsetsEnums, which is a lot to ask of anyone;
even getting to that point is annoying, and then re-implementing the method
is onerous since it's so complex -- it's really the heart of the UH.  The UH could add
an entirely new abstraction apart from BreakIterators (with a BI based impl
available), or perhaps an optional marker interface for UH-aware
BreakIterators.  The former (a new abstraction) would be cleaner, and might
also remove a wart in the API due to the statefulness of BreakIterators.
It's also kinda hard to write a BI correctly. I've implemented a few
already and I know.  It's an old API.

~ David

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Highlighting and delineating Passages (fragmenting)

Posted by Dawid Weiss <da...@gmail.com>.

https://issues.apache.org/jira/browse/SOLR-1105

Yes, this is spot-on what I need with regard to copyTo fields, thanks
for the link!

> Or are the overlaps coming from passage offset ranges from separate queries to the same content?

The overlaps are caused by the fact that we have multiple sources of
highlight data -- the query is one, our own scope/features is another
(and they can overlap). So the highlighter we wrote pretty much
doesn't care about where the "highlights" come from or whether they
are contiguous, overlapping or nested -- it will figure out how to
properly reorganize them into a tree structure (there are scenarios
which require splitting a highlight into multiple chunks for example),
score them and return the best passages. We take only hit offset data
from UH (and it's a great helper here, given the complexity of the
task).

I may return to this later on, depending on how the project progresses
-- if so, I'd love to somehow help make the "default" highlighting
better (or easier to use).

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Highlighting and delineating Passages (fragmenting)

Posted by David Smiley <da...@gmail.com>.

On Tue, May 30, 2017 at 9:25 AM Dawid Weiss <da...@gmail.com> wrote:

> > #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
> > The UH can't currently do this; but with the OH (original Highlighter) you
> > can but it appears somewhat awkward.  See SimpleSpanFragmenter.  I had said
> > it was easy but I was mistaken; I'm getting rustier on the OH.
>
> Well, the requirement here is that we do want the context of a hit and
> "broaden" it to roughly X characters (total). I do see that OH does
> have something like this with regexp fragmenter (slop factor), but I
> hoped this should be somewhat easier. I just spent an hour or so
> trying to tune it, but without much success.
>

What the regex fragmenter lacks, and what is critical here, is something
SimpleSpanFragmenter has: access to the QueryScorer, through which it gets
the WeightedSpanTerm that yields the positions of the actual matched words.

>
> > With the original Highlighter, you get TextFragment instances which contain
> > the textStartPos & textEndPos.  You can use that info to conditionally add
> > ellipsis.
>
> Yup, I realize that. I just wanted something that'd do it out of the
> box in Solr because I didn't want to add custom code to the
> distribution/core. Sigh.
>

I think it'd be a nice option for the UH's DefaultPassageFormatter to add
ellipsis at the boundaries.  You could file a patch or just a feature
request in JIRA.

> > With the original Highlighter, you can easily do this by providing the
> > stored text.  When you create the QueryScorer, use "null" for field name to
> > highlight all query fields.  The UH can do this as well by highlighting the
> > fields that are stored, and call setFieldMatcher to provide a Predicate that
> > always returns true.
>
> Wouldn't this be equivalent to requireFieldMatch=false?

Yes.

> It's not
> exactly what I had in mind -- I don't want to highlight across all
> fields, I want to highlight those that actually contributed to the
> document being selected. Imagine the following:
>
> { a: "foo bar",
>   b: "foo baz",
>   c: "foo bat" }
>
> Let's say "a" and "b" are copied to the sink field (default search
> field), but "c" is not. The highlighter is asked to highlight all
> fields. For a query: "foo" it should return a highlight on "a" and
> "b", but not "c". On the other hand, a query "c:foo" should only
> highlight "c". In other words -- the user should clearly see which
> fields actually contributed to the document being part of the search
> result. requireFieldMatch=false is a really crude cannon to solve
> this.
>

Sure I understand.  With the OH this may not be possible.  With the UH, you
could have a more selective predicate.  On the Solr side you'd need to
devise a hook to make this configurable.  See
https://issues.apache.org/jira/browse/LUCENE-7768 & SOLR-1105 for a rather
different approach.
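
As a rough illustration of what "a more selective predicate" could mean for
the copy-field example quoted above, here is a plain-Java sketch.
CopyFieldMatcher, forHighlightedField and the sinkToSources map are all
hypothetical; the map stands in for whatever copy-field configuration the
application has.  How such a predicate gets wired into the UH or into Solr
is deliberately left out, since that is the hook that would still have to
be devised.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class CopyFieldMatcher {

    /**
     * Builds a predicate over the *query* field name for one highlighted field.
     * A query term applies if it targets the highlighted field directly, or if
     * it targets a sink field that the highlighted field is copied into.
     *
     * @param sinkToSources hypothetical config: sink field -> its source fields
     */
    static Predicate<String> forHighlightedField(String highlightedField,
                                                 Map<String, Set<String>> sinkToSources) {
        return queryField ->
            queryField.equals(highlightedField)
                || sinkToSources
                       .getOrDefault(queryField, Collections.emptySet())
                       .contains(highlightedField);
    }
}
```

With "a" and "b" copied into a sink field named, say, "text", a query on
"text" would match when highlighting "a" or "b" but not "c", while a query
on "c" would match only when highlighting "c".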

>
> > Yeah I already explained why your snippet-centering requirement simply can't
> > be met with the UH.
>
> Thanks, I thought so. We actually have a custom highlighter (unrelated
> to Solr) in our commercial product that works on a slightly different
> basis than what can be found in Lucene (I think). The pipeline there
> is as follows:
>
> 1) determine "highlight" offset ranges (from, to, type). Highlight
> "types" can be different so that, for example, one can highlight two
> queries at once (and they can overlap in all kinds of ways).
> 2) process highlight ranges so that they're hierarchically nested
> (split non-tree-like overlaps into hierarchical descents). This
> permits emitting easier html markup later on.
> 3) expand each highlight range to fit certain criteria (typically the
> desired length of the snippet), this expansion here uses a break
> iterator (on words) and respects certain hard limits (like value
> boundaries for multivalue fields);
> 4) score each such expanded range; the scoring formula checks if there
> are any other highlights that fall within the same window; if so, they
> receive a higher score. This results in multi-term matches typically
> ending up at the top of the scoring list.
> 5) emit the best scoring ranges, marking highlights properly.
>
> We actually use UnifiedHighlighter for the first step above, the rest
> is custom. It can be used to pretty much highlight anything since the
> inputs are the text itself and the ranges to highlight (offsets +
> type). Note it doesn't solve the problem of the default field
> highlighting -- this is something that'd have to be addressed
> separately, but it's been working for us fairly well in practice.
>
> I'd be glad to contribute this code back to Lucene, but it's kind of
> detached from the infrastructure and it'd require some work to
> integrate. :(
>

Interesting.  Your strategy is based on the notion of highlight offset
ranges that might overlap.  Are the offset ranges for span query ranges
(including simple phrases)?  Presently the UH's PhraseHelper is designed in
such a way that it filters individual terms' positions, and thus the
resulting OffsetsEnum instances yield offset windows that cover only each
term's offsets rather than entire query spans.  Tim and I
discussed the idea of future improvements to redo this so that it'd be span
based, which is kinda interrelated with another TODO on more accurate
phrase highlighting since there are some rare but possible holes in the
current approach: https://issues.apache.org/jira/browse/LUCENE-5455

Or are the overlaps coming from passage offset ranges from separate queries
to the same content?  That I could understand better based on everything
you said.  I'm not sure how your code could be contributed in a way that
fits in with everything else; it seems like a very specialized case.

Since you are already using the UH, and it's not impossible to do the
centered passage thing, you could go that route.  You need to return your
own FieldHighlighter impls so that you can override highlightOffsetsEnums.
At the Solr layer you can subclass SolrExtendedUnifiedHighlighter.  It's
deliberate that these things are extensible; I've seen users need to do
this and it's why we have a TestUnifiedHighlighterExtensibility test to
help us keep this extensible.

~ David

Re: Highlighting and delineating Passages (fragmenting)

Posted by Dawid Weiss <da...@gmail.com>.

> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
> The UH can't currently do this; but with the OH (original Highlighter) you
> can but it appears somewhat awkward.  See SimpleSpanFragmenter.  I had said
> it was easy but I was mistaken; I'm getting rustier on the OH.

Well, the requirement here is that we do want the context of a hit and
"broaden" it to roughly X characters (total). I do see that OH does
have something like this with regexp fragmenter (slop factor), but I
hoped this should be somewhat easier. I just spent an hour or so
trying to tune it, but without much success.

> With the original Highlighter, you get TextFragment instances which contain
> the textStartPos & textEndPos.  You can use that info to conditionally add
> ellipsis.

Yup, I realize that. I just wanted something that'd do it out of the
box in Solr because I didn't want to add custom code to the
distribution/core. Sigh.

> With the original Highlighter, you can easily do this by providing the
> stored text.  When you create the QueryScorer, use "null" for field name to
> highlight all query fields.  The UH can do this as well by highlighting the
> fields that are stored, and call setFieldMatcher to provide a Predicate that
> always returns true.

Wouldn't this be equivalent to requireFieldMatch=false? It's not
exactly what I had in mind -- I don't want to highlight across all
fields, I want to highlight those that actually contributed to the
document being selected. Imagine the following:

{ a: "foo bar",
  b: "foo baz",
  c: "foo bat" }

Let's say "a" and "b" are copied to the sink field (default search
field), but "c" is not. The highlighter is asked to highlight all
fields. For a query: "foo" it should return a highlight on "a" and
"b", but not "c". On the other hand, a query "c:foo" should only
highlight "c". In other words -- the user should clearly see which
fields actually contributed to the document being part of the search
result. requireFieldMatch=false is a really crude cannon to solve
this.

> Yeah I already explained why your snippet-centering requirement simply can't
> be met with the UH.

Thanks, I thought so. We actually have a custom highlighter (unrelated
to Solr) in our commercial product that works on a slightly different
basis than what can be found in Lucene (I think). The pipeline there
is as follows:

1) determine "highlight" offset ranges (from, to, type). Highlight
"types" can be different so that, for example, one can highlight two
queries at once (and they can overlap in all kinds of ways).
2) process highlight ranges so that they're hierarchically nested
(split non-tree-like overlaps into hierarchical descents). This
permits emitting easier html markup later on.
3) expand each highlight range to fit certain criteria (typically the
desired length of the snippet), this expansion here uses a break
iterator (on words) and respects certain hard limits (like value
boundaries for multivalue fields);
4) score each such expanded range; the scoring formula checks if there
are any other highlights that fall within the same window; if so, they
receive a higher score. This results in multi-term matches typically
ending up at the top of the scoring list.
5) emit the best scoring ranges, marking highlights properly.
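
Steps 3 and 4 of the list above could look roughly like the following
sketch.  It is not the code from the product described here; SnippetWindows,
expand and score are invented names, the expansion is simply symmetric, and
the score is just a count of highlights inside the window.  Only the JDK's
word BreakIterator is used.

```java
import java.text.BreakIterator;
import java.util.List;
import java.util.Locale;

final class SnippetWindows {

    /**
     * Step 3: grows the highlight [from, to) toward targetLength characters,
     * staying inside [hardStart, hardEnd) (e.g. one value of a multivalued
     * field) and snapping inward to word boundaries so no word is cut.
     */
    static int[] expand(String text, int from, int to, int targetLength,
                        int hardStart, int hardEnd) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int pad = Math.max(0, (targetLength - (to - from)) / 2);
        int start = Math.max(hardStart, from - pad);
        int end = Math.min(hardEnd, to + pad);
        if (!words.isBoundary(start)) {
            start = words.following(start); // move right to the next boundary
        }
        if (!words.isBoundary(end)) {
            end = words.preceding(end);     // move left to the previous boundary
        }
        // Never shrink past the highlight itself.
        return new int[] {Math.min(start, from), Math.max(end, to)};
    }

    /** Step 4: a window scores higher the more highlights fall entirely inside it. */
    static int score(int[] window, List<int[]> highlights) {
        int count = 0;
        for (int[] h : highlights) {
            if (h[0] >= window[0] && h[1] <= window[1]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "alpha beta gamma delta epsilon zeta eta theta";
        int[] w = expand(text, 17, 22, 20, 0, text.length()); // highlight = "delta"
        System.out.println("[" + text.substring(w[0], w[1]).trim() + "]");
    }
}
```

The word iterator treats runs of whitespace as segments too, so a window
can begin or end on a space; trimming before display takes care of that.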

We actually use UnifiedHighlighter for the first step above, the rest
is custom. It can be used to pretty much highlight anything since the
inputs are the text itself and the ranges to highlight (offsets +
type). Note it doesn't solve the problem of the default field
highlighting -- this is something that'd have to be addressed
separately, but it's been working for us fairly well in practice.

I'd be glad to contribute this code back to Lucene, but it's kind of
detached from the infrastructure and it'd require some work to
integrate. :(

Dawid

Re: Highlighting and delineating Passages (fragmenting)

Posted by David Smiley <da...@gmail.com>.

Looks like you should use the original Highlighter until requirement #2,3
can be done with the UnifiedHighlighter.  Other than #2,3, the UH can
handle all these requirements, and the OH can do all.

On Sat, May 27, 2017 at 6:08 AM Dawid Weiss <da...@gmail.com> wrote:

> Thanks for your explanation, David.
>
> I actually found working with all Lucene highlighters pretty
> difficult. I have a few requirements which seemed deceptively simple:
>
> 1) highlight query hit regions (phrase, fuzzy, terms);
>

They all do this (not considering the now removed PostingsHighlighter).

> 2) try to organise the resulting snippets to visually "center" the hit
> regions so that the context of the hit is visible,
> 3) keep the snippet limited to ~x characters (this means breaking on
> word boundaries, typically, but keeping the overall length of the
> snippet close to x).
>

#2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
The UH can't currently do this; but with the OH (original Highlighter) you
can but it appears somewhat awkward.  See SimpleSpanFragmenter.  I had said
it was easy but I was mistaken; I'm getting rustier on the OH.

> 4) add visual cues whether the snippet is part of a larger text
> (ellipsis). This should be done intelligently -- if a snippet is
> actually the whole field or start/ends on the field boundary no
> ellipsis should be added.
>

With the original Highlighter, you get TextFragment instances which contain
the textStartPos & textEndPos.  You can use that info to conditionally add
ellipsis.
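
The conditional part is simple once the fragment's start and end offsets
are in hand.  A minimal sketch, in plain Java: EllipsisFormatter and format
are hypothetical names, and how the offsets are obtained from the
highlighter is not shown.

```java
final class EllipsisFormatter {

    /**
     * Wraps the fragment [start, end) of fieldText with ellipses, but only on
     * the sides where text was actually cut off. A fragment that starts or
     * ends on the field boundary gets no ellipsis on that side.
     */
    static String format(String fieldText, int start, int end) {
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("... ");
        }
        sb.append(fieldText, start, end);
        if (end < fieldText.length()) {
            sb.append(" ...");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String field = "the quick brown fox";
        System.out.println(format(field, 0, field.length())); // whole field, no ellipsis
        System.out.println(format(field, 4, 15));             // cut on both sides
    }
}
```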

> 5) For performance reasons we typically have a single copy-to field
> that is used as the default field for the query parser. But for the
> user interface needs we'd have to go back and try to highlight the
> original fields that formed this content. This is probably the most
> difficult and I didn't expect it to be solved with existing
> highlighters, but it'd be a great thing to have eventually.
>

With the original Highlighter, you can easily do this by providing the
stored text.  When you create the QueryScorer, use "null" for field name to
highlight all query fields.  The UH can do this as well by highlighting the
fields that are stored, and call setFieldMatcher to provide a Predicate
that always returns true.

> Some of the above are possible with existing highlighters, some are
> not. Having a limited snippet length and keeping word boundary breaks
> turned out to be most confusing to me with the unified highlighter, for
> example. I can't use the sentence break iterator because the text in
> question occasionally has super-long word sequences that result in
> snippets that are enormous.
>

Yeah I already explained why your snippet-centering requirement simply
can't be met with the UH.  Perhaps it might help if the UH documented its
overall algorithm a bit more, even if all the documentation in the world
won't enable it to meet your requirement. At least it would help you know
sooner if it can or not :-)

~ David

Re: Highlighting and delineating Passages (fragmenting)

Posted by Evert Wagenaar <ev...@gmail.com>.

I always assumed this was the default behaviour of the
Lucene TermHighlighter but I could be mistaken with an older version.
I found out that there are major differences between Lucene and Solr
though, with which I have similar problems.

Best regards,

Evert Wagenaar

http://www.evertwagenaar.com/

On Sat, May 27, 2017 at 12:08, Dawid Weiss <da...@gmail.com> wrote:

> Thanks for your explanation, David.
>
> I actually found working with all Lucene highlighters pretty
> difficult. I have a few requirements which seemed deceptively simple:
>
> 1) highlight query hit regions (phrase, fuzzy, terms);
> 2) try to organise the resulting snippets to visually "center" the hit
> regions so that the context of the hit is visible,
> 3) keep the snippet limited to ~x characters (this means breaking on
> word boundaries, typically, but keeping the overall length of the
> snippet close to x).
> 4) add visual cues whether the snippet is part of a larger text
> (ellipsis). This should be done intelligently -- if a snippet is
> actually the whole field or start/ends on the field boundary no
> ellipsis should be added.
> 5) For performance reasons we typically have a single copy-to field
> that is used as the default field for the query parser. But for the
> user interface needs we'd have to go back and try to highlight the
> original fields that formed this content. This is probably the most
> difficult and I didn't expect it to be solved with existing
> highlighters, but it'd be a great thing to have eventually.
>
> Some of the above are possible with existing highlighters, some are
> not. Having a limited snippet length and keeping word boundary breaks
> turned out to be most confusing to me with the unified highlighter, for
> example. I can't use the sentence break iterator because the text in
> question occasionally has super-long word sequences that result in
> snippets that are enormous.
>
> I'll keep thinking.
>
> Dawid
>

Re: Highlighting and delineating Passages (fragmenting)

Posted by Dawid Weiss <da...@gmail.com>.

Thanks for your explanation, David.

I actually found working with all Lucene highlighters pretty
difficult. I have a few requirements which seemed deceptively simple:

1) highlight query hit regions (phrase, fuzzy, terms);
2) try to organise the resulting snippets to visually "center" the hit
regions so that the context of the hit is visible,
3) keep the snippet limited to ~x characters (this means breaking on
word boundaries, typically, but keeping the overall length of the
snippet close to x).
4) add visual cues whether the snippet is part of a larger text
(ellipsis). This should be done intelligently -- if a snippet is
actually the whole field or start/ends on the field boundary no
ellipsis should be added.
5) For performance reasons we typically have a single copy-to field
that is used as the default field for the query parser. But for the
user interface needs we'd have to go back and try to highlight the
original fields that formed this content. This is probably the most
difficult and I didn't expect it to be solved with existing
highlighters, but it'd be a great thing to have eventually.

Some of the above are possible with existing highlighters, some are
not. Having a limited snippet length and keeping word boundary breaks
turned out to be most confusing to me with the unified highlighter, for
example. I can't use the sentence break iterator because the text in
question occasionally has super-long word sequences that result in
snippets that are enormous.

I'll keep thinking.

Dawid

