Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2014/02/04 07:20:14 UTC

Highlighting text, do I seriously have to reimplement this from scratch?

Hi all.

I'm trying to find a precise and reasonably efficient way to highlight
all occurrences of terms in the query, only highlighting fields which
match the corresponding fields used in the query. This seems like it
would be a fairly common requirement in applications. We have an
existing implementation, but it works by re-reading the entire text
back through the analyser. This is slow for large text, and sometimes
we analyse the same text twice - and both variants could well be in
the query. So I'm looking for a shortcut.

Perhaps due to the name, Lucene's highlighter module got my attention,
so I tried using that. The prototype I wrote *did* produce acceptable
results for the highlighting itself, but when it came time to think
about integrating into the real application, there didn't seem to be a
single part of the highlighter API designed to allow for that.

So I guess I will be forced to categorise lucene-highlighter as a
"toy", or perhaps as a fairly complete example of how to do
highlighting, and it might be useful for that at least.

What's wrong with the API?


Issue #1 - The API forces me to pass in a String.

Just because the highlighter wants some character data, I have to pass
String. Text can be very large and I would rather not have to wait for
the entire text to read into memory before I can pass it off to the
highlighter.

String is a final class, so any API which requires it for feeding in
something like character data is committing a massive sin, in my
opinion. If your text is in a database, you will have to retrieve
*all* of the text before you can use *any* of it for highlighting.

Had the API accepted something like Reader, CharBuffer or even
CharSequence, there would be no problem. We could make an alternative
implementation which reads directly from whatever storage it's in.
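
To illustrate what a CharSequence-based API would make possible, here is a
minimal sketch of a CharSequence that pulls text from storage in chunks,
only when a chunk is actually touched. ChunkSource, LazyText and the chunk
size are hypothetical names invented for this example; nothing here is part
of Lucene, and the cache is deliberately naive (it never evicts).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage abstraction: returns the characters in [start, end). */
interface ChunkSource {
    int length();
    String read(int start, int end);
}

/** A CharSequence that fetches fixed-size chunks only when they are touched. */
final class LazyText implements CharSequence {
    private final ChunkSource source;
    private final int chunkSize;
    private final Map<Integer, String> loaded = new HashMap<>();

    LazyText(ChunkSource source, int chunkSize) {
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override
    public int length() {
        return source.length();
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException(String.valueOf(index));
        }
        int chunk = index / chunkSize;
        String data = loaded.get(chunk);
        if (data == null) {
            // First access to this chunk: read just this slice from storage.
            int start = chunk * chunkSize;
            data = source.read(start, Math.min(start + chunkSize, length()));
            loaded.put(chunk, data);
        }
        return data.charAt(index - chunk * chunkSize);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length() || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    @Override
    public String toString() {
        return subSequence(0, length()).toString();
    }

    /** Number of chunks fetched so far; exposed only to show the laziness. */
    int chunksLoaded() {
        return loaded.size();
    }
}
```

A consumer that only looks at the region around a match would then load
only the chunks covering that region, not the whole text.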

I notice that PostingsHighlighter has improved on this, by removing
the need for the text entirely. That's awesome, actually. We can't use
it. We're stuck on version 3.6.2 as we are expected to be able to open
indexes created in 2.x. Plus, all our existing indexes lack the
required level of indexing to use it, and reindexing is not yet an
option. (Even if we get lucky enough to update to Lucene 4, I will
probably have to write a codec to read Lucene 2 indexes...)


Issue #2 - The API returns all results as String.

To actually integrate a highlighter, the absolute offsets are the bare
minimum requirement to highlight the text:

    http://docs.oracle.com/javase/7/docs/api/javax/swing/text/Highlighter.html

But the highlighter API only returns results as String.

Even if there were enough information in the string (and I don't think
there is!), getting the results back as String is what I call the
"pseudo-API anti-pattern." I shouldn't have to parse values out of a
string which the API I'm calling just formatted into it. In this
particular instance, it would have been nice to have a way to
programmatically get the offset of the highlights in each fragment.
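
As a rough sketch of the kind of result being asked for here, the following
streams text from a Reader once and reports absolute character offsets for
each matching token, which is the shape of data a Swing Highlighter needs.
HighlightSpan and OffsetHighlighter are hypothetical names, and the
tokenisation rule (runs of letters or digits, lower-cased) is only a crude
stand-in for a real analyser; it is not the Lucene highlighter API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical result type: absolute [start, end) offsets of one hit. */
final class HighlightSpan {
    final int start;
    final int end;

    HighlightSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return "[" + start + "," + end + ")";
    }
}

final class OffsetHighlighter {
    /**
     * Reads the text once and returns the offsets of every token that is
     * contained in lowerCaseTerms. Offsets count UTF-16 code units from
     * the start of the stream.
     */
    static List<HighlightSpan> findSpans(Reader text, Set<String> lowerCaseTerms) {
        List<HighlightSpan> spans = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        int pos = 0;
        int tokenStart = 0;
        try {
            int c;
            while ((c = text.read()) != -1) {
                if (Character.isLetterOrDigit((char) c)) {
                    if (token.length() == 0) {
                        tokenStart = pos;
                    }
                    token.append((char) c);
                } else {
                    flush(token, tokenStart, lowerCaseTerms, spans);
                }
                pos++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(token, tokenStart, lowerCaseTerms, spans);
        return spans;
    }

    /** Emits a span if the pending token is a wanted term, then resets it. */
    private static void flush(StringBuilder token, int start,
                              Set<String> terms, List<HighlightSpan> out) {
        if (token.length() > 0) {
            if (terms.contains(token.toString().toLowerCase(Locale.ROOT))) {
                out.add(new HighlightSpan(start, start + token.length()));
            }
            token.setLength(0);
        }
    }
}
```

Each returned span could be handed straight to something like
javax.swing.text.Highlighter.addHighlight(start, end, painter), with no
string parsing in between.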


As for our own requirements, the bit about computing the fragments is
completely unnecessary. We have a piece of view-time logic which
figures out the fragments based on where the highlights are
vertically. This works better than using text proximity, because using
text proximity causes the number of highlighted lines to visibly
shuffle, whereas showing consistently the same number of lines above
and below produces an effect similar to resizing a text editor window,
which people should already be used to.
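
For readers wondering what line-based fragments might look like, here is a
small sketch that is only a guess at the idea, not the actual view code
described above: given the (sorted, zero-based) line numbers that contain
highlights, it returns the line ranges to display, with a fixed number of
context lines above and below and with adjacent ranges merged. LineWindows
and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LineWindows {
    /**
     * Returns merged, inclusive {firstLine, lastLine} ranges showing
     * `context` lines above and below each highlighted line.
     * highlightedLines must be sorted ascending and zero-based.
     */
    static List<int[]> windows(int[] highlightedLines, int context, int lineCount) {
        List<int[]> result = new ArrayList<>();
        for (int line : highlightedLines) {
            int first = Math.max(0, line - context);
            int last = Math.min(lineCount - 1, line + context);
            int[] prev = result.isEmpty() ? null : result.get(result.size() - 1);
            if (prev != null && first <= prev[1] + 1) {
                // Overlaps or touches the previous window: extend it.
                prev[1] = Math.max(prev[1], last);
            } else {
                result.add(new int[] { first, last });
            }
        }
        return result;
    }
}
```

Because every highlight gets the same amount of context, the windows depend
only on which lines contain hits, not on how close the hits are in the text.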

For getting the highlights themselves, is there any faster way than
reading the whole text every time you want to run it?

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
This will be of no immediate help, but in the next iteration of LUCENE-5317, which I'll post in a few weeks (if I can find the time), I'll have an option to pull concordance windows from character offsets which can be stored at index time (so you wouldn't have to re-analyze).  The current version of the non-committed patch relies on re-analysis. 

The basic strategy in LUCENE-5317 is to convert every query to a SpanQuery and then run getSpans on an index. 

This won't meet your needs for back-compat, and it also suffers from the "relying on a string" sin you mention.

You mention the first point, but you may also be interested in the second: 1) depending on the highlighter and its settings, make sure that you can highlight variants (fuzzy, wildcard, etc.) if you want to, and 2) make sure that you can highlight phrases (as opposed to terms that are pieces of a phrase but do not actually occur in a phrase).  It was a surprise to me when I first came to Lucene that neither was on by default nor handled by every highlighter, but it makes complete sense to me now.

On your point about text being very large: is there a way to break your text into smaller documents and still meet your users' expectations (breaking books into chapters, etc.)?  In the highlighting/concordance realm, I've found that Lucene is still fast enough for my needs on large texts, but that it is far faster on many small docs than on fewer large docs.

Best

    Tim


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 2/6/2014 12:53 AM, Earl Hood wrote:
> On Tue, Feb 4, 2014 at 6:05 PM, Michael Sokolov wrote:
>
>> Thanks for the feedback.  I think it's difficult to know what to do about
>> attribute value highlighting in the general case - do you have any
>> suggestions?
> That is a challenging one since one has to know how attribute data will
> be transformed for rendering purposes.
>
> I do not know the workings of Lux, so I cannot provide any specific
> suggestions on what Lux can do.  I would need time to dive into it.
>
> However, one solution is to work around the limitation by preprocessing
> the data into a form that is friendly to Lux (or at least the highlighter).
> For example, if I have attribute data I know will be transformed into
> renderable content, I would transform it into element-style content,
> which should be more friendly for indexing and highlighting purposes.
>
Lux's XmlHighlighter wraps matching text in an XML element tag.  The 
name of the tag is configurable.  But it won't work for attribute values 
since XML doesn't allow "<" in an attribute value.  I think Olivier's 
suggestion of providing a callback is interesting; that way we can 
provide the user much greater control, and the "highlighter" can 
actually become more of a query-driven document-processing engine: you 
could imagine fairly complex document transformations driven by Lucene 
query matching.
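
A minimal sketch of what such a callback could look like, in plain Java.
CallbackHighlighter and MatchHandler are hypothetical names made up for
this example and are not part of Lux or Lucene; the match offsets are
assumed to be sorted and non-overlapping, and to come from elsewhere.

```java
final class CallbackHighlighter {
    /** Hypothetical callback: receives the matched text, returns its replacement. */
    interface MatchHandler {
        String onMatch(String matchedText);
    }

    /**
     * Copies text to the output, passing each [start, end) span to the
     * handler and splicing in whatever the handler returns.
     */
    static String process(String text, int[][] spans, MatchHandler handler) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int[] span : spans) {
            out.append(text, pos, span[0]);
            out.append(handler.onMatch(text.substring(span[0], span[1])));
            pos = span[1];
        }
        out.append(text, pos, text.length());
        return out.toString();
    }
}
```

The caller decides what a match turns into: wrapping it in an element,
rewriting an attribute value, or leaving it untouched.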

I created http://issues.luxdb.org/browse/LUX-73 to track that.  If 
anybody is interested in continuing this discussion, I'd suggest picking 
it up over on Lux's mailing list at luxdb@luxdb.org since this seems a 
little off topic here.

-Mike


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Earl Hood <ea...@earlhood.com>.
On Tue, Feb 4, 2014 at 6:05 PM, Michael Sokolov wrote:

> Thanks for the feedback.  I think it's difficult to know what to do about
> attribute value highlighting in the general case - do you have any
> suggestions?

That is a challenging one since one has to know how attribute data will
be transformed for rendering purposes.

I do not know the workings of Lux, so I cannot provide any specific
suggestions on what Lux can do.  I would need time to dive into it.

However, one solution is to work around the limitation by preprocessing
the data into a form that is friendly to Lux (or at least the highlighter).
For example, if I have attribute data I know will be transformed into
renderable content, I would transform it into element-style content,
which should be more friendly for indexing and highlighting purposes.

--ewh


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Olivier Binda <ol...@wanadoo.fr>.
On 02/05/2014 01:05 AM, Michael Sokolov wrote:
> On 2/4/2014 2:50 PM, Earl Hood wrote:
>> On Tue, Feb 4, 2014 at 1:16 PM, Michael Sokolov wrote:
>>
>>> You might be interested in looking at Lux, which layers XML services 
>>> like
>>> XQuery on top of Lucene and Solr, and includes an XML-aware 
>>> highlighter:
>>> https://github.com/msokolov/lux/blob/master/src/main/java/lux/search/highlight/XmlHighlighter.java 
>>>
>> I am aware of Lux, but moving to use it would be a major redesign effort
>> for the project I am on, something that likely would not get management
>> approval.
>>
>> BTW, just within the scope of the class you cite, doing a quick look at
>> it, it looks like I may have to modify highlighting code behavior to
>> support how the project I am on transforms the XML data.  Example: we deal
>> with attribute data that gets transformed to render content in the HTML
>> served to the client, and the highlighting code cited does not appear to
>> handle XML attributes.
>>
>> There are other technical challenges also due to the nature of the
>> project.  There may be ways to deal with the challenges, but any further
>> analysis is not worth it if there is never any approval for me to pursue
>> a redesign for the project.
>>
> Thanks for the feedback.  I think it's difficult to know what to do 
> about attribute value highlighting in the general case - do you have 
> any suggestions?
>

Would it be possible/interesting to have an interface that lets the 
caller decide what to do with the attribute?

This has nothing to do with XML highlighting, but on Android I had to 
hack the highlighter class a bit (nothing comparable to what is 
suggested here) so that it directly produces an Android Spanned 
String (basically a String with spans attached to it) instead of 
a String with HTML content (which must then be parsed into an 
Android Spanned String). The Formatter class is nice, but maybe it 
could be improved a bit to be more flexible.


Olivier

> -Mike


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 2/4/2014 2:50 PM, Earl Hood wrote:
> On Tue, Feb 4, 2014 at 1:16 PM, Michael Sokolov wrote:
>
>> You might be interested in looking at Lux, which layers XML services like
>> XQuery on top of Lucene and Solr, and includes an XML-aware highlighter:
>> https://github.com/msokolov/lux/blob/master/src/main/java/lux/search/highlight/XmlHighlighter.java
> I am aware of Lux, but moving to use it would be a major redesign effort
> for the project I am on, something that likely would not get management
> approval.
>
> BTW, just within the scope of the class you cite, doing a quick look at
> it, it looks like I may have to modify highlighting code behavior to
> support how the project I am on transforms the XML data.  Example: we deal
> with attribute data that gets transformed to render content in the HTML
> served to the client, and the highlighting code cited does not appear to
> handle XML attributes.
>
> There are other technical challenges also due to the nature of the
> project.  There may be ways to deal with the challenges, but any further
> analysis is not worth it if there is never any approval for me to pursue
> a redesign for the project.
>
Thanks for the feedback.  I think it's difficult to know what to do 
about attribute value highlighting in the general case - do you have any 
suggestions?

-Mike


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Earl Hood <ea...@earlhood.com>.
On Tue, Feb 4, 2014 at 1:16 PM, Michael Sokolov wrote:

> You might be interested in looking at Lux, which layers XML services like
> XQuery on top of Lucene and Solr, and includes an XML-aware highlighter:
> https://github.com/msokolov/lux/blob/master/src/main/java/lux/search/highlight/XmlHighlighter.java

I am aware of Lux, but moving to use it would be a major redesign effort
for the project I am on, something that likely would not get management
approval.

BTW, just within the scope of the class you cite, doing a quick look at
it, it looks like I may have to modify highlighting code behavior to
support how the project I am on transforms the XML data.  Example: we deal
with attribute data that gets transformed to render content in the HTML
served to the client, and the highlighting code cited does not appear to
handle XML attributes.

There are other technical challenges also due to the nature of the
project.  There may be ways to deal with the challenges, but any further
analysis is not worth it if there is never any approval for me to pursue
a redesign for the project.

--ewh


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 2/4/14 12:16 PM, Earl Hood wrote:
> On Tue, Feb 4, 2014 at 12:20 AM, Trejkaz wrote:
>
>> I'm trying to find a precise and reasonably efficient way to highlight
>> all occurrences of terms in the query, only highlighting fields which
>> ...
>    [snip]
>
> I am in a similar situation with a web-based application, plus the
> content (XML) is dynamically transformed for purposes of rendering,
> making using the highlighter features of Lucene problematic from an
> integration perspective.
You might be interested in looking at Lux, which layers XML services 
like XQuery on top of Lucene and Solr, and includes an XML-aware 
highlighter: 
https://github.com/msokolov/lux/blob/master/src/main/java/lux/search/highlight/XmlHighlighter.java

-Mike


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Feb 5, 2014 at 4:16 AM, Earl Hood <ea...@earlhood.com> wrote:
> Our current solution is to do highlighting on the client-side.  When
> search happens, the search results from the server includes the parsed
> query terms so the client has an idea of which terms to highlight vs
> trying to reimplement a complete query string parser in the client.
>
> A problem is that Lucene (we are still on v3.0.3) does not provide a
> robust mechanism for extracting the terms of a query.  The following is
> the utility method that the server uses to get the terms needed to
> support client-side highlighting:
>
>   public static Set<Term> extractTermsFromQuery(
>       Query q,
>       IndexReader r,
>       Set<Term> terms
>   )
[ ... ]

This is very similar to what we're doing now, actually.

It does avoid the mess with having to double-parse the query, but the
catch is we still have to double-parse the text (and the text is
nearly always larger.)

There are still all the special cases for fuzzy queries, regex queries,
phrase queries and the like: having to dig inside queries to pull out
filters, and sometimes having to dig inside filters to pull out queries
(I had to modify the Lucene API here and there to make more of it
public, as I recall!)

I just thought it would be nice to be able to find all the matches,
pull just those bits of the text somehow and display them without
reading the rest of the text. At least, without reading the rest of
the text all the time. I think I would have to store something about
where the lines wrap in the database in order to really avoid reading
all the text. :/

TX


Re: Highlighting text, do I seriously have to reimplement this from scratch?

Posted by Earl Hood <ea...@earlhood.com>.
On Tue, Feb 4, 2014 at 12:20 AM, Trejkaz wrote:

> I'm trying to find a precise and reasonably efficient way to highlight
> all occurrences of terms in the query, only highlighting fields which
> match the corresponding fields used in the query. This seems like it
> would be a fairly common requirement in applications. We have an
> existing implementation, but it works by re-reading the entire text
> back through the analyser. This is slow for large text, and sometimes
> we analyse the same text twice - and both variants could well be in
> the query. So I'm looking for a shortcut.
  [snip]

I am in a similar situation with a web-based application, plus the
content (XML) is dynamically transformed for purposes of rendering,
making using the highlighter features of Lucene problematic from an
integration perspective.

Our current solution is to do highlighting on the client side.  When a
search happens, the search results from the server include the parsed
query terms so the client has an idea of which terms to highlight vs
trying to reimplement a complete query string parser in the client.

A problem is that Lucene (we are still on v3.0.3) does not provide a
robust mechanism for extracting the terms of a query.  The following is
the utility method that the server uses to get the terms needed to
support client-side highlighting:

  /**
   * Extract terms from a query.
   * <p><b>IMPLEMENTATION NOTE:</b> Lucene does not provide a robust,
   * single method for extracting the low-level terms of a query.
   * Experimentation has shown that some Query types'
   * {@link Query#extractTerms(Set)} methods do not work, or do
   * not work as desired.  Therefore, this method checks for specific
   * Query types to extract terms.
   * </p>
   * @param   q       Query to extract terms of.
   * @param   r       {@link IndexReader} that executed the query.
   * @param   terms   {@link Term} {@link Set set} to fill; if
   *                  <tt>null</tt>, a newly allocated set will be
   *                  returned.
   * @return  Set of terms.
   */
  public static Set<Term> extractTermsFromQuery(
      Query q,
      IndexReader r,
      Set<Term> terms
  ) {
    if (terms == null) terms = new HashSet<Term>();
    if (q instanceof TermQuery) {
      terms.add(((TermQuery)q).getTerm());

    } else if (q instanceof WildcardQuery) {
      terms.add(((WildcardQuery)q).getTerm());

    } else if (q instanceof PhraseQuery) {
      PhraseQuery pq = (PhraseQuery)q;
      String s = pq.toString(null);
      int i = s.indexOf('"');
      if (i == 0) {
        terms.add(new Term(FIELD_CONTENT,s));
      } else {
        terms.add(new Term(s.substring(0,i-1),s.substring(i)));
      }

    } else if (q instanceof MultiPhraseQuery) {
      ((MultiPhraseQuery)q).extractTerms(terms);

    } else if (q instanceof PrefixQuery) {
      Term t = ((PrefixQuery)q).getPrefix();
      terms.add(new Term(t.field(), t.text()+"*"));

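    } else if (q instanceof FuzzyQuery) {
      FuzzyQuery fq = (FuzzyQuery)q;
      try {
        // Recurse only when the rewrite succeeds; recursing on the
        // unrewritten FuzzyQuery after a failure would never terminate.
        extractTermsFromQuery(fq.rewrite(r),r,terms);
      } catch (Exception e) {
        log.warn("Error rewriting fuzzy query ["+fq+"]: "+e);
      }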

    } else if (q instanceof BooleanQuery) {
      for (BooleanClause clause : ((BooleanQuery)q).getClauses()) {
        if (clause.getOccur() != BooleanClause.Occur.MUST_NOT) {
          extractTermsFromQuery(clause.getQuery(),r,terms);
        }
      }

    } else {
      try {
        q.extractTerms(terms);
      } catch (Exception e) {
        log.warn("Caught exception trying to extract terms from query ["+
            q+"]: ", e);
      }
    }
    return terms;
  }

There is then client code that translates the extracted terms into regular
expressions for matching purposes when walking the DOM.  The terms
provided above can contain '*' and '?' characters, so the client code
transforms them into equivalent regex patterns.  Our XML->HTML transform
includes contextual information for some nodes so that highlighting can be
constrained if the query was limited to specific fields.
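
The wildcard-to-regex step can be sketched roughly as follows. This is not
the client code itself, only an illustration written in Java for
consistency with the method above; WildcardRegex is a hypothetical name.
It assumes '*' means "any run of characters" and '?' means "exactly one
character", that matching is case-insensitive, and that the pattern is
applied to one token at a time. Phrase terms (which carry quotes) would
need separate handling.

```java
import java.util.regex.Pattern;

final class WildcardRegex {
    /** Converts wildcard term text into a compiled, case-insensitive regex. */
    static Pattern toPattern(String wildcardText) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcardText.length(); i++) {
            char c = wildcardText.charAt(i);
            if (c == '*' || c == '?') {
                appendLiteral(regex, literal);
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        appendLiteral(regex, literal);
        return Pattern.compile(regex.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Quotes pending literal text so regex metacharacters match themselves. */
    private static void appendLiteral(StringBuilder regex, StringBuilder literal) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }
}
```

For example, WildcardRegex.toPattern("te?t*").matcher(token).matches()
would accept tokens such as "test" and "Texting" but not "tet".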

Not sure if any of this helps you,

--ewh
