Posted to solr-user@lucene.apache.org by ses <st...@ssims.co.uk> on 2013/08/14 09:53:17 UTC

PostingsHighlighter returning fields which don't match

We are trying out the new PostingsHighlighter with Solr 4.2.1 and finding
that the highlighting section of the response includes a self-closing tag for
every field in hl.fl (for edismax this defaults to all fields in qf), even
where there are no highlighting matches. In contrast, the same query on Solr
4.0.0 without PostingsHighlighter returns only the fields containing
highlighting matches.

Here is a simplified example of the highlighting response for a document
with no matches in the fields specified by hl.fl.

With PostingsHighlighter:
<response>
  ...
  <lst name="highlighting">
    <lst name="Z123456">
      <arr name="A1"/>
      <arr name="A2"/>
      <arr name="A3"/>
      ...
    </lst>
  </lst>
</response>

Without PostingsHighlighter:
<response>
  ...
  <lst name="highlighting">
    <lst name="Z123456"/>
  </lst>
</response>

This is a big problem for us: we have a large number of fields in a
dynamic field, and we believe every highlighted response that comes back
carries a very large number of self-closing tags, which bloats the
response to an unreasonable size (in some cases 100 MB+).

We have tried using hl.requireFieldMatch=true but this seems to make no
difference.

Is there anything we can specify in the query (or solrconfig) to avoid
returning these empty tags? Or could this be a known bug?

We are considering looking at the source and modifying PostingsHighlighter
or associated classes, so any pointers on where to look would also be handy.



--
View this message in context: http://lucene.472066.n3.nabble.com/PostingsHighlighter-returning-fields-which-don-t-match-tp4084495.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: PostingsHighlighter returning fields which don't match

Posted by Jack Krupansky <ja...@basetechnology.com>.
No, there is no option to disable that feature of the postings highlighter.

This code in PostingsSolrHighlighter.java:

protected NamedList<Object> encodeSnippets(String[] keys, String[] fieldNames,
    Map<String,String[]> snippets) {
  NamedList<Object> list = new SimpleOrderedMap<Object>();
  for (int i = 0; i < keys.length; i++) {
    NamedList<Object> summary = new SimpleOrderedMap<Object>();
    for (String field : fieldNames) {
      String snippet = snippets.get(field)[i];
      // box in an array to match the format of existing highlighters,
      // even though its always one element.
      if (snippet == null) {
        summary.add(field, new String[0]);
      } else {
        summary.add(field, new String[] { snippet });
      }
    }
    list.add(keys[i], summary);
  }
  return list;
}

It sounds like you want the "summary.add(field, new String[0]);" to be a 
no-op (ignore) instead.

It would be nice to have that as a parameter, like 
"hl.ignoreUnhighlightedFields". Or maybe "hl.returnEmptyHighlights" and 
default to false to match the other highlighters.

-- Jack Krupansky
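
A minimal sketch of that idea, not tested against Solr itself: the method
below mirrors encodeSnippets but takes a boolean that decides whether fields
without a snippet are emitted. The class name SnippetEncoderSketch and the
returnEmptyHighlights argument are hypothetical (the name is borrowed from
the parameter suggested above), and plain LinkedHashMap stands in for
NamedList/SimpleOrderedMap so the example runs without Solr on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; LinkedHashMap replaces Solr's
// NamedList/SimpleOrderedMap purely so this compiles without Solr.
class SnippetEncoderSketch {

  // returnEmptyHighlights is an assumed flag, not an existing Solr parameter.
  static Map<String, Map<String, String[]>> encodeSnippets(
      String[] keys, String[] fieldNames, Map<String, String[]> snippets,
      boolean returnEmptyHighlights) {
    Map<String, Map<String, String[]>> list = new LinkedHashMap<>();
    for (int i = 0; i < keys.length; i++) {
      Map<String, String[]> summary = new LinkedHashMap<>();
      for (String field : fieldNames) {
        String snippet = snippets.get(field)[i];
        if (snippet != null) {
          // box in an array to match the format of the other highlighters
          summary.put(field, new String[] { snippet });
        } else if (returnEmptyHighlights) {
          // current behaviour: emit an empty array for unmatched fields
          summary.put(field, new String[0]);
        }
        // otherwise the unmatched field is skipped entirely
      }
      list.put(keys[i], summary);
    }
    return list;
  }

  public static void main(String[] args) {
    Map<String, String[]> snippets = new LinkedHashMap<>();
    snippets.put("A1", new String[] { null, "a <b>match</b>" });
    snippets.put("A2", new String[] { null, null });
    Map<String, Map<String, String[]>> out = encodeSnippets(
        new String[] { "Z123456", "Z123457" },
        new String[] { "A1", "A2" }, snippets, false);
    for (Map.Entry<String, Map<String, String[]>> e : out.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().keySet());
    }
    // prints:
    // Z123456 -> []
    // Z123457 -> [A1]
  }
}
```

With the flag set to false, a document with no matches ends up with an empty
entry, which corresponds to the <lst name="Z123456"/> form shown for the
other highlighters.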


Re: PostingsHighlighter returning fields which don't match

Posted by ses <st...@ssims.co.uk>.
Thanks, we tried modifying the source as suggested but found in our case
PostingsHighlighter was returning no highlighting at all once we removed the
self-closing tags. I think perhaps we were not using it in the correct way.


Robert Muir wrote
> Do you want to open a JIRA issue to just change the behavior?

Yes, I think it would be useful to have it as an optional feature which can
be triggered by a parameter as suggested. This is how we implemented it, and
if it were returning highlighting we would happily contribute this back, but
as it stands it's not properly tested. I will create a JIRA ticket to cover
this desired functionality though.


Robert Muir wrote
> Unrelated: If your queries actually go against a large number of fields,
> I'm not sure how efficient this highlighter will be. That's because at some
> number of N fields, it will be much more efficient to use a
> document-oriented term vector approach (e.g. standard
> highlighter/fast-vector-highlighter).

Yes, unfortunately it is not any faster. Our original problem was
highlighting performance and in our case PostingsHighlighter is performing
similarly to the default highlighter.

We are now trying a solution which involves running one query to obtain the
field names in the N documents retrieved (where N=rows) and then a separate
query which specifies those fields in the 'hl.fl' parameter. This works on
the basis that those two separate queries run much faster than one query with
hl.fl=my_dynamic_field_*
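
The second step of that approach could be sketched as follows, assuming the
documents from the first query are already available as maps from field name
to value. HighlightFieldListSketch, buildHlFl and the sample field names are
hypothetical; the sketch only builds the hl.fl value (a comma-separated field
list) and does not talk to Solr.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: derives an explicit hl.fl value from the fields that
// actually occur in the retrieved documents, instead of using a wildcard.
class HighlightFieldListSketch {

  static String buildHlFl(List<Map<String, Object>> docs, String prefix) {
    // TreeSet removes duplicates and keeps the field names sorted
    Set<String> names = new TreeSet<>();
    for (Map<String, Object> doc : docs) {
      for (String name : doc.keySet()) {
        if (name.startsWith(prefix)) {
          names.add(name);
        }
      }
    }
    return String.join(",", names);
  }

  public static void main(String[] args) {
    Map<String, Object> doc1 = new HashMap<>();
    doc1.put("id", "Z123456");
    doc1.put("my_dynamic_field_a", "text");
    doc1.put("my_dynamic_field_c", "text");
    Map<String, Object> doc2 = new HashMap<>();
    doc2.put("id", "Z123457");
    doc2.put("my_dynamic_field_b", "text");
    List<Map<String, Object>> docs = Arrays.asList(doc1, doc2);
    System.out.println("hl.fl=" + buildHlFl(docs, "my_dynamic_field_"));
    // prints: hl.fl=my_dynamic_field_a,my_dynamic_field_b,my_dynamic_field_c
  }
}
```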

Thanks for your detailed responses.




Re: PostingsHighlighter returning fields which don't match

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Aug 14, 2013 at 3:53 AM, ses <st...@ssims.co.uk> wrote:

> We are trying out the new PostingsHighlighter with Solr 4.2.1 and finding
> that the highlighting section of the response includes self-closing tags
> for all the fields in hl.fl (by default for edismax it is all fields in qf)
> where there are no highlighting matches. In contrast the same query on Solr
> 4.0.0 without PostingsHighlighter it returns only the fields containing
> highlighting matches.
>
> here is a simplified example of the highlighting response for a document
> with no matches in the fields specified by hl.fl:
> with PostingsHighlighter:
> <response>
>   ...
>   <lst name="highlighting">
>     <lst name="Z123456">
>       <arr name="A1"/>
>       <arr name="A2"/>
>       <arr name="A3"/>
>       ...
>     </lst>
>   </lst>
> </response>
>
> without PostingsHighlighter:
> <response>
>   ...
>   <lst name="highlighting">
>     <lst name="Z123456"/>
>   </lst>
> </response>
>

Do you want to open a JIRA issue to just change the behavior?


> This is a big problem for us as we have a large number of fields in a
> dynamic field and we believe every time a highlighted response comes back
> it is sending us a very large number of self-closing tags which bloats the
> response to an unreasonable size (in some cases 100MB+).
>

Unrelated: If your queries actually go against a large number of fields,
I'm not sure how efficient this highlighter will be. That's because at some
number of N fields, it will be much more efficient to use a
document-oriented term vector approach (e.g. standard
highlighter/fast-vector-highlighter).