Posted to solr-user@lucene.apache.org by alexei <ac...@gmail.com> on 2011/05/20 22:55:26 UTC

return unaltered complete multivalued fields with Highlighted results

Hello,

I have been trying to return highlighted text in its original order, without
removing anything from it.
It seems that the highlighter sorts my text by score and skips the sections that
do not have a score.
I am sure I am missing something simple.
Has anyone had a similar issue?

I have tried different things, but nothing seems to work.
My config:
     <str name="hl">true</str>
     <str name="hl.snippets">1000</str>
     <str name="hl.fl">abstract</str>
     <str name="hl.fragmenter">regex</str>
     <str name="hl.mergeContiguous">true</str>
     <str name="hl.maxAnalyzedChars">104400</str>

     <str name="f.abstract.hl.fragsize">0</str>

     <str name="f.abstract.hl.simple.pre"></str>
     <str name="f.abstract.hl.simple.post"></str>
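
For anyone who passes these as request parameters instead of putting them in solrconfig.xml, here is a rough sketch that builds the equivalent query string. The class and method names are made up for illustration, and the parameter values are simply the ones from the config above. Note the spelling hl.mergeContiguous, which is the name Solr documents for that parameter.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr: turns the highlighting settings
// from the config above into a URL query string.
public class HighlightParamsSketch {

    // Same names and values as the <str> entries in the config, in the same order.
    static Map<String, String> highlightParams() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("hl", "true");
        p.put("hl.snippets", "1000");
        p.put("hl.fl", "abstract");
        p.put("hl.fragmenter", "regex");
        p.put("hl.mergeContiguous", "true");
        p.put("hl.maxAnalyzedChars", "104400");
        p.put("f.abstract.hl.fragsize", "0");   // per-field override for "abstract"
        p.put("f.abstract.hl.simple.pre", "");  // empty marker before a match
        p.put("f.abstract.hl.simple.post", ""); // empty marker after a match
        return p;
    }

    // URL-encodes each name and value and joins the pairs with '&'.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(highlightParams()));
    }
}
```

The resulting string would be appended to whatever query the request handler already receives; the handler path and the q parameter are left out because they depend on the setup.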

Regards,
Alexei

--
View this message in context: http://lucene.472066.n3.nabble.com/return-unaltered-complete-multivalued-fields-with-Highlighted-results-tp2967146p2967146.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: return unaltered complete multivalued fields with Highlighted results

Posted by Jonathan Rochkind <ro...@jhu.edu>.
I could use this feature too, and I encourage you to submit a patch in JIRA.

I wouldn't call the param "preserveOrder" though -- what it's really
doing is returning the entire field, with highlighting markers,
not just "preserving order" of fragments.  Not sure what to call it, but
not "preserveOrder".

On 6/2/2011 11:31 AM, alexei wrote:
> Hi,
>
> Here is the code for Solr 3.1 that will preserve all the text and will
> disable sorting.

Re: return unaltered complete multivalued fields with Highlighted results

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, I don't know a thing about the highlighter code, but if you make
a patch, create a JIRA (https://issues.apache.org/jira/browse/SOLR),
and attach it, it'll get "in the system".

I suspect you've seen this page, but just in case:
http://wiki.apache.org/solr/HowToContribute
See, especially, "Yonik's Law of patches" on that page...


Two questions:
1> After your changes, could you successfully run "ant test"?
2> Can you supply any unit tests that illustrate the correct behavior here?

Even if both answers are "no", it's still probably a good idea to submit the
patch.

That said, it might be a good idea to discuss this on the dev list
(dev@lucene.apache.org) before opening a JIRA; it's possible that
there's something similar in the works already...

Best
Erick


On Thu, Jun 2, 2011 at 11:31 AM, alexei <ac...@gmail.com> wrote:
> Hi,
>
> Here is the code for Solr 3.1 that will preserve all the text and will
> disable sorting.

Re: return unaltered complete multivalued fields with Highlighted results

Posted by alexei <ac...@gmail.com>.
Hi,

Here is the code for Solr 3.1 that will preserve all the text and will
disable sorting.

This goes in the solrconfig.xml request handler config, or whichever way you
pass params:
     <str name="hl.preserveOrder">true</str>

This line goes into the HighlightParams class:
  public static final String PRESERVE_ORDER = HIGHLIGHT + ".preserveOrder";

Replace the method DefaultSolrHighlighter.doHighlightingByHighlighter with the
following (I only added 3 if blocks):

  private void doHighlightingByHighlighter( Query query, SolrQueryRequest req,
      NamedList docSummaries, int docId, Document doc, String fieldName ) throws IOException {
    SolrParams params = req.getParams();
    String[] docTexts = doc.getValues(fieldName);
    // according to Document javadoc, doc.getValues() never returns null;
    // check empty instead of null
    if (docTexts.length == 0) return;

    SolrIndexSearcher searcher = req.getSearcher();
    IndexSchema schema = searcher.getSchema();
    TokenStream tstream = null;
    int numFragments = getMaxSnippets(fieldName, params);
    boolean mergeContiguousFragments = isMergeContiguousFragments(fieldName, params);

    String[] summaries = null;
    List<TextFragment> frags = new ArrayList<TextFragment>();

    // to be non-null iff we're using TermOffsets optimization
    TermOffsetsTokenStream tots = null;
    try {
      TokenStream tvStream =
          TokenSources.getTokenStream(searcher.getReader(), docId, fieldName);
      if (tvStream != null) {
        tots = new TermOffsetsTokenStream(tvStream);
      }
    }
    catch (IllegalArgumentException e) {
      // No problem. But we can't use TermOffsets optimization.
    }

    for (int j = 0; j < docTexts.length; j++) {
      if( tots != null ) {
        // if we're using TermOffsets optimization, then get the next
        // field value's TokenStream (i.e. get field j's TokenStream) from tots:
        tstream = tots.getMultiValuedTokenStream( docTexts[j].length() );
      } else {
        // fall back to analyzer
        tstream = createAnalyzerTStream(schema, fieldName, docTexts[j]);
      }

      Highlighter highlighter;
      if (Boolean.valueOf(req.getParams().get(HighlightParams.USE_PHRASE_HIGHLIGHTER, "true"))) {
        // TODO: this is not always necessary - eventually we would like to
        //       avoid this wrap when it is not needed.
        tstream = new CachingTokenFilter(tstream);

        // get highlighter
        highlighter = getPhraseHighlighter(query, fieldName, req,
            (CachingTokenFilter) tstream);

        // after highlighter initialization, reset tstream since
        // construction of highlighter already used it
        tstream.reset();
      }
      else {
        // use "the old way"
        highlighter = getHighlighter(query, fieldName, req);
      }

      int maxCharsToAnalyze = params.getFieldInt(fieldName,
          HighlightParams.MAX_CHARS,
          Highlighter.DEFAULT_MAX_CHARS_TO_ANALYZE);
      if (maxCharsToAnalyze < 0) {
        highlighter.setMaxDocCharsToAnalyze(docTexts[j].length());
      } else {
        highlighter.setMaxDocCharsToAnalyze(maxCharsToAnalyze);
      }

      try {
        TextFragment[] bestTextFragments =
            highlighter.getBestTextFragments(tstream, docTexts[j],
                mergeContiguousFragments, numFragments);
        for (int k = 0; k < bestTextFragments.length; k++) {
          // added if block 1: with preserveOrder, keep fragments that have a score of 0 too
          if (params.getBool( HighlightParams.PRESERVE_ORDER, false ) ) {
            if (bestTextFragments[k] != null) {
              frags.add(bestTextFragments[k]);
            }
          }
          else {
            if ((bestTextFragments[k] != null) && (bestTextFragments[k].getScore() > 0)) {
              frags.add(bestTextFragments[k]);
            }
          }
        }
      } catch (InvalidTokenOffsetsException e) {
        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
      }
    }
    // sort such that the fragments with the highest score come first
    // added if block 2: skip the sort when preserveOrder is set
    if (!params.getBool( HighlightParams.PRESERVE_ORDER, false ) ) {
      Collections.sort(frags, new Comparator<TextFragment>() {
        public int compare(TextFragment arg0, TextFragment arg1) {
          return Math.round(arg1.getScore() - arg0.getScore());
        }
      });
    }

    // convert fragments back into text
    // TODO: we can include score and position information in output as
    //       snippet attributes
    if (frags.size() > 0) {
      ArrayList<String> fragTexts = new ArrayList<String>();
      for (TextFragment fragment: frags) {
        // added if block 3: with preserveOrder, emit fragments that have a score of 0 too
        if (params.getBool( HighlightParams.PRESERVE_ORDER, false ) ) {
          if (fragment != null) {
            fragTexts.add(fragment.toString());
          }
          if (fragTexts.size() >= numFragments) break;
        } else {
          if ((fragment != null) && (fragment.getScore() > 0)) {
            fragTexts.add(fragment.toString());
          }
          if (fragTexts.size() >= numFragments) break;
        }
      }
      summaries = fragTexts.toArray(new String[0]);
      if (summaries.length > 0)
        docSummaries.add(fieldName, summaries);
    }
    // no summaries made, copy text from alternate field
    if (summaries == null || summaries.length == 0) {
      alternateField( docSummaries, params, doc, fieldName );
    }
  }
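
A side note on the sort step in the method above, for the case where preserveOrder is off: a comparator that returns Math.round(b - a) reports two scores as equal whenever they differ by less than 0.5, so closely scored fragments may stay in their incoming order. The small sketch below compares that style with Float.compare. The class name and the scores are made up for illustration and nothing here touches the Solr classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: plain floats stand in for fragment scores.
public class ScoreComparatorSketch {

    // Same style as the comparator in the method above: round the difference.
    static final Comparator<Float> ROUNDED = (a, b) -> Math.round(b - a);

    // Possible alternative: descending order via Float.compare, which
    // distinguishes any two different scores.
    static final Comparator<Float> EXACT = (a, b) -> Float.compare(b, a);

    // Returns a sorted copy; List.sort is stable, so ties keep their input order.
    static List<Float> sorted(List<Float> scores, Comparator<Float> cmp) {
        List<Float> copy = new ArrayList<>(scores);
        copy.sort(cmp);
        return copy;
    }

    public static void main(String[] args) {
        List<Float> scores = List.of(0.2f, 0.6f, 0.4f);
        System.out.println(sorted(scores, ROUNDED)); // differences all round to 0
        System.out.println(sorted(scores, EXACT));   // highest score first
    }
}
```

With these three made-up scores every pairwise difference rounds to 0, so the rounded comparator leaves the list as it was, while Float.compare puts 0.6 first. Whether this matters in practice depends on how far apart real fragment scores tend to be.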


This seems to work for my purposes. If nobody has any issues with this code
perhaps it should be a patch?

Thanks,
Alexei



Re: return unaltered complete multivalued fields with Highlighted results

Posted by lboutros <bo...@gmail.com>.
Hi Alexei,

We have the same issue/behavior.
The highlighting component splits the fields to highlight into fragments and
chooses the best ones to be returned and highlighted.
You can return all fragments with the maximum size for each one, but it will
never return fragments with a score of 0, that is, fragments in which none of
the query words were found.

To return the whole multivalued field, the highlighting component needs to
be modified for this specific case.
That is something we should do in the coming weeks.
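
To make the goal concrete, here is a minimal sketch of the output such a modification would aim for: every value of the field comes back, in order, with markers only around the matched words. It is a toy that uses a plain regex instead of Solr's analyzers and scoring, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration, not Solr code: whole-word, case-insensitive matching only.
public class WholeFieldHighlightSketch {

    // Returns one output string per input value, in the original order.
    // Values without a match are returned unchanged instead of being dropped.
    static List<String> highlightAll(List<String> values, List<String> terms,
                                     String pre, String post) {
        if (terms.isEmpty()) {
            return new ArrayList<>(values);
        }
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(term)); // treat each term literally
        }
        Pattern pattern = Pattern.compile("\\b(" + alternation + ")\\b",
            Pattern.CASE_INSENSITIVE);
        // Escape the markers so '$' or '\' in them is not read as a group reference.
        String replacement = Matcher.quoteReplacement(pre) + "$1"
            + Matcher.quoteReplacement(post);

        List<String> out = new ArrayList<>();
        for (String value : values) {
            out.add(pattern.matcher(value).replaceAll(replacement));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = List.of("first value", "second value with a match");
        System.out.println(highlightAll(values, List.of("match"), "<em>", "</em>"));
    }
}
```

A real change would of course have to go through the highlighter's own token streams so that stemming, synonyms and phrase queries behave the same as in search.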

If I missed something, I would be happy to find another solution too :)

Ludovic.

-----
Jouve
France.

Re: return unaltered complete multivalued fields with Highlighted results

Posted by alexei <ac...@gmail.com>.
Thank you for the reply, Erick.
I can return the stored content, but I would like to show the highlighted
results.
With multivalued fields there seems to be some sorting of highlighted results
(in order of importance?) going on.
The problems are:
1 - I could not find a way to keep the original order of my text.
2 - I could not display all of the values in my multivalued field.

So if I have a multivalued field with four values:
value1
value2 with text
value3 
value4 and something

and the search is: "value2 something"

the highlighted result would be:
value2 with text
value4 and something

value1 and value3 will be skipped completely. When a field is not
multivalued everything works as advertised.
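
The example above can be mimicked with a few lines of plain Java. This is only a model of the behavior as I observe it, not the Solr code: the Frag record and the scores (0 for values without a match, 1 for values with one) are assumptions made for illustration, and the sort by score is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Model of the observed behavior; names and scores are hypothetical.
public class FragmentFilterDemo {

    // Stand-in for a highlighter fragment: its text and a match score.
    record Frag(String text, float score) {}

    // The four field values, scored as if the search were "value2 something".
    static List<Frag> sample() {
        return List.of(
            new Frag("value1", 0f),
            new Frag("value2 with text", 1f),
            new Frag("value3", 0f),
            new Frag("value4 and something", 1f));
    }

    // keepUnscored = false drops fragments with a score of 0 (what I see today);
    // keepUnscored = true returns every value in its original order (what I want).
    static List<String> select(List<Frag> frags, boolean keepUnscored, int maxSnippets) {
        List<String> out = new ArrayList<>();
        for (Frag f : frags) {
            if (out.size() >= maxSnippets) break;
            if (keepUnscored || f.score() > 0) {
                out.add(f.text());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(select(sample(), false, 1000)); // value2 and value4 only
        System.out.println(select(sample(), true, 1000));  // all four values
    }
}
```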

Any suggestions? 

Regards,
Alexei


Re: return unaltered complete multivalued fields with Highlighted results

Posted by Erick Erickson <er...@gmail.com>.
Not quite sure I understand, but would just returning the stored field in
the <doc> work?

Best
Erick
On May 20, 2011 4:55 PM, "alexei" <ac...@gmail.com> wrote:
> Hello,
>
> I have been trying to return highlighted text in original order and without
> removing anything from it.
> Seems that highlighter sorts my text by score and skips the sections that do
> not have a score.