Posted to dev@lucene.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2015/08/14 16:04:46 UTC

[jira] [Commented] (SOLR-7926) Hit highlighting with EdgeNGramFilterFactory

    [ https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697053#comment-14697053 ] 

Jan Høydahl commented on SOLR-7926:
-----------------------------------

Hi. 

This kind of question is better suited to the solr-user mailing list. Most likely this is not a bug. Please ask on that list, and say which highlighter implementation you use, with what configuration, and why you expect it to behave the way you want (with a reference to the documentation). I'll close this JIRA issue as "Invalid".

If it turns out to be a suspected bug, or you find that the result you want cannot easily be configured with any of the existing highlighter implementations, then please re-open.

> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting highlights the whole word, not just the part that matches the search term, when EdgeNGramFilterFactory is used in the field type.
> In schema.xml I have field type text_ngram:
>                 <fieldType name="text_ngram" class="solr.TextField">
>                         <analyzer type="index">
>                                 <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>                                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                            <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>                                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                                 <filter class="solr.LowerCaseFilterFactory"/>
>                                 <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>                                 <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>                         </analyzer>
>                         <analyzer type="query">
>                                 <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>                                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>                                 <filter class="solr.LowerCaseFilterFactory"/>
>                                 <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>                                 <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>                         </analyzer>
>                 </fieldType>
> In the Solr Admin Analysis screen, with index value "lucene" and query value "luc", it shows this:
> LENGTF text             luc         luce            lucen               lucene
>        raw_bytes        [6c 75 63]  [6c 75 63 65]   [6c 75 63 65 6e]    [6c 75 63 65 6e 65]
>        start            0           0               0                   0
>        end              6           6               6                   6   
>        positionLength   1           1               1                   1    
>        type             word        word            word                word
>        position         1           1               1                   1    
> Since the end offset is 6 for every gram, the whole word ("lucene") is highlighted.
> 	
> If I change to use NGramFilterFactory it shows me this (for the first three items):
> LENGTF text             luc         uce         cen
>        raw_bytes        [6c 75 63]  [75 63 65]  [63 65 6e]
>        start            0           1           2
>        end              3           4           5
>        positionLength   1           1           1
>        type             word        word        word
>        position         1           1           1
> The end offset is correct in that case and the highlighter highlights only the search term. Note that I have specified luceneMatchVersion="4.3". Without it, the end offset goes back to 6 for NGramFilterFactory as well.
> 	
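The behaviour described in the report can be illustrated with a small, self-contained Python sketch. It is a simplified model and not Lucene or Solr code: the function names `edge_ngrams` and `highlight` and the `per_gram_offsets` flag are hypothetical, and the only assumption taken from the report is that the highlighter marks the span between a matching token's start and end offsets. With whole-token offsets every gram of "lucene" carries offsets 0 to 6, so a match on "luc" highlights the full word; with per-gram offsets only the matched prefix is highlighted.

```python
def edge_ngrams(token, start, min_gram=3, max_gram=20, per_gram_offsets=False):
    """Return (text, start_offset, end_offset) for each leading-edge gram.

    per_gram_offsets=False models grams that inherit the offsets of the
    whole source token (the first table in the report).
    per_gram_offsets=True models grams whose end offset is start + gram length.
    """
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        end = start + size if per_gram_offsets else start + len(token)
        grams.append((token[:size], start, end))
    return grams


def highlight(text, grams, query, pre="<em>", post="</em>"):
    """Wrap the offset span of the first gram equal to `query` in markers."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text  # no matching gram: return the text unchanged


text = "lucene"
whole = edge_ngrams(text, 0)                           # offsets 0..6 for every gram
partial = edge_ngrams(text, 0, per_gram_offsets=True)  # offsets 0..3, 0..4, ...

print(highlight(text, whole, "luc"))    # <em>lucene</em>
print(highlight(text, partial, "luc"))  # <em>luc</em>ene
```

The sketch only models offsets. Which offsets a real analysis chain emits depends on the filter and on luceneMatchVersion, as the two tables in the report show.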


