Posted to solr-user@lucene.apache.org by Bjørn Hjelle <bj...@gmail.com> on 2014/12/19 15:26:49 UTC
(Edge)NGramFilterFactory and highlight
Hi,
Based on this example:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
I have previously implemented highlighting of terms in
(Edge)NGram-analyzed fields successfully.
In a new project with Solr 4.10.2, however, it does not work.
In the Solr admin analysis page I see the following in Solr 4.10.2 (simplified):
ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    4  4   4    4
But if I change luceneMatchVersion to LUCENE_43 in solrconfig.xml and
reload the analysis page, I get this:
ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    1  2   3    4
So in 4.10.2 every gram is given the end offset of the full token
rather than its own, and the highlighter therefore highlights the
complete word ("test" in this case).
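To illustrate why the end offset matters for highlighting, here is a minimal Java sketch. It does not use Solr's highlighter: the class and method names (OffsetHighlightSketch, highlight) are made up for illustration, and it only assumes that a highlighter wraps the stored text between a matching token's start and end offsets.

```java
// Hypothetical illustration only; this is not Solr's highlighter.
final class OffsetHighlightSketch {

    /** Wraps text[start, end) in <em> tags, as an offset-based highlighter would. */
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String stored = "Test";
        // Gram "te" carrying its own end offset (the LUCENE_43 table).
        System.out.println(highlight(stored, 0, 2)); // prints <em>Te</em>st
        // Gram "te" carrying the full token's end offset (the 4.10.2 table).
        System.out.println(highlight(stored, 0, 4)); // prints <em>Test</em>
    }
}
```

With the 4.10.2 offsets, a query for "te" matches the gram but the highlighted span covers the whole word.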
To reproduce this:
1. Download Solr 4.10.2.
2. In the collection1 schema.xml, add this field type:
<fieldType name="autocomplete_ngram" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
            maxGramSize="20" minGramSize="1"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>
3. Start Solr and, in the analysis page, enter "Test" in the
Field Value (Index) field and check the output.
4. Then change luceneMatchVersion in solrconfig.xml to this:
<luceneMatchVersion>LUCENE_43</luceneMatchVersion>
5. Reload the core and reload the analysis page.
6. You will now see that the end offsets are correct.
Any ideas on how to make this work with Solr 4.10.2 without resorting
to changing the Lucene version in solrconfig.xml?
Thanks,
Bjørn
Re: (Edge)NGramFilterFactory and highlight
Posted by Bjørn Hjelle <bj...@gmail.com>.
Mingchun,
yes, that is better, and it works fine.
Thank you!
Bjørn
On Sat, Dec 20, 2014 at 1:26 PM, Mingchun Zhao
<mi...@gmail.com> wrote:
> Hi Bjørn,
>
> As of Solr 4.4, the end-offset behavior of EdgeNGramFilterFactory
> changed because of the following issue:
> https://issues.apache.org/jira/browse/LUCENE-3907
> The relevant source code in that patch is as follows:
> ==
> +    if (version.onOrAfter(Version.LUCENE_44)) {
> +      // Never update offsets
> +      updateOffsets = false;
> +    } else {
> +      // if length by start + end offsets doesn't match the term text then assume
> +      // this is a synonym and don't adjust the offsets.
> +      updateOffsets = (tokStart + curTermLength) == tokEnd;
> +    }
> ==
>
> It seems there is no property for restoring the previous offset
> behavior of LUCENE_43.
> Therefore, you may have to set luceneMatchVersion, as you mentioned.
> However, it would be better to apply luceneMatchVersion only to the
> EdgeNGramFilterFactory, as below:
> ==
> <filter class="solr.EdgeNGramFilterFactory"
>         maxGramSize="20" minGramSize="1" luceneMatchVersion="4.3"/>
> ==
> Setting <luceneMatchVersion>LUCENE_43</luceneMatchVersion> in
> solrconfig.xml would also affect other configurations.
>
> Regards,
> Mingchun
Re: (Edge)NGramFilterFactory and highlight
Posted by Mingchun Zhao <mi...@gmail.com>.
Hi Bjørn,
As of Solr 4.4, the end-offset behavior of EdgeNGramFilterFactory
changed because of the following issue:
https://issues.apache.org/jira/browse/LUCENE-3907
The relevant source code in that patch is as follows:
==
+    if (version.onOrAfter(Version.LUCENE_44)) {
+      // Never update offsets
+      updateOffsets = false;
+    } else {
+      // if length by start + end offsets doesn't match the term text then assume
+      // this is a synonym and don't adjust the offsets.
+      updateOffsets = (tokStart + curTermLength) == tokEnd;
+    }
==
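The effect of that branch on the reported offsets can be sketched in plain Java without any Lucene dependency. This is only an approximation under stated assumptions: the class EdgeGramOffsetSketch and its offsets method are hypothetical, it models front-edge grams only, and it borrows the names tokStart, tokEnd and updateOffsets from the patch excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the branch above; not the Lucene implementation.
final class EdgeGramOffsetSketch {

    /** Returns a {start, end} offset pair for each front-edge gram of term. */
    static List<int[]> offsets(String term, int tokStart, int tokEnd,
                               int minGram, int maxGram, boolean onOrAfter44) {
        // From 4.4 on, offsets are never updated; before that, they are updated
        // only when the term length agrees with the token's offsets.
        boolean updateOffsets = !onOrAfter44 && (tokStart + term.length()) == tokEnd;
        List<int[]> result = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            int end = updateOffsets ? tokStart + len : tokEnd;
            result.add(new int[] {tokStart, end});
        }
        return result;
    }

    public static void main(String[] args) {
        for (boolean onOrAfter44 : new boolean[] {false, true}) {
            StringBuilder ends = new StringBuilder();
            for (int[] o : offsets("test", 0, 4, 1, 20, onOrAfter44)) {
                ends.append(o[1]).append(' ');
            }
            System.out.println((onOrAfter44 ? "4.4+: " : "4.3 : ") + ends.toString().trim());
        }
    }
}
```

For the token "test" this yields end offsets 1 2 3 4 with the pre-4.4 branch and 4 4 4 4 otherwise, which matches the two analysis-page outputs reported in the original question.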
It seems there is no property for restoring the previous offset
behavior of LUCENE_43.
Therefore, you may have to set luceneMatchVersion, as you mentioned.
However, it would be better to apply luceneMatchVersion only to the
EdgeNGramFilterFactory, as below:
==
<filter class="solr.EdgeNGramFilterFactory"
maxGramSize="20" minGramSize="1" luceneMatchVersion="4.3"/>
==
Setting <luceneMatchVersion>LUCENE_43</luceneMatchVersion> in
solrconfig.xml would also affect other configurations.
Regards,
Mingchun