Posted to solr-user@lucene.apache.org by Jérôme Bernardes <je...@mappy.com> on 2015/10/05 10:57:47 UTC
Highlight with NGram and German S Sharp "ß"
Dear Solr Users,
I am facing a problem with highlighting on n-gram fields.
Highlighting works well, except for words containing the German character
"ß".
For example, with q=rosen& the response contains:
"highlighting": {
  "gcl3r:12723710:6643": {
    "textng": [
      "<em>Rosen</em>steinpark (Métro), Stuttgart (Allemagne)"
    ]
  },
  "gcl3r:2267495:780930": {
    "textng": [
      "<em>Rosenstraße</em>, 94554 Moos (Allemagne)"
    ]
  }
}
Without "ß", words are highlighted partially (<em>Rosen</em>steinpark), but
with "ß" the whole word is highlighted (<em>Rosenstraße</em>).
-------------
The character "ß" is mapped to "ss" at both query and index time (using
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>).
Here is the schema.xml field type used by the highlighted field:
<fieldType name="autocomplete_ngram" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;: \-\']"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"
            types="wdfftypes.txt"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonym.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20"
            minGramSize="1"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d \*&æøåÆØÅ ])" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;: \-\']"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0"
            generateWordParts="1"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"
            types="wdfftypes.txt"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d \*&æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>
Is this a problem in our configuration, or a known bug?
Regards
Jérôme
Re: Highlight with NGram and German S Sharp "ß"
Posted by Scott Stults <ss...@opensourceconnections.com>.
Yep, I misunderstood the problem.
The multiple tokens at the same offset might be messing things up. One
thing you can do is copyField to a field that doesn't have n-grams and set
something like f.textng.hl.alternateField= in your solrconfig. That'll use
the other field during highlighting. The trade-off is that it'll increase
your index size on disk.
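
As a rough sketch of that suggestion, the schema.xml and solrconfig.xml pieces might look like the fragment below. The names text_plain and autocomplete_plain are made up for illustration, and the textng field definition, the handler name and the choice of tokenizer are assumptions, since none of them are shown in this thread.

```xml
<!-- schema.xml (sketch): a companion field type without the n-gram filter.
     "autocomplete_plain" and "text_plain" are hypothetical names. -->
<fieldType name="autocomplete_plain" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed definition of the existing highlighted field. -->
<field name="textng" type="autocomplete_ngram" indexed="true" stored="true"/>
<!-- The alternate field must be stored so its text can be returned. -->
<field name="text_plain" type="autocomplete_plain" indexed="true" stored="true"/>
<!-- copyField copies the raw input value, before any analysis runs. -->
<copyField source="textng" dest="text_plain"/>

<!-- solrconfig.xml (sketch): per-field alternate for highlighting. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">textng</str>
    <str name="f.textng.hl.alternateField">text_plain</str>
  </lst>
</requestHandler>
```

Note that the Solr documentation describes hl.alternateField as the field returned when no snippet can be generated for the requested field, so it is worth checking on a test core whether this actually changes the snippets for textng. A field without n-grams will also likely not mark a prefix such as "rosen" inside a longer word.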
On Fri, Oct 16, 2015 at 10:07 AM, Jérôme Bernardes <
jerome.bernardes@mappy.com> wrote:
> Thanks for your reply, Scott.
>
> I tried
>
> bs.language=de&bs.country=de
>
> Unfortunately the problem still occurs.
> I have just discovered that the problem affects not only "ß" but also
> "æ" (which is mapped to "ae"
> at query and index time):
> q=hae --> <em>hæna</em>
> So it seems to me that the problem is related to any single character that
> is mapped to several characters using <charFilter
> class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>
> Jérôme
--
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com
Re: Highlight with NGram and German S Sharp "ß"
Posted by Jérôme Bernardes <je...@mappy.com>.
Thanks for your reply, Scott.
I tried
bs.language=de&bs.country=de
Unfortunately the problem still occurs.
I have just discovered that the problem affects not only "ß" but
also "æ" (which is mapped to "ae"
at query and index time):
q=hae --> <em>hæna</em>
So it seems to me that the problem is related to any single character
that is mapped to several characters using <charFilter
class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
Jérôme
On 13/10/2015 07:46, Scott Stults wrote:
> My guess is that the boundary scanner isn't configured right for your
> highlighter. Try setting the bs.language and bs.country parameters either
> in your request or in the requestHandler.
>
>
> k/r,
> Scott
Re: Highlight with NGram and German S Sharp "ß"
Posted by Scott Stults <ss...@opensourceconnections.com>.
My guess is that the boundary scanner isn't configured right for your
highlighter. Try setting the bs.language and bs.country parameters either
in your request or in the requestHandler.
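
For reference, here is a sketch of how those parameters could be set as request handler defaults in solrconfig.xml. In the Solr reference guide the full parameter names carry the hl. prefix (hl.bs.language, hl.bs.country), and they are documented for the FastVectorHighlighter with a break-iterator boundary scanner. The sketch assumes the field is indexed with term vectors, positions and offsets, and that a boundary scanner named breakIterator is defined as in the stock example solrconfig.xml. The handler name and the hl.fl value are assumptions too.

```xml
<!-- solrconfig.xml (sketch): boundary scanner settings as handler defaults.
     Assumes textng has termVectors, termPositions and termOffsets enabled. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">textng</str>
    <!-- The hl.bs.* parameters apply to the FastVectorHighlighter. -->
    <str name="hl.useFastVectorHighlighter">true</str>
    <!-- Refers to a <boundaryScanner name="breakIterator" .../> defined
         in the highlighting search component. -->
    <str name="hl.boundaryScanner">breakIterator</str>
    <str name="hl.bs.type">WORD</str>
    <str name="hl.bs.language">de</str>
    <str name="hl.bs.country">DE</str>
  </lst>
</requestHandler>
```

The same parameters can be passed on the request instead, for example hl.bs.language=de&hl.bs.country=DE.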
k/r,
Scott