Posted to dev@lucene.apache.org by "Gunnlaugur Thor Briem (JIRA)" <ji...@apache.org> on 2013/05/23 02:45:20 UTC
[jira] [Updated] (SOLR-4851) Highlighter duplicates numeric token in snippet when term vectors/positions/offsets on
[ https://issues.apache.org/jira/browse/SOLR-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gunnlaugur Thor Briem updated SOLR-4851:
----------------------------------------
Description:
With original text {{Population 5.000 - 9.999}} indexed with {{termVectors}}, {{termPositions}} and {{termOffsets}}, the Highlighter produces snippets like {{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}. Note the duplicated {{5}} before the {{<em}}; that's the bug.
This does not happen when {{useFastVectorHighlighter=true}}.
It also does not happen in a field without {{termVectors}}, {{termPositions}} and {{termOffsets}}.
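A minimal sketch of how the two highlighter code paths could be requested side by side for comparison. The base URL and the {{highlight_url}} helper are hypothetical, and the switch is spelled {{hl.useFastVectorHighlighter}} on the assumption that the request parameter carries the usual {{hl.}} prefix. The {{defType}} and {{qf}} settings from the report are assumed to come from the request handler configuration and are omitted here.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr endpoint; adjust host, port and core to the local setup.
BASE_URL = "http://localhost:8983/solr/select"


def highlight_url(query, fields, use_fvh=False):
    """Build a highlighting request URL, optionally enabling FastVectorHighlighter."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": " ".join(fields),
        "wt": "json",
    }
    if use_fvh:
        # Assumed parameter spelling (with the hl. prefix).
        params["hl.useFastVectorHighlighter"] = "true"
    return BASE_URL + "?" + urlencode(params)


# One URL per code path, for the same query and fields.
default_url = highlight_url("5000", ["name", "descr"])
fvh_url = highlight_url("5000", ["name", "descr"], use_fvh=True)
print(default_url)
print(fvh_url)
```

Fetching both URLs against the same index and diffing the {{descr}} snippets should show whether the duplication is specific to the default Highlighter.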
To reproduce, field definitions:
{code:xml}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="name" type="text" indexed="true" stored="true" />
<field name="descr" type="text" indexed="true" stored="true" termVectors="true" termOffsets="true" termPositions="true" />
{code}
All configured and explicit parameters, from {{echoParams=all}}:
{code:javascript}
{
"defType": "edismax",
"echoParams": "all",
"facet.mincount": "1",
"fl": "id",
"hl.fl": "id name tag cat descr dim dimvalue provider source_source text",
"hl.fragsize": "200",
"hl.mergeContiguous": "true",
"hl.simple.post": "</em>",
"hl.simple.pre": "<em class=\"match\">",
"hl.snippets": "4",
"hl.usePhraseHighlighter": "true",
"hl": "true",
"q.alt": "*:*",
"q": "5000",
"qf": " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2 provider^2 source_source^2 text^2 ",
"qt": "dismax",
"rows": "10",
"sort": "score desc"
}
{code}
and a document containing numbers with thousands separators, e.g.:
{code:javascript}
{
"name": "Demographics and income: Income distribution: Number of HHs earning > US$5,000 p.a. (constant 2005 prices) by country",
"descr": "Number of households with disposable income of more than US$5,000 per annum at constant 2005 prices"
}
{code}
The highlight snippets I get:
{code:javascript}
{
"name": [
"Demographics and income: Income distribution: Number of HHs earning > US$<em class=\"match\">5,000</em> p.a. (constant 2005 prices) by country"
],
"descr": [
"Number of households with disposable income of more than US$5<em class=\"match\">5,000</em> per annum at constant 2005 prices"
]
]
}
{code}
Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in the {{name}} field snippet. The only difference between these fields is {{termVectors}}, {{termPositions}} and {{termOffsets}}, so those settings are presumably relevant.
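The duplication can also be checked mechanically: with the highlight markup removed, a correct snippet should be a substring of the stored field value. The sketch below is illustrative only; {{strip_highlight}} and {{snippet_is_faithful}} are hypothetical helpers (not part of Solr), and the strings are copied from the example document and snippets in this report.

```python
# Highlight markup as configured via hl.simple.pre / hl.simple.post in this report.
PRE = '<em class="match">'
POST = '</em>'


def strip_highlight(snippet, pre=PRE, post=POST):
    """Remove the highlight markers so the snippet can be compared to stored text."""
    return snippet.replace(pre, "").replace(post, "")


def snippet_is_faithful(snippet, stored_text):
    """True if the snippet, minus markup, occurs verbatim in the stored field value."""
    return strip_highlight(snippet) in stored_text


stored_name = ("Demographics and income: Income distribution: Number of HHs "
               "earning > US$5,000 p.a. (constant 2005 prices) by country")
stored_descr = ("Number of households with disposable income of more than "
                "US$5,000 per annum at constant 2005 prices")

snippet_name = ("Demographics and income: Income distribution: Number of HHs "
                'earning > US$<em class="match">5,000</em> p.a. '
                "(constant 2005 prices) by country")
snippet_descr = ("Number of households with disposable income of more than "
                 'US$5<em class="match">5,000</em> per annum at constant 2005 prices')

# Report, per field, whether the snippet still matches the stored text.
for field, snippet, stored in [("name", snippet_name, stored_name),
                               ("descr", snippet_descr, stored_descr)]:
    print(field, "ok" if snippet_is_faithful(snippet, stored) else "CORRUPTED")
```

A check of this kind flags only the {{descr}} snippet, because stripping the markup there leaves {{US$55,000}} where the stored text has {{US$5,000}}.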
was:
With original text {{Population 5.000 - 9.999}} indexed with termVectors, termPositions and termOffsets, the Highlighter produces snippets like {{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}. Note the duplicated {{5}} before the {{<em}}; that's the bug.
This does not happen when {{useFastVectorHighlighter=true}}.
It also does not happen in a field without termVectors, termPositions and termOffsets.
To reproduce, field definitions:
{code:xml}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="name" type="text" indexed="true" stored="true" />
<field name="descr" type="text" indexed="true" stored="true" termVectors="true" termOffsets="true" termPositions="true" />
{code}
All configured and explicit parameters, from {{echoParams=all}}:
{code:javascript}
{
defType: "edismax",
echoParams: "all",
facet.mincount: "1",
fl: "id",
hl.fl: "id name tag cat descr dim dimvalue provider source_source text",
hl.fragsize: "200",
hl.mergeContiguous: "true",
hl.simple.post: "</em>",
hl.simple.pre: "<em class="match">",
hl.snippets: "4",
hl.usePhraseHighlighter: "true",
hl: "true",
q.alt: "*:*",
q: "5000",
qf: " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2 provider^2 source_source^2 text^2 ",
qt: "dismax",
rows: "10",
sort: "score desc"
}
{code}
and a piece of text containing numbers with thousand separators, e.g. “Demographics and income: Income distribution: Number of HHs earning > US$5,000 p.a. (constant 2005 prices) by country”
The highlighting response I get:
{code:javascript}
{
name: [
"Demographics and income: Income distribution: Number of HHs earning > US$<em class="match">5,000</em> p.a. (constant 2005 prices) by country"
],
descr: [
"Number of households with disposable income of more than US$5<em class="match">5,000</em> per annum at constant 2005 prices"
]
}
{code}
Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in the {{name}} field snippet. The only difference between these fields is termVectors, termPositions and termOffsets, so those settings are presumably relevant.
> Highlighter duplicates numeric token in snippet when term vectors/positions/offsets on
> --------------------------------------------------------------------------------------
>
> Key: SOLR-4851
> URL: https://issues.apache.org/jira/browse/SOLR-4851
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Affects Versions: 3.6.2
> Reporter: Gunnlaugur Thor Briem