Posted to dev@lucene.apache.org by "Gunnlaugur Thor Briem (JIRA)" <ji...@apache.org> on 2013/05/23 02:45:20 UTC

[jira] [Updated] (SOLR-4851) Highlighter duplicates numeric token in snippet when term vectors/positions/offsets on

     [ https://issues.apache.org/jira/browse/SOLR-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunnlaugur Thor Briem updated SOLR-4851:
----------------------------------------

    Description: 
With original text {{Population 5.000 - 9.999}} indexed with {{termVectors}}, {{termPositions}} and {{termOffsets}}, the Highlighter produces snippets like {{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}. Note the duplicated {{5}} before the {{<em}}; that's the bug.

This does not happen with the FastVectorHighlighter ({{hl.useFastVectorHighlighter=true}}).

It also does not happen in a field without {{termVectors}}, {{termPositions}} and {{termOffsets}}.
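
A quick way to detect the symptom mechanically is to strip the highlight tags from a snippet and check that what remains still occurs verbatim in the original text. The sketch below is only an illustration of that check; the helper names {{strip_highlight}} and {{is_faithful}} are hypothetical, and it assumes the snippet is not HTML-encoded apart from the highlight tags themselves.

```python
import re

# Matches the opening and closing highlight tags used in this report
# (<em class="match"> ... </em>); any other markup is left untouched.
HL_TAG = re.compile(r"</?em\b[^>]*>")


def strip_highlight(snippet: str) -> str:
    """Remove the highlight tags, keeping only the text the highlighter emitted."""
    return HL_TAG.sub("", snippet)


def is_faithful(snippet: str, original: str) -> bool:
    """True if the de-tagged snippet occurs verbatim in the original text."""
    return strip_highlight(snippet) in original


original = "Population 5.000 - 9.999"
buggy = 'Population 5<em class="match">5.000</em> - 9.999'
expected = 'Population <em class="match">5.000</em> - 9.999'

print(is_faithful(buggy, original))     # False: de-tagged text is "Population 55.000 - 9.999"
print(is_faithful(expected, original))  # True
```

If the response is HTML-encoded (for example {{&gt;}} in place of {{>}}), the snippet would need to be decoded before the comparison.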

To reproduce, field definitions:

{code:xml}
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    ...

    <field name="name" type="text" indexed="true" stored="true" />
    <field name="descr" type="text" indexed="true" stored="true" termVectors="true" termOffsets="true" termPositions="true" />
{code}

All configured and explicit parameters, from {{echoParams=all}}:

{code:javascript}
{
"defType": "edismax",
"echoParams": "all",
"facet.mincount": "1",
"fl": "id",
"hl.fl": "id name tag cat descr dim dimvalue provider source_source text",
"hl.fragsize": "200",
"hl.mergeContiguous": "true",
"hl.simple.post": "</em>",
"hl.simple.pre": "<em class=\"match\">",
"hl.snippets": "4",
"hl.usePhraseHighlighter": "true",
"hl": "true",
"q.alt": "*:*",
"q": "5000",
"qf": " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2 provider^2 source_source^2 text^2 ",
"qt": "dismax",
"rows": "10",
"sort": "score desc"
}
{code}

and a document containing numbers with thousand separators, e.g.:

{code:javascript}
{
"name": "Demographics and income: Income distribution: Number of HHs earning > US$5,000 p.a. (constant 2005 prices) by country",
"descr": "Number of households with disposable income of more than US$5,000 per annum at constant 2005 prices"
}
{code}
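
For anyone wanting to replay this, the request can be assembled roughly as follows. This is a sketch under assumptions: the base URL (host, port, handler path) is a placeholder that is not part of this report, {{qf}} and {{hl.fl}} are shortened to the two fields that matter here, and {{wt=json}} is added only to make the response easy to inspect. The second URL adds {{hl.useFastVectorHighlighter}}, which I take to be the Solr request parameter behind the FastVectorHighlighter comparison mentioned above.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, adjust to the actual Solr instance.
BASE_URL = "http://localhost:8983/solr/select"

# Subset of the parameters listed above, reduced to the two relevant fields.
PARAMS = {
    "defType": "edismax",
    "q": "5000",
    "qf": "name^6 descr^3",
    "fl": "id",
    "hl": "true",
    "hl.fl": "name descr",
    "hl.fragsize": "200",
    "hl.snippets": "4",
    "hl.mergeContiguous": "true",
    "hl.usePhraseHighlighter": "true",
    "hl.simple.pre": '<em class="match">',
    "hl.simple.post": "</em>",
    "wt": "json",
}


def build_url(extra=None):
    """Return the request URL, with optional extra parameters merged in."""
    merged = dict(PARAMS)
    merged.update(extra or {})
    return BASE_URL + "?" + urlencode(merged)


# Default highlighter (the one showing the duplicated digit in this report).
print(build_url())
# Same request with the FastVectorHighlighter switched on, for comparison.
print(build_url({"hl.useFastVectorHighlighter": "true"}))
```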

The highlight snippets I get:

{code:javascript}
{
"name": [
  "Demographics and income: Income distribution: Number of HHs earning &gt; US$<em class=\"match\">5,000</em> p.a. (constant 2005 prices) by country"
],
"descr": [
  "Number of households with disposable income of more than US$5<em class=\"match\">5,000</em> per annum at constant 2005 prices"
]
}
{code}

Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in the {{name}} field snippet. The only difference between the two field definitions is that {{descr}} has {{termVectors}}, {{termPositions}} and {{termOffsets}} enabled, so those settings are presumably relevant.
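
To double-check that the two field definitions really differ only in those three attributes, they can be compared programmatically. The sketch below copies the two {{<field>}} elements from the schema excerpt above; the helper name {{attribute_diff}} is hypothetical, and the {{name}} attribute is ignored since it necessarily differs.

```python
import xml.etree.ElementTree as ET

# The two field definitions, copied from the schema excerpt in this report.
NAME_FIELD = '<field name="name" type="text" indexed="true" stored="true" />'
DESCR_FIELD = (
    '<field name="descr" type="text" indexed="true" stored="true" '
    'termVectors="true" termOffsets="true" termPositions="true" />'
)


def attribute_diff(xml_a, xml_b, ignore=("name",)):
    """Map each differing attribute to its (value in a, value in b) pair."""
    a = ET.fromstring(xml_a).attrib
    b = ET.fromstring(xml_b).attrib
    keys = (set(a) | set(b)) - set(ignore)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


print(attribute_diff(NAME_FIELD, DESCR_FIELD))
# {'termOffsets': (None, 'true'), 'termPositions': (None, 'true'), 'termVectors': (None, 'true')}
```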

  was:
With original text {{Population 5.000 - 9.999}} indexed with termVectors, termPositions and termOffsets, the Highlighter produces snippets like {{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}. Note the duplicated {{5}} before the {{<em}}; that's the bug.

This does not happen when {{useFastVectorHighlighter=true}}.

It also does not happen in a field without termVectors, termPositions and termOffsets.

To reproduce, field definitions:

{code:xml}
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    ...

    <field name="name" type="text" indexed="true" stored="true" />
    <field name="descr" type="text" indexed="true" stored="true" termVectors="true" termOffsets="true" termPositions="true" />
{code}

All configured and explicit parameters, from {{echoParams=all}}:

{code:javascript}
{
defType: "edismax",
echoParams: "all",
facet.mincount: "1",
fl: "id",
hl.fl: "id name tag cat descr dim dimvalue provider source_source text",
hl.fragsize: "200",
hl.mergeContiguous: "true",
hl.simple.post: "</em>",
hl.simple.pre: "<em class="match">",
hl.snippets: "4",
hl.usePhraseHighlighter: "true",
hl: "true",
q.alt: "*:*",
q: "5000",
qf: " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2 provider^2 source_source^2 text^2 ",
qt: "dismax",
rows: "10",
sort: "score desc"
}
{code}

and a piece of text containing numbers with thousand separators, e.g. “Demographics and income: Income distribution: Number of HHs earning &gt; US$5,000 p.a. (constant 2005 prices) by country”

The highlighting response I get:

{code:javascript}
{
name: [
  "Demographics and income: Income distribution: Number of HHs earning &gt; US$<em class="match">5,000</em> p.a. (constant 2005 prices) by country"
],
descr: [
  "Number of households with disposable income of more than US$5<em class="match">5,000</em> per annum at constant 2005 prices"
]
}
{code}

Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in the {{name}} field snippet. The only difference between these fields is termVectors, termPositions and termOffsets, so those settings are presumably relevant.

    
> Highlighter duplicates numeric token in snippet when term vectors/positions/offsets on
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-4851
>                 URL: https://issues.apache.org/jira/browse/SOLR-4851
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 3.6.2
>            Reporter: Gunnlaugur Thor Briem
