You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Shawn Heisey (JIRA)" <ji...@apache.org> on 2015/07/20 17:43:04 UTC
[jira] [Updated] (LUCENE-6689) Odd analysis problem with WDF, appears to be triggered by preceding analysis components

     [ https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shawn Heisey updated LUCENE-6689:
---------------------------------
    Description: 
This problem shows up for me in Solr, but I believe the issue is down at the Lucene level, so I've opened the issue in the LUCENE project.  We can move it if necessary.

I've boiled the problem down to this minimum Solr fieldType:

{noformat}
    <fieldType name="testType" class="solr.TextField"
sortMissingLast="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer
class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
      <analyzer type="query">
        <tokenizer
class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
        />
      </analyzer>
    </fieldType>
{noformat}

On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then aaa ends up at term position 1 and bbb at term position 2.  This seems perfectly reasonable to me.  In Solr 4.9, both terms end up at position 2.  This causes phrase queries which used to work to return zero hits.  The exact text of the phrase query is in the original documents that match on 4.7.

If the custom rbbi (which is included unmodified from the lucene icu analysis source code) is not used, then the problem doesn't happen, because the punctuation doesn't make it to the PRF.  If the PatternReplaceFilterFactory is not present, then the problem doesn't happen.

I can work around the problem by setting luceneMatchVersion to 4.7, but I think the behavior is a bug, and I would rather not continue to use 4.7 analysis when I upgrade to 5.x, which I hope to do soon.


  was:
This problem shows up for me in Solr, but I believe the issue is down at the Lucene level, so I've opened the issue in the LUCENE project.  We can move it if necessary.

I've boiled the problem down to this minimum Solr fieldType:

{noformat}
    <fieldType name="testType" class="solr.TextField"
sortMissingLast="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer
class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
    </fieldType>
{noformat}

On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then aaa ends up at term position 1 and bbb at term position 2.  This seems perfectly reasonable to me.  In Solr 4.9, both terms end up at position 2.  This causes phrase queries which used to work to return zero hits.  The exact text of the phrase query is in the original documents that match on 4.7.

If the custom rbbi (which is included unmodified from the lucene icu analysis source code) is not used, then the problem doesn't happen, because the punctuation doesn't make it to the PRF.  If the PatternReplaceFilterFactory is not present, then the problem doesn't happen.

I can work around the problem by setting luceneMatchVersion to 4.7, but I think the behavior is a bug, and I would rather not continue to use 4.7 analysis when I upgrade to 5.x, which I hope to do soon.



> Odd analysis problem with WDF, appears to be triggered by preceding analysis components
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the Lucene level, so I've opened the issue in the LUCENE project.  We can move it if necessary.
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
>     <fieldType name="testType" class="solr.TextField"
> sortMissingLast="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer
> class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="1"
>           catenateNumbers="1"
>           catenateAll="0"
>           preserveOriginal="1"
>         />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer
> class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="0"
>           catenateNumbers="0"
>           catenateAll="0"
>           preserveOriginal="0"
>         />
>       </analyzer>
>     </fieldType>
> {noformat}
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then aaa ends up at term position 1 and bbb at term position 2.  This seems perfectly reasonable to me.  In Solr 4.9, both terms end up at position 2.  This causes phrase queries which used to work to return zero hits.  The exact text of the phrase query is in the original documents that match on 4.7.
> If the custom rbbi (which is included unmodified from the lucene icu analysis source code) is not used, then the problem doesn't happen, because the punctuation doesn't make it to the PRF.  If the PatternReplaceFilterFactory is not present, then the problem doesn't happen.
> I can work around the problem by setting luceneMatchVersion to 4.7, but I think the behavior is a bug, and I would rather not continue to use 4.7 analysis when I upgrade to 5.x, which I hope to do soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org