Posted to oak-issues@jackrabbit.apache.org by "Dave Hughes (Jira)" <ji...@apache.org> on 2020/11/24 18:55:00 UTC
[jira] [Comment Edited] (OAK-9145) OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order

    [ https://issues.apache.org/jira/browse/OAK-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238274#comment-17238274 ] 

Dave Hughes edited comment on OAK-9145 at 11/24/20, 6:54 PM:
-------------------------------------------------------------

Thank you [~thomasm], much appreciated.

In response to the problems you mentioned in your last comment:

 1. Yes, I considered the backwards-compatibility issue.  In my mind, this fixes a bug, and I don't generally place much value on preserving backwards compatibility with bugs.  But I also understand that this is a widely used project and that many consumers may depend on the current incorrect behaviour.

I'm not sure I follow your suggestion to use a different version number.  Are you proposing that this fix should wait for the next major version release of Oak?  Or that this should become a new (non-default) OakAnalyzer2, in order to provide an analyzer which works correctly, but would have to be manually selected by consumers?  If the latter, I feel like it defeats the purpose of this bugfix, since the effort to switch to the corrected OakAnalyzer2 would be practically identical to manually configuring the analyzer's filter chain (the workaround that I described).


 2. As I commented on the GitHub PR, I did try to create a test case for this, but I struggled a lot.  As you mentioned in your earlier comment, "Not sure where to put it best".  I think the problem is largely that OakAnalyzer has not been adequately tested, which is why there's no obvious place to put the new test case.  I would greatly appreciate it if any contributors who are more familiar with the code base could help add tests around the OakAnalyzer class.


> OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order
> --------------------------------------------------------------------------
>
>                 Key: OAK-9145
>                 URL: https://issues.apache.org/jira/browse/OAK-9145
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing, jcr, lucene
>         Environment: Discovered while performing DAM searches in Adobe Experience Manager. 
>            Reporter: Dave Hughes
>            Assignee: Thomas Mueller
>            Priority: Minor
>              Labels: easyfix, pull-request-available
>
> I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the wrong order.  WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS flag, which splits camelCase/PascalCase into multiple terms, but since the LowerCaseFilter is applied first, the mixed case is lost and the terms can no longer be split.
> As a result, a search for savings against the damAssetLucene index (which uses the default OakAnalyzer) does not find an asset named savingsAccount.svg.
> After I configured the index's analyzers (/oak:index/damAssetLucene/analyzers) to apply WordDelimiterFilter before LowerCaseFilter, I saw the correct behaviour.
> {noformat}
> {
>   "jcr:primaryType": "nt:unstructured",
>   "default": {
>     "jcr:primaryType": "nt:unstructured",
>     "tokenizer": {
>       "jcr:primaryType": "nt:unstructured",
>       "name": "Standard"
>     },
>     "filters": {
>       "jcr:primaryType": "nt:unstructured",
>       "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
>       "LowerCase": {"jcr:primaryType": "nt:unstructured"}
>     }
>   }
> }
> {noformat}
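
The ordering problem described above can be illustrated without Lucene. The sketch below is a simplified stand-in, not the actual OakAnalyzer or WordDelimiterFilter code: splitWordParts is a hypothetical helper that only splits on lower-to-upper case transitions and on non-alphanumeric characters, and the class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FilterOrderDemo {

    // Hypothetical, simplified word-part splitter: breaks a token at
    // lower-to-upper case transitions and at non-alphanumeric characters.
    // It is NOT Lucene's WordDelimiterFilter, only an approximation of
    // the camelCase splitting the issue relies on.
    static List<String> splitWordParts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(current, parts);
                continue;
            }
            if (current.length() > 0 && Character.isUpperCase(c)
                    && Character.isLowerCase(token.charAt(i - 1))) {
                flush(current, parts);
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    // Moves the accumulated characters into the result list, if any.
    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    // Lowercases every token, standing in for a lowercase filter.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Order reported in the issue: lowercase first, then split.
    static List<String> lowerThenSplit(String token) {
        return splitWordParts(token.toLowerCase(Locale.ROOT));
    }

    // Order used in the workaround: split first, then lowercase.
    static List<String> splitThenLower(String token) {
        return lowerCase(splitWordParts(token));
    }

    public static void main(String[] args) {
        String token = "savingsAccount.svg";
        System.out.println(lowerThenSplit(token)); // [savingsaccount, svg]
        System.out.println(splitThenLower(token)); // [savings, account, svg]
    }
}
```

With the input savingsAccount.svg, lowercasing first leaves savingsaccount as a single term, so a query for savings has nothing to match. Splitting first yields savings as its own term, which is the behaviour the workaround configuration aims for.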


