Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2017/02/17 06:22:41 UTC

[jira] [Commented] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

    [ https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871246#comment-15871246 ] 

Chetan Mehrotra commented on OAK-5692:
--------------------------------------

bq. The docs mention the "default" analyzer ([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? How are they selected for use? is the selection configurable?

Currently only one analyzer can be configured. That can be done either by
# Specifying a default analyzer directly via its class name
# OR composing one from Tokenizers, TokenFilters and CharFilters
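As a rough illustration of the two options, the sketch below models the index definition subtree as nested Python dicts. It is not the Oak API: the {{tokenizer}} and {{filters}} node names, the {{name}} property, the {{HTMLStrip}} char filter and the stop word file names are assumptions for illustration, and {{analyzer_mode}} is a hypothetical helper.

```python
# Hypothetical in-memory model of the index definition subtree; NOT the Oak API.
# "tokenizer", "filters", "name", "HTMLStrip" and the file names are assumptions.

# Option 1: name an analyzer class directly on analyzers/default.
by_class = {
    "analyzers": {
        "default": {
            "class": "org.apache.lucene.analysis.standard.StandardAnalyzer",
        }
    }
}

# Option 2: compose the analyzer from char filters, a tokenizer and filters.
# Dicts keep insertion order, which stands in for the order of the child nodes.
by_composition = {
    "analyzers": {
        "default": {
            "charFilters": {"HTMLStrip": {}},
            "tokenizer": {"name": "Standard"},
            "filters": {
                "LowerCase": {},
                "Stop": {"words": "stop1.txt, stop2.txt"},
            },
        }
    }
}


def analyzer_mode(index_definition):
    """Report how the single 'default' analyzer is defined in this model."""
    default = index_definition["analyzers"]["default"]
    if "class" in default:
        return "class"
    if "tokenizer" in default:
        return "composed"
    raise ValueError("default analyzer needs either a class or a tokenizer")


print(analyzer_mode(by_class))
print(analyzer_mode(by_composition))
```

In both shapes there is only the one {{default}} child under {{analyzers}}, matching the single-analyzer limitation described above.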

bq. By default is the analyzer index AND query time, unless specified by `type=index|query` property?

Currently the same analyzer is used at both query time and index time.

The reason for naming the config node "analyzers" was to leave room for configuring more analyzers and using them in different contexts for different properties. So far no one has asked for that, so this part has not been extended.

bq. The Stop filters words property must be a String not String[] and the value is a comma delimited String value. 

Yes. All config values must be of type String. No other JCR type should be used here.
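A minimal sketch of that rule, again using plain Python dicts as a stand-in for the config nodes rather than any Oak or JCR API. Both helper names are hypothetical, and the {{ignoreCase}} property in the usage lines is only an illustrative non-string value.

```python
def as_config_string(values):
    """Collapse a list of values into one comma-delimited string."""
    if isinstance(values, str):
        return values
    return ",".join(str(v).strip() for v in values)


def non_string_properties(node, path=""):
    """Yield the paths of properties whose value is not a plain string."""
    for name, value in node.items():
        child_path = f"{path}/{name}"
        if isinstance(value, dict):
            # A nested dict models a child node, so descend into it.
            yield from non_string_properties(value, child_path)
        elif not isinstance(value, str):
            yield child_path


# A multi-valued "words" and a boolean flag would both be flagged.
filters = {"Stop": {"words": ["stop1.txt", "stop2.txt"], "ignoreCase": True}}
print(list(non_string_properties(filters)))

# Converting both to single strings leaves nothing to report.
filters["Stop"]["words"] = as_config_string(filters["Stop"]["words"])
filters["Stop"]["ignoreCase"] = "true"
print(list(non_string_properties(filters)))
```

A check like this could be run over a definition before it is saved, to catch a String[] or Boolean property that would otherwise be written by accident.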

> Oak Lucene analyzers docs unclear on viable configurations
> ----------------------------------------------------------
>
>                 Key: OAK-5692
>                 URL: https://issues.apache.org/jira/browse/OAK-5692
>             Project: Jackrabbit Oak
>          Issue Type: Documentation
>            Reporter: David Gonzalez
>
> The Oak lucene docs [1] > Analyzers section would benefit from clarification:
> Combining analyzer-based topics into a single ticket
> * If no analyzer is specified, what analyzer setup is used (at the very least some tokenizer must be used)
> * The docs mention the "default" analyzer ([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? How are they selected for use? is the selection configurable?
> * By default is the analyzer index AND query time, unless specified by `type=index|query` property?
> * What is the naming for multiple analyzer nodes? Are all children of analyzers assumed to be an analyzer? Ex. If i want a special configuration or index and another for query, could i create:
> {noformat}
> ../myIndex/analyzers/indexAnalyzer@type=index
> .. define the index-time analyzer ...
> ../myIndex/analyzers/queryAnalyzer@type=query
> .. define the query-time analyzer ...
> {noformat}
> * How are languages handled? Ex. language specific stop words, synonyms, char mapping,  and Stemming.
> * If [oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer it appears the Standard Tokenizer and Standard Lowercase and Stop Filters are used. The Stop filter can be augmented w the well-named stopwords file.
> ** Can other charFilters/filters be layered on top of this "named" Analyzer (it seems not).
> * When the Stop Filter is used it provides the OOTB language-based stop words. If a custom stopwords file is provided, that list replaces the OOTB language-based one, requiring the developer to provide their own language-based stop words. Is this correct? This should be called out, with a link to the catalog of OOTB stopword txt files for easy inclusion.
> * The Stop filters words property must be a String not String[] and the value is a comma delimited String value. Would be good to call this out.
> * What are all the CharFilters/Filters available? Is there a concise list w/ their params? (Ex. i think the PorterStem might support and ignoreCase param?)
> * Synonym Filter syntax is unclear; it seems like there are 2 formats: directional x -> y and bi-directional (comma delimited); I could only get the latter to work.
> * Are all the options in the link [2] supported? It's unclear if there is a 1:1 between oak lucene and solr's capabilities or if [2] is a loose example of the "types" of supported analyzers.
> * For something like the PatternReplaceCharFilterFactory [3], how do you define multiple pattern mappings, as IIUC the charFilter node MUST be named:
> {noformat}.../charFilters/PatternReplace{noformat} so you can't have multiple "PatternReplace" named nodes, each with its own "@pattern" and "@replace" properties.  It seems like there is only support for a single object for each Factory type?
> Generally this seems like the handiest resource: https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters
> [1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
> [2] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
> [3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)