Posted to oak-dev@jackrabbit.apache.org by Piotr Tajduś <pi...@skg.pl> on 2019/06/24 15:14:07 UTC

Lucene index - problem with indexOriginalTerm

Hi,

In recent OAK versions I have noticed problems with indexing words that contain
special characters. indexOriginalTerm was set on the index, yet I couldn't find
terms like "xxx-yyy*" (some versions ago this worked, I think). I have checked
the sources, and it seems that "indexOriginalTerm" is used as a parameter of the
WordDelimiterFilter; however, OakAnalyzer uses StandardTokenizer, which splits
words on almost every special character before the WordDelimiterFilter runs.
Here is a fragment of the class description:

 * One use for {@link WordDelimiterFilter} is to help match words with different
 * subword delimiters. For example, if the source text contained "wi-fi" one may
 * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
 * is to specify combinations="1" in the analyzer used for indexing, and
 * combinations="0" (the default) in the analyzer used for querying. Given that
 * the current {@link StandardTokenizer} immediately removes many intra-word
 * delimiters, it is recommended that this filter be used after a tokenizer that
 * does not do this (such as {@link WhitespaceTokenizer}).
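To make the ordering problem concrete, here is a minimal sketch in plain Java.
The two tokenizers are only simulated with regular expressions (these are not
the real Lucene classes), but they show why a filter that preserves the
original term can never see "wi-fi" if the tokenizer has already split it:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerOrderDemo {

    // Rough stand-in for StandardTokenizer: splits on whitespace AND on
    // intra-word punctuation such as '-' and '+', so "wi-fi" is already
    // broken into ["wi", "fi"] before any token filter runs.
    static List<String> standardTokenize(String text) {
        return Arrays.asList(text.split("[\\s\\-+]+"));
    }

    // Rough stand-in for WhitespaceTokenizer: splits on whitespace only,
    // so "wi-fi" survives as a single token that a downstream
    // WordDelimiterFilter (with preserveOriginal) could keep intact.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "connect to wi-fi";
        System.out.println(standardTokenize(text));   // [connect, to, wi, fi]
        System.out.println(whitespaceTokenize(text)); // [connect, to, wi-fi]
    }
}
```

With the first pipeline the original hyphenated term is gone before the
WordDelimiterFilter ever receives it, which would explain why a query like
"xxx-yyy*" finds nothing even with indexOriginalTerm enabled.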

Is this the intended functionality?


Best regards,

Piotr