Posted to oak-dev@jackrabbit.apache.org by Piotr Tajduś <pi...@skg.pl> on 2019/06/24 15:14:07 UTC
Lucene index - problem with indexOriginalTerm
Hi,
In the latest Oak versions I have noticed problems with indexing words that
contain special characters. indexOriginalTerm was set on the index, yet I
could not find terms like "xxx-yyy*" (a few versions ago this worked, I
think). I have checked the sources, and it seems that "indexOriginalTerm"
is used as a parameter of the WordDelimiterFilter; however, OakAnalyzer
uses StandardTokenizer, which splits words on almost every special
character before WordDelimiterFilter ever sees them. Here is a fragment of
that class's description:
/*
 * One use for {@link WordDelimiterFilter} is to help match words with different
 * subword delimiters. For example, if the source text contained "wi-fi" one may
 * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
 * is to specify combinations="1" in the analyzer used for indexing, and
 * combinations="0" (the default) in the analyzer used for querying. Given that
 * the current {@link StandardTokenizer} immediately removes many intra-word
 * delimiters, it is recommended that this filter be used after a tokenizer that
 * does not do this (such as {@link WhitespaceTokenizer}).
 */
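To make the ordering problem concrete, here is a minimal sketch (plain Java, no Lucene dependency) that just mimics the two split strategies: a StandardTokenizer-like split on non-alphanumeric characters versus a WhitespaceTokenizer-like split. The class and method names are my own, for illustration only:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {

    // Roughly what StandardTokenizer does to intra-word punctuation:
    // "xxx-yyy" is broken apart before any later filter can see the whole term.
    static List<String> standardLike(String text) {
        return Arrays.asList(text.split("[^\\p{Alnum}]+"));
    }

    // Roughly what WhitespaceTokenizer does: split on whitespace only,
    // so "xxx-yyy" survives as one token for WordDelimiterFilter to expand.
    static List<String> whitespaceLike(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(standardLike("xxx-yyy"));   // [xxx, yyy]
        System.out.println(whitespaceLike("xxx-yyy")); // [xxx-yyy]
    }
}
```

So with the current tokenizer the original term "xxx-yyy" never reaches WordDelimiterFilter intact, which would explain why indexOriginalTerm has no effect here.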
Is this the intended behaviour?
Best regards,
Piotr