Posted to java-user@lucene.apache.org by Parit Bansal <Pa...@sib.swiss> on 2017/12/22 11:51:10 UTC
WordDelimiterIterator word splitting usecase
Hi,
I have been migrating and maintaining Lucene indexing code for our use
case since version 2.x (we are now on 6.6.1 and migrating to 7.x).
One problem I keep running into concerns the
org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator class,
which is declared final in the Lucene codebase. This class has an
isBreak() method that decides when to split a word into subwords. One of
the cases is *ALPHA->NUMERIC, NUMERIC->ALPHA :Don't split* (both handled
in the same if condition).
Unfortunately, in my use case we want the *Don't split* rule to apply to
NUMERIC->ALPHA only, and there is no way to get this behavior through
the configurationFlags.
Since isBreak() is private and the WordDelimiterIterator class is
final, there is no way to subclass it and override this method.
Also, WordDelimiterIterator is tightly coupled with
WordDelimiterFilter (WordDelimiterGraphFilter in 7.x), and both are
final. This leaves me with only one option: copy their code into custom
classes and change the behaviour there. Clearly this is not a
maintainable solution.
So, I am looking for advice on what else is possible. Or would a
patch/refactoring that makes isBreak() honour some new configuration
flags be an option?
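
To make the request concrete, here is a rough standalone sketch of the
kind of per-direction control I have in mind. It is not Lucene code and
does not use any Lucene class: the class name DirectionalSplitSketch and
the two boolean parameters are hypothetical, and the sketch only looks at
letter/digit transitions (the real isBreak() also deals with case changes
and delimiters, which are ignored here).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Standalone illustration only; NOT part of Lucene.
 * The two flags below are hypothetical and exist only in this sketch.
 */
public final class DirectionalSplitSketch {

    private DirectionalSplitSketch() {}

    /**
     * Decides whether to break between two adjacent characters.
     * splitAlphaToNumeric and splitNumericToAlpha are assumed,
     * independent flags, one per transition direction.
     */
    public static boolean isBreak(char prev, char cur,
                                  boolean splitAlphaToNumeric,
                                  boolean splitNumericToAlpha) {
        if (Character.isLetter(prev) && Character.isDigit(cur)) {
            return splitAlphaToNumeric;   // ALPHA->NUMERIC
        }
        if (Character.isDigit(prev) && Character.isLetter(cur)) {
            return splitNumericToAlpha;   // NUMERIC->ALPHA
        }
        return false;                     // other transitions: not modelled here
    }

    /** Splits a word into subwords using the directional rule above. */
    public static List<String> split(String word,
                                     boolean splitAlphaToNumeric,
                                     boolean splitNumericToAlpha) {
        List<String> parts = new ArrayList<>();
        if (word.isEmpty()) {
            return parts;
        }
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            if (isBreak(word.charAt(i - 1), word.charAt(i),
                        splitAlphaToNumeric, splitNumericToAlpha)) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        parts.add(word.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // Split on ALPHA->NUMERIC, but keep NUMERIC->ALPHA together.
        System.out.println(split("abc123def", true, false));
    }
}
```

With the flags set to (true, false), "abc123def" comes out as "abc" and
"123def"; with (false, false) it stays whole. Two independent flags of
this sort inside isBreak() would cover my case without copying classes.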
- Best
Parit Bansal
(Developer www.uniprot.org)