Posted to java-user@lucene.apache.org by Parit Bansal <Pa...@sib.swiss> on 2017/12/22 11:51:10 UTC

WordDelimiterIterator word splitting usecase

Hi,

I have been migrating and maintaining the Lucene indexing code for our
use case since the 2.x versions (we are now on 6.6.1, migrating to 7.x).

One problem I keep running into concerns the
org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator class,
which is declared final in the Lucene codebase. This class has an
isBreak() method that decides when to split a word into subwords. One of
its cases is *ALPHA->NUMERIC, NUMERIC->ALPHA :Don't split*, with both
directions handled in the same if condition.

Unfortunately, in our use case we strictly want only *NUMERIC->ALPHA
:Don't split*, and there is no way to change this behaviour using the
configurationFlags.

Since the isBreak() method is private and the WordDelimiterIterator
class is final, there is no way to subclass it and override this
method.

Also, WordDelimiterIterator is tightly coupled with
WordDelimiterFilter (WordDelimiterGraphFilter in 7.x), and both are
final. This leaves me with only one option: copy-paste their code
into custom classes and change the behaviour. Clearly this is not a
maintainable solution.

So, I am looking for advice: what else is possible? Or is there a
possibility of a patch/refactoring so that isBreak() can be controlled
by some new configuration flags?
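
To make the request concrete, here is a rough standalone sketch of the
behaviour I have in mind. It is not the Lucene source and does not use
any Lucene API: the class name, the type constants, and the two flag
names (splitOnAlphaToNumeric, splitOnNumericToAlpha) are all made up for
illustration, and delimiter handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these names exist in Lucene.
class DirectionalBreakSketch {
    static final int ALPHA = 1;
    static final int DIGIT = 2;
    static final int OTHER = 4;

    // Assumed new flags: one per transition direction.
    final boolean splitOnAlphaToNumeric;
    final boolean splitOnNumericToAlpha;

    DirectionalBreakSketch(boolean splitOnAlphaToNumeric, boolean splitOnNumericToAlpha) {
        this.splitOnAlphaToNumeric = splitOnAlphaToNumeric;
        this.splitOnNumericToAlpha = splitOnNumericToAlpha;
    }

    // Simplified character classification (letters, digits, everything else).
    static int charType(char c) {
        if (Character.isLetter(c)) return ALPHA;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    // Decide whether a subword boundary falls between two adjacent characters.
    boolean isBreak(int lastType, int type) {
        if (lastType == type) {
            return false; // same type: never split
        }
        if (lastType == ALPHA && type == DIGIT) {
            return splitOnAlphaToNumeric; // ALPHA->NUMERIC has its own flag
        }
        if (lastType == DIGIT && type == ALPHA) {
            return splitOnNumericToAlpha; // NUMERIC->ALPHA has its own flag
        }
        return true; // any other type change splits
    }

    // Cut a word into subwords at every position where isBreak() says so.
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            if (isBreak(charType(word.charAt(i - 1)), charType(word.charAt(i)))) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        if (start < word.length()) {
            parts.add(word.substring(start));
        }
        return parts;
    }
}
```

With splitOnAlphaToNumeric = true and splitOnNumericToAlpha = false,
this sketch splits "abc123def" into "abc" and "123def", which is the
kind of result we need.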

- Best

Parit Bansal

(Developer www.uniprot.org)


