Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2016/06/16 19:43:05 UTC

[jira] [Commented] (LUCENE-7342) WordDelimiterFilter should observe KeywordAttribute to pass these tokens through

    [ https://issues.apache.org/jira/browse/LUCENE-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334528#comment-15334528 ] 

David Smiley commented on LUCENE-7342:
--------------------------------------

I considered wrapping WDF; it was an interesting thought experiment with a possible solution, but our TokenStream API makes doing this very complex.  It would involve an additional collaborating TokenStream instance to provide as the input to the delegated TokenFilter, thus intercepting the input.  The input-intercepting TokenFilter would detect that a token should pass through and then call captureState() in a loop until it finds a token not matching the predicate.  The wrapping TokenFilter would call delegateTokenFilter.incrementToken() but would then check whether there are any cached tokens.  If there are, it would captureState(), replay the cached tokens, then replay the just-captured state.  This is a big mess, awkward to use, and has some overhead.

> WordDelimiterFilter should observe KeywordAttribute to pass these tokens through
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-7342
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7342
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>
> I have a text analysis requirement in which I want certain tokens to not be processed by WordDelimiterFilter -- i.e. they should pass through that filter unchanged.  WDF, like several other TokenFilters, has a configurable word list, but this list is static, producing a concrete CharArraySet.  Thus, for example, I can't filter by a regexp, nor can I filter based on other attributes.
> A simple solution that makes sense to me is to have WDF use KeywordAttribute to know if it should skip the token.  KeywordAttribute seems fairly generic as to how it can be used, although granted today it's only used by the stemmers.  That attribute isn't named "StemmerIgnoreAttribute" or some such; it's generic, so I think it's fine for WDF to use it in a similar way.
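
To make the proposal concrete, here is a minimal sketch of the idea in plain Java. It deliberately does not use the Lucene API: the Token class, its keyword flag (standing in for KeywordAttribute), and the methods markKeywords and splitUnlessKeyword are all hypothetical names invented for illustration, and the splitting rule is a crude stand-in for what WDF really does. The point is only that an upstream step can mark tokens by any predicate (a regexp here), and the splitting step needs just one check per token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Simplified model of the proposal; NOT the Lucene TokenStream API. */
class KeywordPassThroughSketch {

    /** Hypothetical token: its text plus a flag playing the role of KeywordAttribute. */
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    /** Upstream step: mark tokens as keywords using an arbitrary predicate. */
    static List<Token> markKeywords(List<String> terms, Predicate<String> isKeyword) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, isKeyword.test(term)));
        }
        return out;
    }

    /**
     * Crude stand-in for WDF: splits on runs of non-alphanumeric characters,
     * but passes a token through untouched when its keyword flag is set.
     */
    static List<String> splitUnlessKeyword(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            if (token.keyword) {
                out.add(token.text); // pass through, no splitting
                continue;
            }
            for (String part : token.text.split("[^A-Za-z0-9]+")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Protect anything that looks like "SKU-<digits>" (an assumed example pattern).
        List<Token> marked = markKeywords(
                Arrays.asList("wi-fi", "SKU-1234", "foo_bar"),
                term -> term.matches("SKU-\\d+"));
        System.out.println(splitUnlessKeyword(marked)); // prints [wi, fi, SKU-1234, foo, bar]
    }
}
```

In this model the predicate lives entirely outside the splitting step, which is what a static word list cannot offer: the splitter only reads a flag that some earlier stage has set.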



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
