Posted to dev@lucene.apache.org by "Mike Sokolov (JIRA)" <ji...@apache.org> on 2018/04/23 15:03:00 UTC

[jira] [Comment Edited] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

    [ https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448257#comment-16448257 ] 

Mike Sokolov edited comment on LUCENE-8265 at 4/23/18 3:02 PM:
---------------------------------------------------------------

Good point [~khitrin]. I am having difficulty coming up with an example of something that would be marked as a keyword to protect it from stemming but that *should* be split up by WDF (or, conversely, something that should be stemmed but not split). Still, I agree that in principle these are separate decisions, and adding an option is low-complexity. I hope to post a change later today.

[~romseygeek] that idea is really cool, but I think it is more complexity than I want to take on for this issue. I have tried making a wrapping filter before, and it is pretty tricky. In my experience you have to be very careful about (1) how reset() calls propagate, and (2) the signal to switch behavior. For example, you probably want to be able to call incrementToken() on either of two different upstream filters that share the same input, based on the value of some attribute. However, by the time you see the attribute, it is already too late to switch! So you have to introduce a delay token that carries only this "switching" info.
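
The switching problem described above can be modeled in plain Java. This is only a rough sketch and not Lucene code: Token, PeekableInput, and route are hypothetical names, and it swaps in a one-token lookahead buffer in place of the delay token mentioned above, purely to show why the flag has to be visible before a branch consumes the token. It does not model reset() propagation at all.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java model only; none of these types are Lucene classes.
class SwitchingSketch {

    /** A term plus a flag standing in for the attribute that drives the switch. */
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    /** Shared input with a one-token buffer, so the flag is visible before a branch consumes the token. */
    static final class PeekableInput {
        private final Iterator<Token> source;
        private Token buffered;

        PeekableInput(Iterator<Token> source) {
            this.source = source;
        }

        /** Returns the next token without consuming it, or null at end of input. */
        Token peek() {
            if (buffered == null && source.hasNext()) {
                buffered = source.next();
            }
            return buffered;
        }

        /** Consumes and returns the next token, or null at end of input. */
        Token take() {
            Token next = peek();
            buffered = null;
            return next;
        }
    }

    /** Routes each token to one of two branches; each branch pulls from the shared input itself. */
    static List<String> route(List<Token> tokens,
                              Function<PeekableInput, String> keywordBranch,
                              Function<PeekableInput, String> defaultBranch) {
        PeekableInput input = new PeekableInput(tokens.iterator());
        List<String> out = new ArrayList<>();
        Token next;
        while ((next = input.peek()) != null) {
            // The decision is made here, before the chosen branch consumes the token.
            Function<PeekableInput, String> branch = next.keyword ? keywordBranch : defaultBranch;
            out.add(branch.apply(input));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(new Token("1/2", true), new Token("sn-999123/1", false));
        List<String> out = route(tokens,
                in -> in.take().term,                         // keyword branch: pass through
                in -> in.take().term.replaceAll("[-/]", "")); // default branch: join on delimiters
        System.out.println(out);
    }
}
```

Without the buffer, the only way to read the flag would be to pull the token through one of the branches, at which point the choice has already been made.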


was (Author: sokolov):
Good point [~khitrin]. I agree, we should add an option. I'll post a change later today I hope.

[~romseygeek] that idea is really cool, but I think it is more complexity than I want to take on for this issue. I have tried making a wrapping filter before, and it is pretty tricky. In my experience you have to be very careful about (1) how reset() calls propagate, and (2) the signal to switch behavior. For example, you probably want to be able to call incrementToken() on either of two different upstream filters that share the same input, based on the value of some attribute. However, by the time you see the attribute, it is already too late to switch! So you have to introduce a delay token that carries only this "switching" info.

> WordDelimiterFilter should pass through terms marked as keywords
> ----------------------------------------------------------------
>
>                 Key: LUCENE-8265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8265
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This will help in cases where some terms containing separator characters should be split, but others should not. For example, this will enable a filter that identifies things that look like fractions and marks them as keywords so that 1/2 does not become 12, while still splitting and joining terms that look like part numbers containing slashes, e.g. "sn-999123/1", which might sometimes be written "sn-999123-1".
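
The intended behavior can be illustrated with a small plain-Java model. This is a minimal sketch under stated assumptions, not the Lucene implementation: Token, markFractions, and splitUnlessKeyword are hypothetical names, the fraction pattern (digits, slash, digits) is an assumption, and the splitter only handles '-' and '/' rather than everything WordDelimiterFilter does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java model of the proposal; none of these names are Lucene APIs.
class KeywordPassThroughSketch {

    /** A term plus a flag that plays the role of a keyword marker. */
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    // Assumption: a "fraction" is digits, a slash, then digits.
    private static final Pattern FRACTION = Pattern.compile("\\d+/\\d+");

    /** Stage 1: mark fraction-like terms as keywords. */
    static List<Token> markFractions(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, FRACTION.matcher(term).matches()));
        }
        return out;
    }

    /** Stage 2: split on '-' and '/' and also emit the joined form, but pass keywords through untouched. */
    static List<String> splitUnlessKeyword(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            if (token.keyword) {
                out.add(token.term); // protected term: neither split nor joined
                continue;
            }
            List<String> parts = new ArrayList<>();
            for (String part : token.term.split("[-/]")) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
            out.addAll(parts);
            if (parts.size() > 1) {
                out.add(String.join("", parts)); // joined (catenated) form
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("1/2", "sn-999123/1", "sn-999123-1");
        System.out.println(splitUnlessKeyword(markFractions(terms)));
    }
}
```

In this model "1/2" survives intact, while "sn-999123/1" and "sn-999123-1" both yield the same parts and the same joined form, so either spelling can match the other.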


