Posted to dev@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2014/05/02 05:44:35 UTC

Re: CJKBigramFilter - position bug with outputUnigrams?

On 4/21/2014 12:47 PM, Robert Muir wrote:
> I think you misunderstand what the filter does. It does not "output unigrams".
> 
> In the case you choose this option, the positions are from the
> unigrams emitted by your tokenizer (StandardTokenizer or whatever),
> and it just adds bigrams as synonyms to those. It cannot safely do
> anything else.
> 
> There can be only one "n".

I took a quick look at the code.  I'm sure it's easy to grasp once
you're really familiar with everything, but I'm having a hard time
decoding exactly how the filter works.  I don't have any more time to
plow through it tonight.

Would it be possible to implement an option with a name similar to
"lastUnigramAtPreviousPosition" so that I can optionally get the
behavior I'm after when the input is two or more characters, without
changing current behavior for anyone else?  This would completely solve
my current problem.

Thanks,
Shawn


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: CJKBigramFilter - position bug with outputUnigrams?

Posted by Robert Muir <rc...@gmail.com>.
>
> Would it be possible to implement an option with a name similar to
> "lastUnigramAtPreviousPosition" so that I can optionally get the
> behavior I'm after when the input is two or more characters, without
> changing current behavior for anyone else?  This would completely solve
> my current problem.
>

This is really not feasible. It sounds like multi-level n-grams in the
same field are a bad match for what you are doing (phrase queries
etc). This just doesn't work, and won't work, based on the mathematics.

Try another approach, like removing this filter completely; maybe the
word segmentation by ICU is good enough.
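
For readers following the thread, the position layout described in the
quoted explanation (unigrams keep the positions the tokenizer gave
them, and each bigram is stacked as a synonym at the position of its
first character) can be sketched roughly as below. This is a
standalone illustration, not the filter's code: the class and method
names are hypothetical, nothing here calls Lucene, and it assumes the
unigram is emitted before the bigram that shares its position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Lucene: models the token layout
// discussed in this thread for outputUnigrams=true.
public class BigramPositionSketch {

    // Returns "term@position" strings. Each unigram advances the
    // position by 1 (position increment 1); the bigram starting at
    // that unigram is stacked on the same position (increment 0).
    static List<String> unigramsWithBigrams(String[] unigrams) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first increment of 1 lands on position 0
        for (int i = 0; i < unigrams.length; i++) {
            position += 1;
            out.add(unigrams[i] + "@" + position);
            // The last unigram has no following character, so no
            // bigram shares its position.
            if (i + 1 < unigrams.length) {
                out.add(unigrams[i] + unigrams[i + 1] + "@" + position);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder single-character tokens standing in for CJK characters.
        for (String token : unigramsWithBigrams(new String[] {"A", "B", "C"})) {
            System.out.println(token);
        }
    }
}
```

Under these assumptions the last unigram sits alone at the final
position, which is the placement the proposed
"lastUnigramAtPreviousPosition" option was meant to change.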
