Posted to dev@lucene.apache.org by Mck <mi...@semb.wever.org> on 2008/09/15 11:56:29 UTC

RE: Re: Replacing FAST functionality atsesam.no-ShingleFilter+exactmatching

Steve,
> Your solution, on the one hand, however, is a kludge: you are
> disabling position information (by assigning the same position to all
> tokens) in order to induce a particular behavior in the query parser,
> which may change in the future.

I disagree.

I'm not disabling position information to induce particular behaviour in
the query parser.

I'm intentionally setting position information to zero as I wish _all_
shingles and unigrams to be synonyms of each other.

The query parser expects you to assign positionIncrement=0 for synonyms
in this manner.

The one kludge i see is that the QueryParser expects the total positions
found to be greater than or equal to one. It might not be intentionally
dealing with the total position count being zero. But the situation
where you have many synonyms is the same as having one token and it
having many synonyms, so positionCount=0 == positionCount=1.

I would think that both should lead to a BooleanQuery being constructed
by the QueryParser. (But the synonyms generated by the ShingleFilter are
in fact phrases so maybe it is wiser to use the MultiPhraseQuery.)

So all in all the QueryParser is behaving exactly as i would expect it
to.
The only logic being induced is setting positionIncrement=0 to indicate
that the token is a synonym of the previous token, and this logic is
completely encapsulated in the ShingleFilter.

~mck

ps i cross-posted as i thought this was better for the dev list but am
not sure.

-- 
"Enlightenment is your ego's biggest disappointment." Yoginanda 
| semb.wever.org | sesat.no | sesam.no |

Re: Replacing FAST functionality atsesam.no-ShingleFilter+exactmatching

Posted by Chris Hostetter <ho...@fucit.org>.
: Does it make sense for me to rewrite the ShingleFilter patch to ensure
: the first token returned always has positionIncrement=1 regardless if
: enablePositions is true or false?

I don't know anything about ShingleFilter -- i've never looked at it and 
i'm not entirely certain i even understand what it does, let alone how 
your patch is attempting to modify it -- but if you've got code that takes 
in text and it produces multiple tokens as a result, then that first token 
it produces should (probably) have a non-zero positionIncrement.  if your 
code takes in "a" Token and produces multiple tokens to replace it, then 
the first token you produce should (probably) have the same 
positionIncrement as the input Token.
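
A rough illustration of that rule of thumb, as a minimal sketch in plain Java rather than Lucene code. SketchToken and ExpandSketch are hypothetical names invented for this example and stand in for a real Token and TokenFilter; only the handling of positionIncrement is modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a token: a term plus a position increment.
// This is NOT Lucene's Token class.
final class SketchToken {
    final String term;
    final int positionIncrement;

    SketchToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

final class ExpandSketch {
    // Replaces one input token with several output tokens.
    // The first output inherits the input's positionIncrement;
    // every later output uses 0, marking it as a synonym of the first.
    static List<SketchToken> expand(SketchToken input, List<String> replacements) {
        List<SketchToken> out = new ArrayList<SketchToken>();
        for (int i = 0; i < replacements.size(); i++) {
            int increment = (i == 0) ? input.positionIncrement : 0;
            out.add(new SketchToken(replacements.get(i), increment));
        }
        return out;
    }

    public static void main(String[] args) {
        SketchToken input = new SketchToken("please", 1);
        List<String> shingles = Arrays.asList("please", "please divide", "please divide this");
        for (SketchToken t : expand(input, shingles)) {
            System.out.println(t.term + " (+" + t.positionIncrement + ")");
        }
    }
}
```

With this shape the output tokens stay synonyms of each other, while whatever position the input token held is preserved for anything that comes before or after it in the stream.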


(Disclaimer: it's been a long time since i worked with the TokenStream 
APIs, so i'm hoping my memory isn't faulty and someone else backs me up 
with a "yeah, that's correct")

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Replacing FAST functionality atsesam.no-ShingleFilter+exactmatching

Posted by Mck <mi...@semb.wever.org>.
> (i'll test too that BooleanQuery works as presumed in my case...)

Indeed it works beautifully with a BooleanQuery.
I've updated the patch to LUCENE-1380

~mck

-- 
"If you have any trouble sounding condescending, find a Unix user to
show you how it's done." Scott Adams 
| semb.wever.org | sesat.no | sesam.no |

Re: Replacing FAST functionality atsesam.no-ShingleFilter+exactmatching

Posted by Mck <mi...@semb.wever.org>.
Chris,

> the safe thing to do is make sure the first token has 
> a positionIncrement of "1" and the 'synonyms" after that use an
> increment of "0"

Yes it makes sense to have at minimum the first token with
positionIncrement=1
I didn't see outside "the vacuum" at all, thank you for explaining.

Does it make sense for me to rewrite the ShingleFilter patch to ensure
the first token returned always has positionIncrement=1 regardless if
enablePositions is true or false?

(i'll test too that BooleanQuery works as presumed in my case...)

Would such a rewrite of the ShingleFilter patch be a substitute for the
custom Analyzer you talk about?
(i'm pushing to keep any patch restricted to the ShingleFilter since my
gut feeling is still that's where the change in behaviour is).

~mck

-- 
"Between two evils, I always pick the one I never tried before." Mae
West 
| semb.wever.org | sesat.no | sesam.no |

RE: Re: Replacing FAST functionality atsesam.no-ShingleFilter+exactmatching

Posted by Chris Hostetter <ho...@fucit.org>.
: The query parser expects you to assign positionIncrement=0 for synonyms
: in this manner.

correct.

: The one kludge i see is that the QueryParser expects the total positions
: found to be greater than or equal to one. It might not be intentionally
: dealing with the total position count being zero. But the situation
: where you have many synonyms is the same as having one token and it
: having many synonyms, so positionCount=0 == positionCount=1.

there has definitely been some wonkiness in various places in the code 
relating to the first token not having a positionIncrement of "1" ... i 
don't remember the details, and maybe it works fine even if every token in 
a stream is "0" but the safe thing to do is make sure the first token has 
a positionIncrement of "1" and the "synonyms" after that use an increment 
of "0"

This is important not only in case the Lucene internals freak out when 
the "first" token has an increment of "0" but also because you have no way 
of knowing if the first token you produce is really the first token being 
given to the IndexWriter (or QueryParser or what have you)

To be a well-behaved TokenStream producer you can't assume you operate in 
a vacuum:

1) multiple "Field" instances with the same field name could be added to a 
document, with an Analyzer that uses your Filter but doesn't define any 
particular positionIncrementGap ... if every token you produce has an 
increment of "0" all the tokens from the second Field instance will have 
the same resulting positions as all the tokens from the first Field 
instance (ie: they will all be considered synonyms of each other)

2) I could write an Analyzer that uses your Filter but always adds a 
starting "marker token" to the front of the TokenStream and a different 
ending marker token to the end of the stream (for doing creative things 
with SpanNearQueries) ... if all the tokens you produce have a 
positionIncrement of 0, the result would be that they would be considered 
synonyms of the starting marker token.
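
The arithmetic behind both cases can be sketched in plain Java (not Lucene code). PositionSketch is a hypothetical name, and the assumption that a position is simply the running sum of increments, starting so that a first increment of 1 lands on position 0, is a simplification made for this example.

```java
import java.util.Arrays;

final class PositionSketch {
    // Turns a sequence of position increments into absolute positions.
    // Assumption for this sketch: counting starts at -1, so a first token
    // with increment 1 lands on position 0.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // A start marker (increment 1) followed by three shingles with increment 0:
        // the shingles share the marker's position.
        System.out.println(Arrays.toString(absolutePositions(new int[] {1, 0, 0, 0})));
        // prints [0, 0, 0, 0]

        // Same stream, but the first shingle uses increment 1:
        // the shingles are synonyms of each other, not of the marker.
        System.out.println(Arrays.toString(absolutePositions(new int[] {1, 1, 0, 0})));
        // prints [0, 1, 1, 1]
    }
}
```

The same sum applies to case 1: with no positionIncrementGap, tokens from a second Field instance whose increments are all 0 land on the last position used by the first instance.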

: I would think that both should lead to a BooleanQuery being constructed
: by the QueryParser. (But the synonyms generated by the ShingleFilter are
: in fact phrases so maybe it is wiser to use the MultiPhraseQuery.)

If QueryParser gives an Analyzer a chunk of text, and it produces a stream 
of tokens that all exist at the same position, it produces a 
BooleanQuery, if they are *all* at different positions it produces a 
PhraseQuery, if *some* are at the same position, it produces a 
MultiPhraseQuery ... this is fairly fundamental to how QueryParser works, 
and can be relied upon.
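
That three-way decision can be paraphrased as a small plain-Java sketch. It is a restatement of the description above, not the QueryParser source. QueryShapeSketch and classify are hypothetical names, and two points are assumptions of the sketch: the first token is treated as always occupying a position (so a stream whose increments are all 0 counts as one position), and a single token is reported as a TermQuery.

```java
final class QueryShapeSketch {
    // Names the query type suggested by a stream of position increments.
    // Assumption: the first token always occupies a position, whatever its
    // increment, so an all-zero stream is treated as a single position.
    static String classify(int[] increments) {
        int tokenCount = increments.length;
        if (tokenCount == 0) {
            return "none";
        }
        if (tokenCount == 1) {
            return "TermQuery"; // single-token case, not covered by the description above
        }
        int distinctPositions = 1; // the first token's position
        for (int i = 1; i < tokenCount; i++) {
            if (increments[i] != 0) {
                distinctPositions++;
            }
        }
        if (distinctPositions == 1) {
            return "BooleanQuery";     // every token at the same position
        }
        if (distinctPositions == tokenCount) {
            return "PhraseQuery";      // every token at its own position
        }
        return "MultiPhraseQuery";     // some, but not all, positions shared
    }

    public static void main(String[] args) {
        System.out.println(classify(new int[] {1, 0, 0})); // prints BooleanQuery
        System.out.println(classify(new int[] {1, 1, 1})); // prints PhraseQuery
        System.out.println(classify(new int[] {1, 0, 1})); // prints MultiPhraseQuery
    }
}
```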




-Hoss

