Posted to user@nutch.apache.org by viz <vi...@gmail.com> on 2007/07/23 21:02:06 UTC

wrong query when using token expansion

Hello,

I'm working with morphologically rich languages that also make heavy use of
accents, so stemming and accent (in)sensitivity matter a lot: both strongly
affect recall and precision.

My approach (simplified here): I'm using Nutch 0.9. The Analyzer/Tokenizer can
return two tokens for each input word:
 - the original word
 - a token that is the unaccented stem of the original (with its
   PositionIncrement set to 0!)

An artificial example:
füümöö -> füümöö, fuum
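
To make the expansion concrete, here is a minimal standalone sketch of it in
Java. It deliberately does not use the Lucene or Nutch classes: the names
TokenExpansionSketch, Tok, unaccent and toyStem are made up for this
illustration, and toyStem (which just strips trailing vowels) is only a
stand-in for a real stemmer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class TokenExpansionSketch {
    /** Hypothetical token type: text plus increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Removes accents by decomposing to NFD and dropping combining marks. */
    static String unaccent(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** Toy stemmer (an assumption for this example): strips trailing vowels. */
    static String toyStem(String s) {
        return s.replaceAll("[aeiou]+$", "");
    }

    /** Emits each word with increment 1, then its unaccented stem with increment 0. */
    static List<Tok> expand(String text) {
        List<Tok> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new Tok(word, 1));
            String stem = toyStem(unaccent(word));
            if (!stem.isEmpty() && !stem.equals(word)) {
                out.add(new Tok(stem, 0)); // same position as the original word
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "\u00fc" is u-umlaut and "\u00f6" is o-umlaut, escaped to avoid encoding issues.
        for (Tok t : expand("f\u00fc\u00fcm\u00f6\u00f6")) {
            System.out.println(t.text + " (posIncr=" + t.positionIncrement + ")");
        }
    }
}
```

With the artificial word above, expand returns the original word with
increment 1, followed by "fuum" with increment 0.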

At the cost of this index expansion I expect to gain high recall (thanks to
the stem) while still allowing high precision (when searching for an exact
match on the original word, e.g. by using quotes).

However, if I run the query 'füümöö' (without quotes), the query parser in
NutchAnalysis generates a boolean query like:
   boolean query:+(url:"füümöö fuum"^4.0) .....

which, of course, returns no hits!
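
The reason there are no hits can be shown with a small toy model in Java.
This is only a sketch of my understanding, not the actual NutchAnalysis or
Lucene code; PositionAwareQuerySketch, Tok, groupByPosition, naivePhrase and
matches are all hypothetical names. A phrase built from the two tokens needs
"fuum" one position after "füümöö", but in the index both sit at the same
position. Treating tokens with increment 0 as alternatives for one position
would match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PositionAwareQuerySketch {
    /** Hypothetical token type: text plus increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int posIncr;

        Tok(String text, int posIncr) {
            this.text = text;
            this.posIncr = posIncr;
        }
    }

    /** Position-aware: a token with increment 0 joins the previous position. */
    static List<Set<String>> groupByPosition(List<Tok> toks) {
        List<Set<String>> positions = new ArrayList<>();
        for (Tok t : toks) {
            if (t.posIncr == 0 && !positions.isEmpty()) {
                positions.get(positions.size() - 1).add(t.text);
            } else {
                // Simplification: increments greater than 1 are treated like 1.
                Set<String> slot = new HashSet<>();
                slot.add(t.text);
                positions.add(slot);
            }
        }
        return positions;
    }

    /** Naive: every token gets its own phrase slot, ignoring the increments. */
    static List<Set<String>> naivePhrase(List<Tok> toks) {
        List<Set<String>> positions = new ArrayList<>();
        for (Tok t : toks) {
            Set<String> slot = new HashSet<>();
            slot.add(t.text);
            positions.add(slot);
        }
        return positions;
    }

    /** True if, for some start offset, each query slot shares a term with the doc position. */
    static boolean matches(List<Set<String>> doc, List<Set<String>> query) {
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < query.size() && ok; i++) {
                Set<String> common = new HashSet<>(query.get(i));
                common.retainAll(doc.get(start + i));
                ok = !common.isEmpty();
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The analyzer output for the single word, used for both indexing and querying.
        List<Tok> toks = Arrays.asList(new Tok("f\u00fc\u00fcm\u00f6\u00f6", 1), new Tok("fuum", 0));
        List<Set<String>> doc = groupByPosition(toks); // one position holding both terms

        System.out.println(matches(doc, naivePhrase(toks)));     // false: needs two positions
        System.out.println(matches(doc, groupByPosition(toks))); // true: one slot, two alternatives
    }
}
```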

Why does the parser do this, ignoring the position increment and creating a
*phrase* instead? Is this the intended behavior? If so, what is the
recommended way to handle this in Nutch? I assume my use case is common to
most languages.

Thanks,

Viktor