Posted to java-user@lucene.apache.org by "Embry, Clay" <Cl...@vignette.com> on 2008/03/10 19:28:50 UTC

phrase search with custom TokenFilter

Hi, I have written a TokenFilter which breaks up words with internal dot characters and adds the whole word plus the pieces as tokens in the stream. I am using that TokenFilter with the StandardAnalyzer to index my documents. Then I do searches using the StandardAnalyzer. Everything is working great except for some phrase searches. Here's an example:
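Roughly, my filter does something like this (a simplified sketch against the Lucene 2.x TokenStream API, not the exact code -- it passes every token through and appends the dot-separated pieces once the input is exhausted, which reproduces the MyAnalyzer listing below):

    import java.io.IOException;
    import java.util.LinkedList;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class DotSplitFilter extends TokenFilter {
      private final LinkedList<Token> pieces = new LinkedList<Token>();

      public DotSplitFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
          String text = t.termText();
          if (text.indexOf('.') >= 0) {
            int offset = t.startOffset();
            for (String piece : text.split("\\.")) {
              // new Tokens default to a positionIncrement of 1
              pieces.addLast(new Token(piece, offset, offset + piece.length()));
              offset += piece.length() + 1;  // +1 skips the dot
            }
          }
          return t;
        }
        // input exhausted: emit the collected pieces at the end
        return pieces.isEmpty() ? null : pieces.removeFirst();
      }
    }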

Document string
---------------
entity-cache.size-limit

StandardAnalyzer token - position increment
-------------------------------------------

(entity,0,6,type=<alphanum>) - 1
(cache.size,7,17,type=<host>) - 1
(limit,18,23,type=<alphanum>) - 1

MyAnalyzer token - position increment
-------------------------------------

(entity,0,6,type=<alphanum>) - 1
(cache.size,7,17,type=<host>) - 1
(limit,18,23,type=<alphanum>) - 1
(cache,7,12,type=<alphanum>) - 1
(size,13,17,type=<alphanum>) - 1

Search string (StandardAnalyzer)
--------------------------------
"cache.size limit"



The search finds the doc if I use the StandardAnalyzer to index, but not if I use MyAnalyzer to index. Can anyone see why that would be true? The first three Tokens of each TokenStream are exactly the same and it looks like both would be found by that search phrase. Do I need to change the position offsets on my extra Tokens or something?



Thanks for any help.

==

Clay Embry

Re: phrase search with custom TokenFilter

Posted by Chris Hostetter <ho...@fucit.org>.
You're going to want to change your TokenFilter so that it emits the split-piece 
tokens immediately after the original token, each with a 
positionIncrement of "0" .. don't buffer them up and wait for the entire 
stream to finish first.

It's the order of the tokens in the TokenStream and the positionIncrement 
that matter when doing a PhraseQuery -- not the start/end offsets.
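
A minimal sketch of that change, assuming a filter shaped like the one 
described above (class and variable names are made up, and this is against 
the old Lucene 2.x TokenStream API, not the actual code):

    import java.io.IOException;
    import java.util.LinkedList;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class DotSplitFilter extends TokenFilter {
      private final LinkedList<Token> pending = new LinkedList<Token>();

      public DotSplitFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        // drain any pieces queued from the previous original token
        if (!pending.isEmpty())
          return pending.removeFirst();
        Token t = input.next();
        if (t == null)
          return null;
        String text = t.termText();
        if (text.indexOf('.') >= 0) {
          int offset = t.startOffset();
          for (String piece : text.split("\\.")) {
            Token p = new Token(piece, offset, offset + piece.length());
            p.setPositionIncrement(0);  // stack at the original's position
            pending.addLast(p);
            offset += piece.length() + 1;  // +1 skips the dot
          }
        }
        return t;  // original token first; its pieces follow on later calls
      }
    }

With that, the example stream becomes entity (pos 0), cache.size / cache / 
size (all pos 1), limit (pos 2), so the two terms of "cache.size limit" are 
again one position apart and the phrase matches. Whether the pieces should 
instead advance positions (the way WordDelimiterFilter can) depends on which 
phrases you want to match.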

Incidentally: you might want to take a look at Solr's WordDelimiterFilter, 
both as an example of how to do this, and because it may already meet all 
the needs you've anticipated, plus some you haven't thought of yet but 
might want once you take a look at it...

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/WordDelimiterFilter.java?view=markup
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#WordDelimiterFilter



-Hoss

