Posted to solr-user@lucene.apache.org by Lox <lo...@gmail.com> on 2011/07/04 09:25:22 UTC

Payload doesn't apply to WordDelimiterFilterFactory-generated tokens

Hi, I have a problem with the WordDelimiterFilterFactory and the
DelimitedPayloadTokenFilterFactory.
It seems that the payloads are applied only to the original word that I
index and the WordDelimiterFilter doesn't apply the payloads to the tokens
it generates.

For example, suppose I index the string JavaProject|1.7. By the end of my
analyzer pipeline it is transformed like this:
JavaProject|1.7 -----> javaproject|1.7 java project

What I would like instead is a result like this:
JavaProject|1.7 -----> javaproject|1.7 java|1.7 project|1.7

This way the payload would still apply to the document on a partial match
of the original word.
The pipe notation in the output is only for illustration; think of those
payloads as already stored internally by Solr.

How can I do this?

If it is needed, my analyzer looks like this:
<fieldType name="text_C" class="solr.TextField" positionIncrementGap="100"
           stored="false" indexed="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^[a-z]{2,5}[0-9]{1,4}?([.]|[a-z])?(.*)"
            replacement="" replace="all"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateNumberParts="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="30"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
		.
		.
		.

Thank you.


--
View this message in context: http://lucene.472066.n3.nabble.com/Payload-doesn-t-apply-to-WordDelimiterFilterFactory-generated-tokens-tp3136748p3136748.html

Re: Payload doesn't apply to WordDelimiterFilterFactory-generated tokens

Posted by Chris Hostetter <ho...@fucit.org>.

: It seems that the payloads are applied only to the original word that I
: index and the WordDelimiterFilter doesn't apply the payloads to the tokens
: it generates.

I believe you are correct.  I think the general rule for most TokenFilters 
that you will find in Lucene/Solr is that they don't typically "clone" 
attributes (like payloads) when generating new Tokens -- it may be what 
you want in your use case, but there's no hard & fast rule that it would 
always make sense to do so.

If you'd like to open a Jira issue (or submit a patch), I suspect a new
"clonePayload" attribute could be added to the WDF factory to drive this
kind of behavior, so people with use cases where it makes sense could enable
it -- but I haven't looked at that code (or the current TokenStream API)
enough to have any idea how hard it would be.



-Hoss
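
The behavior discussed in this thread, copying the original token's payload
onto the sub-tokens that WordDelimiterFilter generates, can be sketched
without any Lucene dependency. What follows is only an illustration under
stated assumptions, not the WDF implementation: the Token class, the
propagate method, and the simplified offsets are all hypothetical. It assumes
the preserved original token is emitted first and that the generated parts
have offsets inside the original's span.

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadPropagationSketch {

    /** Hypothetical stand-in for a token: term, character offsets, optional payload. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;
        final Float payload; // null means "no payload attached"

        Token(String term, int startOffset, int endOffset, Float payload) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.payload = payload;
        }
    }

    /**
     * Copies the payload of the most recent payload-carrying token onto any
     * following payload-less token whose offsets lie within that token's span.
     * Tokens outside the span are passed through unchanged.
     */
    static List<Token> propagate(List<Token> in) {
        List<Token> out = new ArrayList<>();
        Token source = null; // last token seen that carried a payload
        for (Token t : in) {
            if (t.payload != null) {
                source = t;
                out.add(t);
            } else if (source != null
                    && t.startOffset >= source.startOffset
                    && t.endOffset <= source.endOffset) {
                out.add(new Token(t.term, t.startOffset, t.endOffset, source.payload));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed token stream after word splitting with the original preserved.
        List<Token> tokens = List.of(
                new Token("JavaProject", 0, 11, 1.7f),
                new Token("Java", 0, 4, null),
                new Token("Project", 4, 11, null),
                new Token("other", 12, 17, null));
        for (Token t : propagate(tokens)) {
            System.out.println(t.term + " -> " + t.payload);
        }
    }
}
```

In a real Solr setup the same logic would presumably live in a custom
TokenFilter placed after WordDelimiterFilterFactory in the analyzer, reading
and writing Lucene's PayloadAttribute and OffsetAttribute instead of the
hypothetical Token fields used here.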