
Posted to solr-user@lucene.apache.org by Shanmugavel SRD <sr...@gmail.com> on 2010/11/23 12:15:05 UTC

copyField is not tokenizing the values at index time

schema.xml config:

<fieldType name="textWordSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern=", *" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
</fieldType>
</types>

<fields>
    <field name="spellword" type="textWordSpell" indexed="true" stored="true" multiValued="true"/>
    <copyField source="keywords_t" dest="spellword"/>

feed.xml

    <field name="keywords_t"><![CDATA[Internet, Songs, Canada]]></field>

After indexing, if I search for spellword:[* TO *], the result is returned
as shown below.

Actual :
<arr name="spellword">
     <str>Internet, Songs, Canada</str>
</arr>


Expected :
<arr name="spellword">
     <str>Internet</str>
     <str>Songs</str>
     <str>Canada</str>
</arr>

Could anyone tell me what configuration I need to get the expected output
shown above?

-- 
View this message in context: http://lucene.472066.n3.nabble.com/copyField-is-not-tokenizing-the-values-at-index-time-tp1952756p1952756.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: copyField is not tokenizing the values at index time

Posted by Shanmugavel SRD <sr...@gmail.com>.

Thanks Erick.

Re: copyField is not tokenizing the values at index time

Posted by Erick Erickson <er...@gmail.com>.

I think you got fooled by what's returned as a field value. When you store a
field and later return that field as part of a document, your exact input is
returned *regardless* of what analysis has been done. So your *query* of
spellword:[* TO *] returns the stored value, not the indexed tokens.
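
To make the stored/indexed distinction concrete, here is a rough Python
emulation of the textWordSpell analyzer chain from the schema above. It is
only a sketch, not Solr code: the function name `analyze` is made up for
illustration, and it assumes PatternTokenizerFactory is splitting on the
pattern (its default behaviour when no group is given) and skipping empty
tokens.

```python
import re

def analyze(value, pattern=r", *"):
    """Hypothetical stand-in for the textWordSpell index-time analysis."""
    # PatternTokenizerFactory: split the input on the pattern; empty
    # pieces are assumed to be skipped.
    tokens = [t for t in re.split(pattern, value) if t]
    # LowerCaseFilterFactory: lowercase every token.
    tokens = [t.lower() for t in tokens]
    # RemoveDuplicatesTokenFilterFactory only drops a token that repeats
    # the same text at the same position. Each split token sits at its
    # own position here, so nothing is removed in this sketch.
    return tokens

stored_value = "Internet, Songs, Canada"   # what a stored field hands back
indexed_terms = analyze(stored_value)      # what searches actually match

print("stored :", [stored_value])
print("indexed:", indexed_terms)
```

The stored list keeps the single original string, while the indexed list
holds the three lowercased terms.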

I claim that if you examine your index via the admin page for spellword,
you'll see three distinct tokens. I further claim that if you interrogate
your spellword field with the spellcheck component, you'll get what you
expect. The proof is left as an exercise for the reader <G>...
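
If three separate stored values really are required, one option is to split
the string on the client before posting, because stored values mirror the
input. This is an assumption on my part, not something proposed in this
thread. The sketch below builds one <field> element per keyword for the feed
file. The helper name `split_fields` is hypothetical, and the target field
must be declared multiValued="true" (spellword is, in the schema above; the
definition of keywords_t is not shown in the thread).

```python
import re
from xml.sax.saxutils import escape

def split_fields(name, raw, pattern=r", *"):
    """Return one <field> element per delimited value (illustrative helper)."""
    # Split on the same delimiter pattern the analyzer uses, dropping empties.
    values = [v for v in re.split(pattern, raw) if v]
    # Escape each value so characters like & and < stay valid inside the XML.
    return ['<field name="{}">{}</field>'.format(name, escape(v)) for v in values]

for line in split_fields("spellword", "Internet, Songs, Canada"):
    print(line)
```

Each generated element becomes its own stored value, so the response would
list the keywords as separate <str> entries.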

Best
Erick
