Posted to solr-dev@lucene.apache.org by Nathan Folkman <na...@gmail.com> on 2009/02/04 23:41:37 UTC

Tokenizer Question

I'm having trouble getting the following queries to work as I'd expect:

tag_calais:"company" -> should match: company:IBM Business Partners
tag_calais:"products" -> should match: industryterm:business products,
industryterm:Industrial products, industryterm:Consumer products
domain:"com.*"
domain:"com.ibm*"

I thought it might have something to do with how the indexed data was  
getting tokenized?

schema.xml:

<types>
     <fieldType name="calais" class="solr.StrField">
         <analyzer>
             <tokenizer class="solr.PatternTokenizerFactory" pattern=": *" group="-1" />
         </analyzer>
     </fieldType>
     <fieldType name="domain" class="solr.StrField">
         <analyzer>
             <tokenizer class="solr.PatternTokenizerFactory" pattern=". *" group="-1" />
         </analyzer>
     </fieldType>
     ...
</types>
<fields>
     <field name="domain" type="domain" indexed="true" stored="true" required="true" />
     <field name="tag_calais" type="calais" indexed="true" stored="true" multiValued="true" />
     ...
</fields>

Example document:

<?xml version="1.0" ?>
<add>
   <doc>
     <field name="domain">
       com.ibm
     </field>
     <field name="tag_calais">
       industryterm:business products
     </field>
     <field name="tag_calais">
       industryterm:Industrial products
     </field>
     <field name="tag_calais">
       industryterm:Consumer products
     </field>
     <field name="tag_calais">
       country:United States
     </field>
     <field name="tag_calais">
       company:IBM Business Partners
     </field>
     ...
   </doc>
</add>

Any suggestions? Thanks!

- n

Re: Tokenizer Question

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 4, 2009, at 5:41 PM, Nathan Folkman wrote:
> I'm having trouble getting the following queries to work as I'd  
> expect:
>
> tag_calais:"company" -> should match: company:IBM Business Partners
> tag_calais:"products" -> should match: industryterm:business  
> products, industryterm:Industrial products, industryterm:Consumer  
> products
> domain:"com.*"
> domain:"com.ibm*"
>
> I thought it might have something to do with how the indexed data  
> was getting tokenized?

Or not tokenized, in your case...

>
> schema.xml:
>
> <types>
>    <fieldType name="calais" class="solr.StrField">
>        <analyzer>
>        <tokenizer class="solr.PatternTokenizerFactory" pattern=": *"  
> group="-1" />
>    </analyzer>
>    </fieldType>
>    <fieldType name="domain" class="solr.StrField">
>    <analyzer>
>        <tokenizer class="solr.PatternTokenizerFactory" pattern=". *"  
> group="-1" />
>    </analyzer>
>    </fieldType>
>    ...
> </types>


StrField is never tokenized, even if you specify an analyzer for it, so
the tokenizer settings above are ignored.  Use TextField instead.
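
As a rough illustration of that change, a minimal sketch of the two field
types might look like the following. The field type names and the calais
pattern are taken from the schema quoted above. The escaped dot in the
domain pattern is an assumption about the intent (splitting on a literal
"."), since an unescaped "." in a regular expression matches any
character and would likely leave no usable tokens.

```xml
<types>
  <!-- Only the class changes from solr.StrField to solr.TextField,
       so that the configured tokenizer is actually applied. -->
  <fieldType name="calais" class="solr.TextField">
    <analyzer>
      <!-- group="-1" tells the tokenizer to split on the pattern -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern=": *" group="-1" />
    </analyzer>
  </fieldType>
  <fieldType name="domain" class="solr.TextField">
    <analyzer>
      <!-- Assumption: split on a literal dot, hence the backslash escape -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern="\. *" group="-1" />
    </analyzer>
  </fieldType>
</types>
```

With the ": *" pattern, a value such as "industryterm:business products"
would presumably be split into "industryterm" and "business products", so
a query for "products" on its own may still need a pattern that also
splits on whitespace. A reindex is needed after changing the field types.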

You can also use Solr's analysis tool in the admin interface
(/admin/analysis.jsp): set the field name or type, enter some sample
text, and see how it gets analyzed.

	Erik