Posted to solr-dev@lucene.apache.org by Nathan Folkman <na...@gmail.com> on 2009/02/04 23:41:37 UTC
Tokenizer Question
I'm having trouble getting the following queries to work as I'd expect:
tag_calais:"company" -> should match: company:IBM Business Partners
tag_calais:"products" -> should match: industryterm:business products,
industryterm:Industrial products, industryterm:Consumer products
domain:"com.*"
domain:"com.ibm*"
I thought it might have something to do with how the indexed data was
getting tokenized?
schema.xml:
<types>
  <fieldType name="calais" class="solr.StrField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=": *" group="-1" />
    </analyzer>
  </fieldType>
  <fieldType name="domain" class="solr.StrField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=". *" group="-1" />
    </analyzer>
  </fieldType>
  ...
</types>
<fields>
  <field name="domain" type="domain" indexed="true" stored="true" required="true" />
  <field name="tag_calais" type="calais" indexed="true" stored="true" multiValued="true" />
  ...
</fields>
Example document:
<?xml version="1.0" ?>
<add>
  <doc>
    <field name="domain">
      com.ibm
    </field>
    <field name="tag_calais">
      industryterm:business products
    </field>
    <field name="tag_calais">
      industryterm:Industrial products
    </field>
    <field name="tag_calais">
      industryterm:Consumer products
    </field>
    <field name="tag_calais">
      country:United States
    </field>
    <field name="tag_calais">
      company:IBM Business Partners
    </field>
    ...
  </doc>
</add>
Any suggestions? Thanks!
- n
Re: Tokenizer Question
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 4, 2009, at 5:41 PM, Nathan Folkman wrote:
> I'm having trouble getting the following queries to work as I'd
> expect:
>
> tag_calais:"company" -> should match: company:IBM Business Partners
> tag_calais:"products" -> should match: industryterm:business
> products, industryterm:Industrial products, industryterm:Consumer
> products
> domain:"com.*"
> domain:"com.ibm*"
>
> I thought it might have something to do with how the indexed data
> was getting tokenized?
Or not tokenized in your case....
>
> schema.xml:
>
> <types>
> <fieldType name="calais" class="solr.StrField">
> <analyzer>
> <tokenizer class="solr.PatternTokenizerFactory" pattern=": *"
> group="-1" />
> </analyzer>
> </fieldType>
> <fieldType name="domain" class="solr.StrField">
> <analyzer>
> <tokenizer class="solr.PatternTokenizerFactory" pattern=". *"
> group="-1" />
> </analyzer>
> </fieldType>
> ...
> </types>
StrField is not tokenized, even if you specify the analyzer. Use
TextField instead.
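As a minimal sketch of that change (untested, and assuming the rest of the schema stays as posted), the two field types could be declared as below. Only the class attribute follows from the advice above. The escaped dot in the domain pattern is an extra assumption about the intent: an unescaped "." in a regular expression matches any character, not just a literal dot.

```xml
<!-- Sketch only: TextField runs the configured analyzer; StrField does not. -->
<fieldType name="calais" class="solr.TextField">
  <analyzer>
    <!-- group="-1" splits on the pattern: a colon plus any following spaces -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=": *" group="-1" />
  </analyzer>
</fieldType>
<fieldType name="domain" class="solr.TextField">
  <analyzer>
    <!-- Assumption: the goal is to split on literal dots, so the dot is escaped -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\." group="-1" />
  </analyzer>
</fieldType>
```

Documents have to be reindexed after a field type change. Note also that splitting only on ": *" would likely leave "business products" as a single token, so a query for "products" alone may still not match unless the pattern also splits on whitespace.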
Use Solr's analysis tool in the admin interface (/admin/analysis.jsp): set
the field name or type appropriately, put in some sample text, and see
how things get analyzed.
Erik