Posted to solr-user@lucene.apache.org by RW...@wiley.com on 2007/05/31 16:12:50 UTC

Schema question: overriding fieldType attributes in field element

I am trying to override the tokenized attribute of a single FieldType from 
the field attribute in schema.xml, but it doesn't seem to work and I can't 
figure out why. For example, if I define various fields to be of type 
solr.TextField, and use tokenized="false" for some and tokenized="true" 
for others, the fields are defined properly when schema.xml is read, but 
when documents are added to the index, all indexed fields are 
Field.Index.TOKENIZED, which is the default for solr.TextField (as if I 
had not used the tokenized attribute in the field element). And if I use 
solr.StrField as the field type, all indexed fields turn out to be 
Field.Index.UN_TOKENIZED: the default for solr.StrField. I am confirming 
the tokenized state of the fields by using Luke and by executing searches.

Any clues as to what I'm doing wrong?

-- Robert

PS: Yes, I know I could use solr.StrField for those fields I would like to 
be Field.Index.UN_TOKENIZED and solr.TextField for those I would like to 
be Field.Index.TOKENIZED, but my reading of the documentation and the code 
is that I should be able to do things the way I'm attempting them, and I 
have other reasons for wanting to consolidate all field attribute 
definitions to the field element.

Solr version: 1.1.0

==================================

Extract from schema.xml
-----------------------
    <types>
        <fieldtype name="wpsField" class="solr.StrField">
            <analyzer type="index" class="&index_analyzer;" />
            <analyzer type="query" class="&query_analyzer;" />
        </fieldtype>
    </types>
    <fields>
        <field name="term" type="wpsField" indexed="true" stored="true" tokenized="true" />
        <field name="termType" type="wpsField" indexed="true" stored="true" tokenized="false" />
        <field name="unstm_termExact" type="wpsField" indexed="true" stored="false" tokenized="false" />
        <field name="descriptorType" type="wpsField" indexed="true" stored="true" tokenized="false" />
        <field name="aqs" type="wpsField" indexed="true" stored="true" tokenized="false" multiValued="true" />
        <field name="treeNums" type="wpsField" indexed="true" stored="true" tokenized="false" multiValued="true" />
        <field name="scopeNote" type="wpsField" indexed="false" stored="true" tokenized="false" />
        <field name="see" type="wpsField" indexed="false" stored="true" tokenized="false" />
        <field name="permutation" type="wpsField" indexed="true" stored="true" tokenized="true" />
        <field name="all-fields" type="wpsField" indexed="false" stored="false" tokenized="true" />
        <field name="WPS_UNID_FIELD_CONSTANT" type="wpsField" indexed="true" stored="true" tokenized="false" />
        <dynamicField name="unstm_*" type="wpsField" indexed="true" stored="false" tokenized="true" multiValued="true" />
    </fields>
    <copyField source="term" dest="unstm_termExact" />

Output to catalina.out
----------------------
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: term{type=wpsField,properties=indexed,tokenized,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: termType{type=wpsField,properties=indexed,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: unstm_termExact{type=wpsField,properties=indexed}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: descriptorType{type=wpsField,properties=indexed,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: aqs{type=wpsField,properties=indexed,stored,multiValued}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: treeNums{type=wpsField,properties=indexed,stored,multiValued}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: scopeNote{type=wpsField,properties=stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: see{type=wpsField,properties=stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: permutation{type=wpsField,properties=indexed,tokenized,stored,multiValued}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: all-fields{type=wpsField,properties=tokenized}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: WPS_UNID_FIELD_CONSTANT{type=wpsField,properties=indexed,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: dynamic field defined: unstm_*{type=wpsField,properties=indexed,multiValued}

Screenshot from Luke
--------------------
o shows that field "term" is not tokenized even though it should be 
(according to schema.xml and field definition from 
org.apache.solr.schema.IndexSchema.readConfig).


Re: Schema question: overriding fieldType attributes in field element

Posted by Yonik Seeley <yo...@apache.org>.
On 5/31/07, Chris Hostetter <ho...@fucit.org> wrote:
> ...maybe there is some terminology confusion here

I think the issue is the additional restrictions:

"consolidate all field properties to the field element. The reason for
this is that the schema is read by another class to give access to
field properties and more outside the Solr context."

But I think either a better or smarter schema reader is warranted
instead.  If it can't read fieldType declarations, perhaps use a field
naming convention like solr flare does for which fields to facet on.
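
One possible reading of the naming-convention idea, as a rough sketch only: an outside schema reader could infer how a field is indexed from its name suffix instead of from a per-field attribute. The suffixes "_t" and "_s" and the type names "wpsText" and "wpsString" are hypothetical, chosen for illustration; they are not taken from Solr Flare or from the schema posted earlier in this thread.

```xml
<!-- Fragment of schema.xml. The suffixes and type names are hypothetical;
     "wpsText" is assumed to be an analyzed solr.TextField type and
     "wpsString" a solr.StrField type, both declared elsewhere in <types>. -->
<fields>
    <!-- Any field ending in _t is analyzed. -->
    <dynamicField name="*_t" type="wpsText" indexed="true" stored="true" />
    <!-- Any field ending in _s is indexed as a single unanalyzed term. -->
    <dynamicField name="*_s" type="wpsString" indexed="true" stored="true" />
</fields>
```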

-Yonik

Re: Schema question: overriding fieldType attributes in field element

Posted by Chris Hostetter <ho...@fucit.org>.
: Unfortunately, unless I've missed something obvious, the "tokenized"
: property is not available to classes that extend FieldType: the setArgs()
: method of FieldType strips "tokenized" and other standard properties away
: before calling the init() method. Yes, of course one could override
: setArgs(), but that's not a robust solution.

in an ideal world Solr would not strip that property from the Map, since
it doesn't care about it, but since it does, can't your init method just
call "isTokenized()" to determine its value (like any of the other
properties handled automatically) ... the built-in field types ignore it,
but you could write a custom FieldType that inspects it.

: The terminology confusion stems (sorry, pun sort of not intended) from the
: frequent overlap of the terms "tokenize" and "analyze". As I mentioned in
: an earlier message on this thread, it is quite possible to create an
: Analyzer that does all sorts of things without tokenizing, or, more
: precisely, creates a single Token from the field value. I would posit that
: tokenization and analysis are two separate things, albeit most frequently
: done together.

The semi-equivalence of the word "tokenize" when referring to fields and the
broader concept of "Analysis" originates with Lucene: in Lucene you
declare a field TOKENIZED if you want the Analyzer used at all --
regardless of what the Analyzer does.  While I agree "ANALYZED" would
have been a better name for that constant, in practice the distinction is
so subtle it almost doesn't matter: what you describe as "an Analyzer that
does all sorts of things without tokenizing" I would call "an Analyzer
that tokenizes its input into a single token, and then does all
sorts of things".  KeywordTokenizer works exactly like this.



-Hoss


Re: Schema question: overriding fieldType attributes in field element

Posted by RW...@wiley.com.
Chris Hostetter <ho...@fucit.org> wrote on 05/31/2007 02:28:58 PM:

> I'm having a little trouble following this discussion, first off as to
> your immediate issue...
>
> : Thanks, but I think I'm going to have to work out a different solution. I
> : have written my own analyzer that does everything I need: it's not a
> : different analyzer I need but a way to specify that certain fields should
> : be tokenized and others not -- while still leaving all other options open.
>
> ...maybe there is some terminology confusion here ... if you've already
> got an "Analyzer" (capital A Lucene classname) then you can specify it for
> one fieldType, and use that field type for the fields you want analysis
> done.  if you have other fields where you don't want tokenizing/analysis
> done, use a different fieldType (with a StrField).
>
This is precisely what I've done (but see below for more).

> As for your followup question...
>
> : As far as the generic options parsing resulting in unused properties in a
> : SchemaField object, no, it is not specifically documented anywhere, but
> : the Solr Wiki lists, for both fields and field types: "Common options that
> : fields can have are...". I could not find anywhere a definitive list of
> : what is allowed/used or excluded, so I went to the code and found that the
>
> That's because there is no definitive list.  Every FieldType can define
> its own list of attributes that can be declared and handled by its own
> init method.
>
> 
Unfortunately, unless I've missed something obvious, the "tokenized" 
property is not available to classes that extend FieldType: the setArgs() 
method of FieldType strips "tokenized" and other standard properties away 
before calling the init() method. Yes, of course one could override 
setArgs(), but that's not a robust solution.

The terminology confusion stems (sorry, pun sort of not intended) from the 
frequent overlap of the terms "tokenize" and "analyze". As I mentioned in 
an earlier message on this thread, it is quite possible to create an 
Analyzer that does all sorts of things without tokenizing, or, more 
precisely, creates a single Token from the field value. I would posit that 
tokenization and analysis are two separate things, albeit most frequently 
done together.

-- Robert


Re: Schema question: overriding fieldType attributes in field element

Posted by Chris Hostetter <ho...@fucit.org>.

I'm having a little trouble following this discussion, first off as to
your immediate issue...

: Thanks, but I think I'm going to have to work out a different solution. I
: have written my own analyzer that does everything I need: it's not a
: different analyzer I need but a way to specify that certain fields should
: be tokenized and others not -- while still leaving all other options open.

...maybe there is some terminology confusion here ... if you've already
got an "Analyzer" (capital A Lucene classname) then you can specify it for
one fieldType, and use that field type for the fields you want analysis
done.  if you have other fields where you don't want tokenizing/analysis
done, use a different fieldType (with a StrField).

As for your followup question...

: As far as the generic options parsing resulting in unused properties in a
: SchemaField object, no, it is not specifically documented anywhere, but
: the Solr Wiki lists, for both fields and field types: "Common options that
: fields can have are...". I could not find anywhere a definitive list of
: what is allowed/used or excluded, so I went to the code and found that the

That's because there is no definitive list.  Every FieldType can define
its own list of attributes that can be declared and handled by its own
init method.



-Hoss


Re: Schema question: overriding fieldType attributes in field element

Posted by Mike Klaas <mi...@gmail.com>.
On 31-May-07, at 8:47 AM, RWatkins@wiley.com wrote:

> Thanks, but I think I'm going to have to work out a different solution. I
> have written my own analyzer that does everything I need: it's not a
> different analyzer I need but a way to specify that certain fields should
> be tokenized and others not -- while still leaving all other options open.

Define two fieldTypes, and use one for "tokenized" analysis and  
another for "untokenized"?
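
A minimal sketch of what that could look like, assuming a custom analyzer class named com.example.WpsAnalyzer (a placeholder; the thread does not name the real class). The type names "wpsTokenized" and "wpsUntokenized" are also hypothetical; the field names are borrowed from the schema extract in the original post.

```xml
<!-- Fragment of schema.xml. Type names and the analyzer class are hypothetical. -->
<types>
    <!-- Analyzed: the custom Analyzer is applied to the field value. -->
    <fieldtype name="wpsTokenized" class="solr.TextField">
        <analyzer class="com.example.WpsAnalyzer" />
    </fieldtype>
    <!-- Not analyzed: the value is indexed verbatim as a single term. -->
    <fieldtype name="wpsUntokenized" class="solr.StrField" />
</types>
<fields>
    <field name="term" type="wpsTokenized" indexed="true" stored="true" />
    <field name="termType" type="wpsUntokenized" indexed="true" stored="true" />
</fields>
```

Other per-field options such as stored and multiValued can still be set on each field element; only the analyzed/unanalyzed choice moves to the type.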

-Mike

Re: Schema question: overriding fieldType attributes in field element

Posted by RW...@wiley.com.
Thanks, but I think I'm going to have to work out a different solution. I 
have written my own analyzer that does everything I need: it's not a 
different analyzer I need but a way to specify that certain fields should 
be tokenized and others not -- while still leaving all other options open.

As far as the generic options parsing resulting in unused properties in a 
SchemaField object, no, it is not specifically documented anywhere, but 
the Solr Wiki lists, for both fields and field types: "Common options that 
fields can have are...". I could not find anywhere a definitive list of 
what is allowed/used or excluded, so I went to the code and found that the 
"tokenized" property would indeed be respected in SchemaField.

-- Robert

yseeley@gmail.com wrote on 05/31/2007 11:30:04 AM:

> On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> > You say the "tokenized" attribute is not settable from the schema, but the
> > output from IndexSchema.readConfig shows that the properties are indeed
> > read, and the resulting SchemaField object retains these properties: are
> > they then ignored?
>
> Not sure off the top of my head, but don't use it... it shouldn't be
> documented anywhere.
> It probably slipped through as part of generic options parsing.
>
> > > "untokenized" means don't use the analyzer.   If you don't want an
> > > analyzer, then use the "string" type.
> > >
> > This is true only in the simplest of cases. An analyzer can do far more
> > than tokenize: it can stem, change to lower case, etc. What if you want
> > one or more of these things to happen, but you don't want tokenization?
>
> From a Lucene perspective, if you create an untokenized field, the
> analyzer will not be used at all.  It should have probably been named
> unanalyzed, as that's more accurate.
>
> KeywordTokenizer (via KeywordTokenizerFactory) is probably what you
> are looking for.
> Create a new text field type with that as the tokenizer, followed by
> whatever filters you want (like lowercasing).
>
> -Yonik


Re: Schema question: overriding fieldType attributes in field element

Posted by Yonik Seeley <yo...@apache.org>.
On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> You say the "tokenized" attribute is not settable from the schema, but the
> output from IndexSchema.readConfig shows that the properties are indeed
> read, and the resulting SchemaField object retains these properties: are
> they then ignored?

Not sure off the top of my head, but don't use it... it shouldn't be
documented anywhere.
It probably slipped through as part of generic options parsing.

> > "untokenized" means don't use the analyzer.   If you don't want an
> > analyzer, then use the "string" type.
> >
> This is true only in the simplest of cases. An analyzer can do far more
> than tokenize: it can stem, change to lower case, etc. What if you want
> one or more of these things to happen, but you don't want tokenization?

From a Lucene perspective, if you create an untokenized field, the
analyzer will not be used at all.  It should have probably been named
unanalyzed, as that's more accurate.

KeywordTokenizer (via KeywordTokenizerFactory) is probably what you
are looking for.
Create a new text field type with that as the tokenizer, followed by
whatever filters you want (like lowercasing).
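
A minimal sketch of such a field type, assuming the stock solr.KeywordTokenizerFactory and solr.LowerCaseFilterFactory are available in the Solr version in use. The type name "wpsExact" is hypothetical; the field and copyField names are taken from the schema extract in the original post.

```xml
<!-- Fragment of schema.xml. The type name "wpsExact" is hypothetical. -->
<types>
    <fieldtype name="wpsExact" class="solr.TextField">
        <analyzer>
            <!-- Emits the entire field value as a single token. -->
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <!-- Lowercases that token so matching is case-insensitive. -->
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldtype>
</types>
<fields>
    <field name="unstm_termExact" type="wpsExact" indexed="true" stored="false" />
</fields>
<copyField source="term" dest="unstm_termExact" />
```

With a type along these lines, a value such as "Back Pain" should be indexed as the single term "back pain", so a query for "pain" alone would not be expected to match it, while "Pain" and "pain" would both match a document whose value is just "Pain".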

-Yonik

Re: Schema question: overriding fieldType attributes in field element

Posted by RW...@wiley.com.
Thanks for the prompt response. Comments below ...

yseeley@gmail.com wrote on 05/31/2007 10:55:57 AM:

> On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> > I am trying to override the tokenized attribute of a single FieldType from
> > the field attribute in schema.xml, but it doesn't seem to work
> 
> The "tokenized" attribute is not settable from the schema, and there
> is no reason I can think of why this would be useful rather than
> confusing.
>
You say the "tokenized" attribute is not settable from the schema, but the 
output from IndexSchema.readConfig shows that the properties are indeed 
read, and the resulting SchemaField object retains these properties: are 
they then ignored?

> "untokenized" means don't use the analyzer.   If you don't want an
> analyzer, then use the "string" type.
>
This is true only in the simplest of cases. An analyzer can do far more 
than tokenize: it can stem, change to lower case, etc. What if you want 
one or more of these things to happen, but you don't want tokenization? In 
this particular case I want to be able to make exact matches on the entire 
field, so that a search for "+termExact:pain" (remember that my searches 
are case insensitive, thanks to my analyzer (and regardless of 
tokenization)) will return _only_ the document in which the termExact 
field contains the single word "Pain" or "pain", and not "Back Pain", etc.

> > PS: Yes, I know I could use solr.StrField for those fields
> 
> Could you provide a use-case why you don't want to use StrField
> (normally type "string" in the schema)?  What is the external behaviour
> you are looking for?
>
Part of the answer to this question is in the last paragraph, but perhaps 
you want to know why I would like to consolidate all field properties to 
the field element. The reason for this is that the schema is read by 
another class to give access to field properties and more outside the Solr 
context.

-- Robert



Re: Schema question: overriding fieldType attributes in field element

Posted by Yonik Seeley <yo...@apache.org>.
On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> I am trying to override the tokenized attribute of a single FieldType from
> the field attribute in schema.xml, but it doesn't seem to work

The "tokenized" attribute is not settable from the schema, and there
is no reason I can think of why this would be useful rather than
confusing.

"untokenized" means don't use the analyzer.   If you don't want an
analyzer, then use the "string" type.

> PS: Yes, I know I could use solr.StrField for those fields

Could you provide a use-case why you don't want to use StrField
(normally type "string" in the schema)?  What is the external behavior
you are looking for?

-Yonik