Posted to solr-user@lucene.apache.org by RW...@wiley.com on 2007/05/31 16:12:50 UTC
Schema question: overriding fieldType attributes in field element
I am trying to override the tokenized attribute of a single FieldType from
the field attribute in schema.xml, but it doesn't seem to work and I can't
figure out why. For example, if I define various fields to be of type
solr.TextField, and use tokenized="false" for some and tokenized="true"
for others, the fields are defined properly when schema.xml is read, but
when documents are added to the index, all indexed fields are
Field.Index.TOKENIZED, which is the default for solr.TextField (as if I
had not used the tokenized attribute in the field element). And if I use
solr.StrField as the field type, all indexed fields turn out to be
Field.Index.UN_TOKENIZED: the default for solr.StrField. I am confirming
the tokenized state of the fields by using Luke and by executing searches.
Any clues as to what I'm doing wrong?
-- Robert
PS: Yes, I know I could use solr.StrField for those fields I would like to
be Field.Index.UN_TOKENIZED and solr.TextField for those I would like to
be Field.Index.TOKENIZED, but my reading of the documentation and the code
is that I should be able to do things the way I'm attempting them, and I
have other reasons for wanting to consolidate all field attribute
definitions to the field element.
Solr version: 1.1.0
==================================
Extract from schema.xml
-----------------------
<types>
<fieldtype name="wpsField" class="solr.StrField">
<analyzer type="index" class="&index_analyzer;" />
<analyzer type="query" class="&query_analyzer;" />
</fieldtype>
</types>
<fields>
<field name="term" type="wpsField" indexed="true" stored="true"
tokenized="true" />
<field name="termType" type="wpsField" indexed="true"
stored="true" tokenized="false" />
<field name="unstm_termExact" type="wpsField" indexed="true"
stored="false" tokenized="false" />
<field name="descriptorType" type="wpsField" indexed="true"
stored="true" tokenized="false" />
<field name="aqs" type="wpsField" indexed="true" stored="true"
tokenized="false" multiValued="true" />
<field name="treeNums" type="wpsField" indexed="true"
stored="true" tokenized="false" multiValued="true" />
<field name="scopeNote" type="wpsField" indexed="false"
stored="true" tokenized="false" />
<field name="see" type="wpsField" indexed="false" stored="true"
tokenized="false" />
<field name="permutation" type="wpsField" indexed="true"
stored="true" tokenized="true" />
<field name="all-fields" type="wpsField" indexed="false"
stored="false" tokenized="true" />
<field name="WPS_UNID_FIELD_CONSTANT" type="wpsField"
indexed="true" stored="true" tokenized="false" />
<dynamicField name="unstm_*" type="wpsField" indexed="true"
stored="false" tokenized="true" multiValued="true" />
</fields>
<copyField source="term" dest="unstm_termExact" />
Output to catalina.out
----------------------
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined:
term{type=wpsField,properties=indexed,tokenized,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: termType{type=wpsField,properties=indexed,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: unstm_termExact{type=wpsField,properties=indexed}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined:
descriptorType{type=wpsField,properties=indexed,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined:
aqs{type=wpsField,properties=indexed,stored,multiValued}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined:
treeNums{type=wpsField,properties=indexed,stored,multiValued}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: scopeNote{type=wpsField,properties=stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: see{type=wpsField,properties=stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined:
permutation{type=wpsField,properties=indexed,tokenized,stored,multiValued}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined: all-fields{type=wpsField,properties=tokenized}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: field defined:
WPS_UNID_FIELD_CONSTANT{type=wpsField,properties=indexed,stored}
May 31, 2007 9:25:49 AM org.apache.solr.schema.IndexSchema readConfig
FINE: dynamic field defined:
unstm_*{type=wpsField,properties=indexed,multiValued}
Screenshot from Luke
--------------------
o shows that field "term" is not tokenized even though it should be
(according to schema.xml and field definition from
org.apache.solr.schema.IndexSchema.readConfig).
Re: Schema question: overriding fieldType attributes in field element
Posted by Yonik Seeley <yo...@apache.org>.
On 5/31/07, Chris Hostetter <ho...@fucit.org> wrote:
> ...maybe there is some terminology confusion here
I think the issue is the additional restrictions:
"consolidate all field properties to the field element. The reason for
this is that the schema is read by another class to give access to
field properties and more outside the Solr context."
But I think either a better or smarter schema reader is warranted
instead. If it can't read fieldType declarations, perhaps use a field
naming convention like solr flare does for which fields to facet on.
-Yonik
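One way to read the naming-convention suggestion above, as a sketch only: encode the analysis behaviour in a field-name suffix, so that an external schema reader can infer it from the name alone without resolving fieldtype declarations. The suffixes _tok and _str are hypothetical (they are not taken from this thread or from Solr Flare), and the fragment assumes two field types named wpsText (analyzed, solr.TextField) and wpsString (solr.StrField) are declared in the types section.

```xml
<fields>
  <!-- hypothetical convention: *_tok fields go through the analyzer -->
  <dynamicField name="*_tok" type="wpsText" indexed="true"
    stored="true" multiValued="true" />
  <!-- hypothetical convention: *_str fields are indexed verbatim -->
  <dynamicField name="*_str" type="wpsString" indexed="true"
    stored="true" multiValued="true" />
</fields>
```

The trade-off is that the behaviour is tied to the field name, so changing how a field is analyzed means renaming it and reindexing.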
Re: Schema question: overriding fieldType attributes in field element
Posted by Chris Hostetter <ho...@fucit.org>.
: Unfortunately, unless I've missed something obvious, the "tokenized"
: property is not available to classes that extend FieldType: the setArgs()
: method of FieldType strips "tokenized" and other standard properties away
: before calling the init() method. Yes, of course one could override
: setArgs(), but that's not a robust solution.
in an ideal world Solr would not strip that property from the Map, since
it doesn't care about it, but since it does, can't your init method just
call "isTokenized()" to determine its value (like any of the other
properties handled automatically) ... the built-in field types ignore it,
but you could write a custom FieldType that inspects it.
: The terminology confusion stems (sorry, pun sort of not intended) from the
: frequent overlap of the terms "tokenize" and "analyze". As I mentioned in
: an earlier message on this thread, it is quite possible to create an
: Analyzer that does all sorts of things without tokenizing, or, more
: precisely, creates a single Token from the field value. I would posit that
: tokenization and analysis are two separate things, albeit most frequently
: done together.
The semi-equivalence of the word "tokenize" when referring to fields and the
broader concept of "Analysis" originates with Lucene: in Lucene you
declare a field TOKENIZED if you want the Analyzer used at all --
regardless of what the Analyzer does. While I agree "ANALYZED" would
have been a better name for that constant, in practice the distinction is
so subtle it almost doesn't matter: what you describe as "an Analyzer that
does all sorts of things without tokenizing" I would call "an Analyzer
that tokenizes its input into a single token, and then does all
sorts of things". KeywordTokenizer works exactly like this.
-Hoss
Re: Schema question: overriding fieldType attributes in field element
Posted by RW...@wiley.com.
Chris Hostetter <ho...@fucit.org> wrote on 05/31/2007 02:28:58 PM:
> I'm having a little trouble following this discussion, first off as to
> your immediate issue...
>
> : Thanks, but I think I'm going to have to work out a different solution. I
> : have written my own analyzer that does everything I need: it's not a
> : different analyzer I need but a way to specify that certain fields should
> : be tokenized and others not -- while still leaving all other options open.
>
> ...maybe there is some terminology confusion here ... if you've already
> got an "Analyzer" (capital A Lucene classname) then you can specify it for
> one fieldType, and use that field type for the fields you want analysis
> done. if you have other fields where you don't want tokenizing/analysis
> done, use a different fieldType (with a StrField).
>
This is precisely what I've done (but see below for more).
> As for your followup question...
>
> : As far as the generic options parsing resulting in unused properties in a
> : SchemaField object, no, it is not specifically documented anywhere, but
> : the Solr Wiki lists, for both fields and field types: "Common options that
> : fields can have are...". I could not find anywhere a definitive list of
> : what is allowed/used or excluded, so I went to the code and found that the
>
> That's because there is no definitive list. Every FieldType can define
> its own list of attributes that can be declared and handled by its own
> init method.
>
Unfortunately, unless I've missed something obvious, the "tokenized"
property is not available to classes that extend FieldType: the setArgs()
method of FieldType strips "tokenized" and other standard properties away
before calling the init() method. Yes, of course one could override
setArgs(), but that's not a robust solution.
The terminology confusion stems (sorry, pun sort of not intended) from the
frequent overlap of the terms "tokenize" and "analyze". As I mentioned in
an earlier message on this thread, it is quite possible to create an
Analyzer that does all sorts of things without tokenizing, or, more
precisely, creates a single Token from the field value. I would posit that
tokenization and analysis are two separate things, albeit most frequently
done together.
-- Robert
Re: Schema question: overriding fieldType attributes in field element
Posted by Chris Hostetter <ho...@fucit.org>.
I'm having a little trouble following this discussion, first off as to
your immediate issue...
: Thanks, but I think I'm going to have to work out a different solution. I
: have written my own analyzer that does everything I need: it's not a
: different analyzer I need but a way to specify that certain fields should
: be tokenized and others not -- while still leaving all other options open.
...maybe there is some terminology confusion here ... if you've already
got an "Analyzer" (capital A Lucene classname) then you can specify it for
one fieldType, and use that field type for the fields you want analysis
done. if you have other fields where you don't want tokenizing/analysis
done, use a different fieldType (with a StrField).
As for your followup question...
: As far as the generic options parsing resulting in unused properties in a
: SchemaField object, no, it is not specifically documented anywhere, but
: the Solr Wiki lists, for both fields and field types: "Common options that
: fields can have are...". I could not find anywhere a definitive list of
: what is allowed/used or excluded, so I went to the code and found that the
That's because there is no definitive list. Every FieldType can define
its own list of attributes that can be declared and handled by its own
init method.
-Hoss
Re: Schema question: overriding fieldType attributes in field element
Posted by Mike Klaas <mi...@gmail.com>.
On 31-May-07, at 8:47 AM, RWatkins@wiley.com wrote:
> Thanks, but I think I'm going to have to work out a different solution. I
> have written my own analyzer that does everything I need: it's not a
> different analyzer I need but a way to specify that certain fields should
> be tokenized and others not -- while still leaving all other options open.
Define two fieldTypes, and use one for "tokenized" analysis and
another for "untokenized"?
-Mike
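A minimal sketch of the two-fieldType approach suggested above, reusing the field names and analyzer entities from the original schema extract. The type names wpsText and wpsString are hypothetical; the point is only that the analyzed/unanalyzed choice is made by picking a type, not by a tokenized attribute on the field.

```xml
<types>
  <!-- analyzed: the custom analyzer is applied at index and query time -->
  <fieldtype name="wpsText" class="solr.TextField">
    <analyzer type="index" class="&index_analyzer;" />
    <analyzer type="query" class="&query_analyzer;" />
  </fieldtype>
  <!-- unanalyzed: the value is indexed verbatim as a single term -->
  <fieldtype name="wpsString" class="solr.StrField" />
</types>
<fields>
  <field name="term" type="wpsText" indexed="true" stored="true" />
  <field name="termType" type="wpsString" indexed="true"
    stored="true" />
</fields>
```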
Re: Schema question: overriding fieldType attributes in field element
Posted by RW...@wiley.com.
Thanks, but I think I'm going to have to work out a different solution. I
have written my own analyzer that does everything I need: it's not a
different analyzer I need but a way to specify that certain fields should
be tokenized and others not -- while still leaving all other options open.
As far as the generic options parsing resulting in unused properties in a
SchemaField object, no, it is not specifically documented anywhere, but
the Solr Wiki lists, for both fields and field types: "Common options that
fields can have are...". I could not find anywhere a definitive list of
what is allowed/used or excluded, so I went to the code and found that
"tokenized" would indeed be respected in SchemaField.
-- Robert
yseeley@gmail.com wrote on 05/31/2007 11:30:04 AM:
> On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> > You say the "tokenized" attribute is not settable from the schema, but the
> > output from IndexSchema.readConfig shows that the properties are indeed
> > read, and the resulting SchemaField object retains these properties: are
> > they then ignored?
>
> Not sure off the top of my head, but don't use it... it shouldn't be
> documented anywhere.
> It probably slipped through as part of generic options parsing.
>
> > > "untokenized" means don't use the analyzer. If you don't want an
> > > analyzer, then use the "string" type.
> > >
> > This is true only in the simplest of cases. An analyzer can do far more
> > than tokenize: it can stem, change to lower case, etc. What if you want
> > one or more of these things to happen, but you don't want tokenization?
>
> From a Lucene perspective, if you create an untokenized field, the
> analyzer will not be used at all. It should have probably been named
> unanalyzed, as that's more accurate.
>
> KeywordTokenizer (via KeywordTokenizerFactory) is probably what you
> are looking for.
> Create a new text field type with that as the tokenizer, followed by
> whatever filters you want (like lowercasing).
>
> -Yonik
Re: Schema question: overriding fieldType attributes in field element
Posted by Yonik Seeley <yo...@apache.org>.
On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> You say the "tokenized" attribute is not settable from the schema, but the
> output from IndexSchema.readConfig shows that the properties are indeed
> read, and the resulting SchemaField object retains these properties: are
> they then ignored?
Not sure off the top of my head, but don't use it... it shouldn't be
documented anywhere.
It probably slipped through as part of generic options parsing.
> > "untokenized" means don't use the analyzer. If you don't want an
> > analyzer, then use the "string" type.
> >
> This is true only in the simplest of cases. An analyzer can do far more
> than tokenize: it can stem, change to lower case, etc. What if you want
> one or more of these things to happen, but you don't want tokenization?
From a Lucene perspective, if you create an untokenized field, the
analyzer will not be used at all. It should have probably been named
unanalyzed, as that's more accurate.
KeywordTokenizer (via KeywordTokenizerFactory) is probably what you
are looking for.
Create a new text field type with that as the tokenizer, followed by
whatever filters you want (like lowercasing).
-Yonik
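The advice above could look roughly like the following in schema.xml. This is a sketch under assumptions, not a tested configuration: the type name wpsExact is hypothetical, and you should check that solr.KeywordTokenizerFactory and solr.LowerCaseFilterFactory are available in your Solr version. The field and copyField lines reuse names from the original schema extract.

```xml
<types>
  <!-- the whole field value becomes one token, which is then lowercased -->
  <fieldtype name="wpsExact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldtype>
</types>
<fields>
  <field name="unstm_termExact" type="wpsExact" indexed="true"
    stored="false" />
</fields>
<copyField source="term" dest="unstm_termExact" />
```

With a type like this, a value such as "Back Pain" is indexed as the single term "back pain", so a query for the term pain on that field matches only documents whose entire value is "Pain" or "pain".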
Re: Schema question: overriding fieldType attributes in field element
Posted by RW...@wiley.com.
Thanks for the prompt response. Comments below ...
yseeley@gmail.com wrote on 05/31/2007 10:55:57 AM:
> On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> > I am trying to override the tokenized attribute of a single FieldType from
> > the field attribute in schema.xml, but it doesn't seem to work
>
> The "tokenized" attribute is not settable from the schema, and there
> is no reason I can think of why this would be useful rather than
> confusing.
>
You say the "tokenized" attribute is not settable from the schema, but the
output from IndexSchema.readConfig shows that the properties are indeed
read, and the resulting SchemaField object retains these properties: are
they then ignored?
> "untokenized" means don't use the analyzer. If you don't want an
> analyzer, then use the "string" type.
>
This is true only in the simplest of cases. An analyzer can do far more
than tokenize: it can stem, change to lower case, etc. What if you want
one or more of these things to happen, but you don't want tokenization? In
this particular case I want to be able to make exact matches on the entire
field, so that a search for "+termExact:pain" (remember that my searches
are case insensitive, thanks to my analyzer (and regardless of
tokenization)) will return _only_ the document in which the termExact
field contains the single word "Pain" or "pain", and not "Back Pain", etc.
> > PS: Yes, I know I could use solr.StrField for those fields
>
> Could you provide a use-case why you don't want to use StrField
> (normally type "string" in the schema)? What is the external behaviour
> you are looking for?
>
Part of the answer to this question is in the last paragraph, but perhaps
you want to know why I would like to consolidate all field properties to
the field element. The reason for this is that the schema is read by
another class to give access to field properties and more outside the Solr
context.
-- Robert
Re: Schema question: overriding fieldType attributes in field element
Posted by Yonik Seeley <yo...@apache.org>.
On 5/31/07, RWatkins@wiley.com <RW...@wiley.com> wrote:
> I am trying to override the tokenized attribute of a single FieldType from
> the field attribute in schema.xml, but it doesn't seem to work
The "tokenized" attribute is not settable from the schema, and there
is no reason I can think of why this would be useful rather than
confusing.
"untokenized" means don't use the analyzer. If you don't want an
analyzer, then use the "string" type.
> PS: Yes, I know I could use solr.StrField for those fields
Could you provide a use-case why you don't want to use StrField
(normally type "string" in the schema)? What is the external behavior
you are looking for?
-Yonik