Posted to solr-user@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2008/01/17 22:53:18 UTC

copyField limitation

Hi-
 
Because sort works much faster on type 'integer', but range queries do not
work on type 'integer', I want to do this:
 
 
          <field name="a_number" type="sint" multiValued="false"
default="-1" stored="true" indexed="true" required="true"/>
          <field name="a_number_sort" type="integer" multiValued="false"
stored="false" indexed="true" required="true"/>
...
 
        <copyField source="a_number" dest="a_number_sort" />

So, a_number_sort always contains the same data as a_number, and now we can
do a fast sort on 'a_number_sort' AND do a range query on 'a_number'. The
two will always have the same data.

But, the <copyField> directive in the schema has a limitation. It will only
copy data between fields with the same type. If the two fields are a
different type, the copy is ignored. This example would require <copyField>
to translate 'sint' to 'integer'. 
 
Another case is days (not times):
 
    field day_source, type="time", defaults to TODAY, is not stored or
indexed.
    field day, type="string", required, stored & indexed, no default
    <copyField source="day_source" dest="day"/>
 
This would express the date as a string 2008-xx-xxT00:00:00Z and store that
into the day field. It is not as optimal as using '2008-xx-xx' but is still
useful for wildcards.
 
Now, let's do the sort field case. 
 
    field day_source, type="time", defaults to TODAY, is not stored or
indexed.
    field day_sort, type="integer", required, stored & indexed, no default
    <copyField source="day_source" dest="day_sort"/>
 
Date is a 64-bit long and the sort fields must be integers. The field is not
stored, only indexed. This means that it has to be self-consistent but not
meaningful to the outside world.  My project does this by subtracting the
date value for jan-1-2007 from the date and using the lower 32 bits for the
sorting field.
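
To make the date -> int step concrete, here is a minimal Java sketch of the
offset idea. It is not Solr code: the class name DaySortKey, the method
sortKey, and the choice to count whole days are assumptions made for
illustration. One caveat that motivates counting days: if the date value is in
milliseconds, the low 32 bits of the raw difference wrap within a few weeks,
so the value has to be reduced to a coarser unit before narrowing to an int.

```java
import java.time.Instant;

final class DaySortKey {
    // Assumed base date: 2007-01-01T00:00:00Z, as epoch milliseconds.
    static final long BASE_MILLIS =
            Instant.parse("2007-01-01T00:00:00Z").toEpochMilli();
    static final long MILLIS_PER_DAY = 86_400_000L;

    // Whole days since the base date, narrowed to an int.
    // Math.floorDiv keeps the ordering correct for dates before the base.
    static int sortKey(long dateMillis) {
        long days = Math.floorDiv(dateMillis - BASE_MILLIS, MILLIS_PER_DAY);
        return Math.toIntExact(days); // throws if the value does not fit in 32 bits
    }
}
```

The result is only meaningful relative to other keys built the same way, which
is all a non-stored sort field needs.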

Is this set of cases addressed in Solr 1.3? I think using String as the
intermediate type would work in the first three cases but not for the date
-> int case. This would involve describing transformations in <copyField>
directives and that is much too involved.
 
The advantage of using <copyField> between dissimilar types is that with
defaulting, you exactly duplicate the information without relying on your
feeding software. With 'date' field formula syntax, this is the only way to
have duplicate fields for different purposes.
 
Thanks for your time,
 
Lance Norskog

Re: copyField limitation

Posted by Ryan McKinley <ry...@gmail.com>.

> But, the <copyField> directive in the schema has a limitation. It will only
> copy data between fields with the same type. If the two fields are a
> different type, the copy is ignored. This example would require <copyField>
> to translate 'sint' to 'integer'. 

really?  what version are you running?  what error do you get?  Does it 
just not show up?  Have you checked your index with luke?

I don't see anything in the DocumentBuilder that tries to behave this way:
http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/DocumentBuilder.java

ryan

RE: copyField limitation

Posted by Lance Norskog <go...@gmail.com>.

Sorting on a non-integer has space problems. As I understand it, sorting
creates an array of integers the size of the number of records in the entire
index. Sorting on a non-integer type also creates a separate array of the
same size with the field data copied into it.  Thus sorting a non-integer
field can use several times as much memory.
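
Taking that description at face value, the comparison can be put as rough
arithmetic. The Java sketch below is only that: the class and method names are
hypothetical, bytesPerValue is an assumed average rather than a measured
number, and nothing here inspects Lucene's actual data structures.

```java
final class SortMemoryEstimate {
    // One 4-byte int per document in the index.
    static long intSortBytes(long numDocs) {
        return 4L * numDocs;
    }

    // Per the description above: the per-document int array plus a second
    // array of the same length holding a copy of each field value.
    // bytesPerValue is an assumed average size, not a measured one.
    static long nonIntSortBytes(long numDocs, long bytesPerValue) {
        return intSortBytes(numDocs) + numDocs * bytesPerValue;
    }
}
```

Under these assumptions the ratio is (4 + bytesPerValue) / 4, so a copied
value of 8 bytes already triples the memory used.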

We have a very large index with very small records. We are creating matching
integer fields for various fields just to have faster sorts, and we are
doing this after benchmarking our speed and space behaviours.

I filed a Jira issue:

https://issues.apache.org/jira/browse/SOLR-464

Thanks for your time,

Lance Norskog

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, January 17, 2008 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: copyField limitation

On Jan 17, 2008 4:53 PM, Lance Norskog <go...@gmail.com> wrote:
> Because sort works much faster on type 'integer', but range queries do 
> not work on type 'integer',

Really?  The sort speed should be identical.

-Yonik

Re: copyField limitation

Posted by Yonik Seeley <yo...@apache.org>.

On Jan 17, 2008 4:53 PM, Lance Norskog <go...@gmail.com> wrote:
> Because sort works much faster on type 'integer', but range queries do not
> work on type 'integer',

Really?  The sort speed should be identical.

-Yonik

Re: copyField limitation

Posted by Grant Ingersoll <gs...@apache.org>.

This may be possible to do with Lucene's new SinkTokenizer/ 
TeeTokenFilter functionality.  You might find http://www.mail-archive.com/solr-dev@lucene.apache.org/msg06863.html 
  useful in that context.  Also, search the Lucene dev list for  
discussion.

-Grant

On Jan 22, 2008, at 3:13 PM, Lance Norskog wrote:

> A more interesting use case:
>
> Analyzing text and finding a number, like the mean word length or the mean
> number of repeated words. These are standard tools for spam detection. To
> create these, we would want to shovel text into a text processing chain that
> creates an integer. We then want to both store that integer and index it. We
> don't want to store the shoveled text.
>
> Solr does not now do this. I don't know if the Solr processing stack has
> this flexibility, or if it is worth adding it.
>
> Lance

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: copyField limitation

Posted by Ryan McKinley <ry...@gmail.com>.

> 
> Solr does not now do this. I don't know if the Solr processing stack has
> this flexibility, or if it is worth adding it.
> 

I understand every example you have suggested -- i just don't get how it 
isn't possible.  Can you post an example of the schema+commands that give 
you an error?

If your goal is to process incoming text and add some derivative stored 
fields, you may want to look at 1.3-dev UpdateRequestProcessor
http://wiki.apache.org/solr/UpdateRequestProcessor

If you just need to change the token value (the indexed value, *not* the 
stored value) perhaps a custom FieldType where you override:

   public String toInternal(String val);
   public String toExternal(Fieldable f);
   public String indexedToReadable(String indexedForm);

ryan

RE: copyField limitation

Posted by Lance Norskog <go...@gmail.com>.

A more interesting use case:

Analyzing text and finding a number, like the mean word length or the mean
number of repeated words. These are standard tools for spam detection. To
create these, we would want to shovel text into a text processing chain that
creates an integer. We then want to both store that integer and index it. We
don't want to store the shoveled text.

Solr does not now do this. I don't know if the Solr processing stack has
this flexibility, or if it is worth adding it.
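
For concreteness, the kind of derived integer meant here can be sketched in
plain Java, outside Solr. The class name TextStats is hypothetical, and
treating a "word" as any run of non-whitespace characters is an assumption; a
real chain would use whatever tokenizer the field already has.

```java
final class TextStats {
    // Mean word length, rounded to the nearest integer; 0 for text with no words.
    // A "word" is a run of non-whitespace characters (an assumption).
    static int meanWordLength(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        String[] words = trimmed.split("\\s+");
        long totalChars = 0;
        for (String w : words) {
            totalChars += w.length();
        }
        return (int) Math.round((double) totalChars / words.length);
    }
}
```

Only the returned integer would be stored and indexed; the input text is
discarded after the computation.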

Lance

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Thursday, January 17, 2008 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: copyField limitation




Re: copyField limitation

Posted by Chris Hostetter <ho...@fucit.org>.


: But, the <copyField> directive in the schema has a limitation. It will only
: copy data between fields with the same type. If the two fields are a
: different type, the copy is ignored. This example would require <copyField>
: to translate 'sint' to 'integer'. 

i can't reproduce this problem. with the following additions to the 
example schema...

   <field name="popularityI" type="integer" indexed="true" stored="true" default="0"/>
   ...
   <copyField source="popularity" dest="popularityI"/>

...i was able to see, sort, and search on the popularityI field with no 
problems.

: Another case is days (not times):
	...
: This would express the date as a string 2008-xx-xxT00:00:00Z and store that
: into the day field. It is not as optimal as using '2008-xx-xx' but is still
: useful for wildcards.
	...

I'm not entirely sure i understand what you are asking ... but i believe 
your point is that there is no easy way to do a copyField that reformats 
the data (ie: changing date formats, or converting the date to an int) 

In my opinion, this class of situations isn't a limitation of copyField as 
much as it is a silly restriction in the way FieldTypes are handled by 
IndexSchema ... currently "TextField" is a special case because it's the 
only FieldType that can have an analyzer (i'm not even sure where this 
special case logic is ... i thought it was when the IndexSchema is 
initialized, but i can't find it now)

It would be nice if any FieldType could have an analyzer, and as long as 
the token(s) produced by that analyzer met the necessary conditions for 
the data type, things would go on their merry way ... DateReFormatFilters 
could be used to convert from any arbitrary date format to the one Solr 
expects, etc.... you could have a detailedDate field and <copyField> 
from that to a justDate string field that used a PatternReplaceFilter to 
strip off the time.
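
The replacement such a filter would apply to the justDate field can be
sketched with java.util.regex. This is the bare regex step only, not the Solr
filter itself; the class name StripTime and the pattern are assumptions.

```java
import java.util.regex.Pattern;

final class StripTime {
    // Matches the time portion of an ISO-8601 timestamp, e.g. "T00:00:00Z".
    private static final Pattern TIME_PART = Pattern.compile("T.*$");

    // "2008-01-17T12:34:56Z" -> "2008-01-17"; input without a 'T' is unchanged.
    static String justDate(String detailedDate) {
        return TIME_PART.matcher(detailedDate).replaceAll("");
    }
}
```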

This still wouldn't help change the "stored" value of those fields though 
so that the data would look right when retrieving stored values.

Perhaps we should add an optional hook for mutating the "stored" value of 
a fieldtype as well?  ... it could be an Analyzer (ie: 
tokenizer+filterchain) so that we get reuse of existing concepts, with 
each resulting token being treated as a separate multivalue (for the 
common case of rejoining all the tokens into a single string, we can add a 
StringBufferConcatTokenFilter or something) 
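
A very rough sketch of what such a hook might look like, in plain Java and
without any Lucene or Solr types. Everything here is hypothetical: the class
StoredValueChain, the whitespace "tokenizer", and modelling filters as
UnaryOperator<String> are stand-ins for a real tokenizer+filterchain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class StoredValueChain {
    private final List<UnaryOperator<String>> filters;

    StoredValueChain(List<UnaryOperator<String>> filters) {
        this.filters = filters;
    }

    // Split on whitespace, run every token through each filter in order,
    // and return one stored value per token (the multivalue case).
    List<String> toStoredValues(String raw) {
        List<String> out = new ArrayList<>();
        for (String token : raw.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no values
            }
            String t = token;
            for (UnaryOperator<String> f : filters) {
                t = f.apply(t);
            }
            out.add(t);
        }
        return out;
    }

    // The "rejoin all the tokens into a single string" case.
    String toSingleStoredValue(String raw) {
        return String.join(" ", toStoredValues(raw));
    }
}
```

The second method plays the role of the concatenating filter mentioned above;
in a real implementation that would more likely be the last filter in the
chain rather than a separate method.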

	?


-Hoss