Posted to solr-user@lucene.apache.org by so...@rodland.no on 2010/06/29 12:18:22 UTC

use copyField to gather and then split

(sorry if this message ends up being sent twice)

We have a use-case where we'd like to fill a field from multiple sources, 
i.e.

<copyField source="title" dest="text" />
<copyField source="body" dest="text" />
… (other source-fields are copied into text as well)

and then analyze the resulting text-field in a number of ways, each 
requiring its own field.

Is it possible to somehow copy the text-field from above to these new 
fields - i.e.

<copyField source="text" dest="textanayzemethod2" />
<copyField source="text" dest="textanayzemethod1" />

Is this at all possible, or do we have to duplicate the first set of 
copyFields for each textanayzemethodN?

If possible, is the order of the statements in schema.xml important here?

Any tips or hints are highly appreciated.


regards,


Fredrik Rodland


--
Fredrik Rødland               Mail:  fredrik@rodland.no
                              Cell:  +47 99 21 98 17
Open Ad Exchange              MSN:   msn@rodland.no
Maisen Pedersens vei 1        AIM:   Fredrik Rodland
NO-1363 Høvik, NORWAY         Web:   http://rodland.no

Re: year range field, proper data type?

Posted by Lance Norskog <go...@gmail.com>.
There is no 'trie string'.

If you use a trie type for this problem, sorting will take much less
memory. Sorting strings uses memory both per document and per unique
term. The Trie types do not use any memory per unique term. So, yes, a
Trie Integer is a good choice for this problem.
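
For the setup question, here is a minimal schema.xml sketch, assuming a
Solr 1.4-style schema. The field name pub_year is made up for illustration,
and the precisionStep value is only a starting point, not a tuned
recommendation:

<!-- trie-encoded integer type for range queries and sorting -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0" />
<!-- pub_year is a hypothetical field name -->
<field name="pub_year" type="tint" indexed="true" stored="true" />

A range limit such as pub_year:[1950 TO 1975] would then run against the
numeric values, so "907" and "0907" should end up indexed as the same year.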

On Wed, Jul 7, 2010 at 12:59 PM, Erick Erickson <er...@gmail.com> wrote:
> This isn't a very worrisome case. Most of the messages you see on the board
> about the dangers of dates arise because dates can be stored with many unique
> values if they include milliseconds. Then, when sorting on date, your memory
> explodes because all the dates are loaded into memory.
>
> In your case, there are a max of 10,000 years, which isn't the same magnitude
> of problem as, say, 10,000,000 documents each with a unique timestamp.
>
> That said, you might as well go for as much speed as you can get and use a
> trie int; that way you won't be tripped up by three-digit years being out of
> lexical order.
>
> Best
> Erick
>
> On Wed, Jul 7, 2010 at 10:55 AM, Jonathan Rochkind <ro...@jhu.edu> wrote:
>
>> So I will have a solr field that contains "years", ie, "1990", "2010",
>> maybe even "1492", "1209" and "907"/"0907".
>>
>> I will be doing range limits over this field.  Ie, [1950 TO 1975] or what
>> have you.  The data represents publication dates of books on a large library's
>> shelves; there will be around 3 million documents, with the range of data
>> being concentrated in recent years, but with a long tail stretching off into
>> the past.
>>
>> So it seems to me clear that I should use a trie field of some type, to
>> efficiently accommodate the range queries.
>>
>> It seems to me that I probably don't need/want an actual date field, since
>> the data isn't complex to demand it, it's just a four-digit year.
>>
>> So that pretty much leaves storing as a trie integer, or as a trie string.
>>   Any advice on which is probably better in this case? Or on how to set up
>> the trie field for this kind of data? Thanks for any,
>>
>> Jonathan
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: year range field, proper data type?

Posted by Erick Erickson <er...@gmail.com>.
This isn't a very worrisome case. Most of the messages you see on the board
about the dangers of dates arise because dates can be stored with many unique
values if they include milliseconds. Then, when sorting on date, your memory
explodes because all the dates are loaded into memory.

In your case, there are a max of 10,000 years, which isn't the same magnitude
of problem as, say, 10,000,000 documents each with a unique timestamp.

That said, you might as well go for as much speed as you can get and use a
trie int; that way you won't be tripped up by three-digit years being out of
lexical order.

Best
Erick

On Wed, Jul 7, 2010 at 10:55 AM, Jonathan Rochkind <ro...@jhu.edu> wrote:

> So I will have a solr field that contains "years", ie, "1990", "2010",
> maybe even "1492", "1209" and "907"/"0907".
>
> I will be doing range limits over this field.  Ie, [1950 TO 1975] or what
> have you.  The data represents publication dates of books on a large library's
> shelves; there will be around 3 million documents, with the range of data
> being concentrated in recent years, but with a long tail stretching off into
> the past.
>
> So it seems to me clear that I should use a trie field of some type, to
> efficiently accommodate the range queries.
>
> It seems to me that I probably don't need/want an actual date field, since
> the data isn't complex to demand it, it's just a four-digit year.
>
> So that pretty much leaves storing as a trie integer, or as a trie string.
>   Any advice on which is probably better in this case? Or on how to set up
> the trie field for this kind of data? Thanks for any,
>
> Jonathan
>

year range field, proper data type?

Posted by Jonathan Rochkind <ro...@jhu.edu>.
So I will have a solr field that contains "years", ie, "1990", "2010", 
maybe even "1492", "1209" and "907"/"0907".

I will be doing range limits over this field.  Ie, [1950 TO 1975] or 
what have you.  The data represents publication dates of books on a 
large library's shelves; there will be around 3 million documents, with 
the range of data being concentrated in recent years, but with a long 
tail stretching off into the past.

So it seems to me clear that I should use a trie field of some type, to 
efficiently accommodate the range queries.

It seems to me that I probably don't need/want an actual date field, 
since the data isn't complex to demand it, it's just a four-digit year.

So that pretty much leaves storing as a trie integer, or as a trie 
string.   Any advice on which is probably better in this case? Or on how 
to set up the trie field for this kind of data? Thanks for any,

Jonathan

Re: use copyField to gather and then split

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi pal :)

Unfortunately, copyField works only BEFORE analysis, and you cannot "chain" them: a field that is a copyField destination cannot be used as the source of another copyField.

The simplest solution would be to duplicate your copyFields:

<copyField source="title" dest="textanayzemethod2" />
<copyField source="body" dest="textanayzemethod2" />

<copyField source="title" dest="textanayzemethod1" />
<copyField source="body" dest="textanayzemethod1" />
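
Note that a field filled from several sources has to be multiValued. Here is a sketch of the matching field declarations, where the type names text_method1 and text_method2 are placeholders for whatever analysis chains you define:

<!-- hypothetical type names; each fieldType carries its own analyzer -->
<field name="textanayzemethod1" type="text_method1" indexed="true" stored="false" multiValued="true" />
<field name="textanayzemethod2" type="text_method2" indexed="true" stored="false" multiValued="true" />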

Another way would be to look into the UpdateProcessorChain and write a "copy" processor which does whatever you need.
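
As a rough sketch of the wiring in solrconfig.xml, assuming you write the processor yourself (com.example.CopyToAnalyzedFieldsProcessorFactory is a made-up class name, not something that ships with Solr, and the chain name is arbitrary too):

<updateRequestProcessorChain name="copy-text">
  <!-- hypothetical custom processor that copies title, body, ... into each target field -->
  <processor class="com.example.CopyToAnalyzedFieldsProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How the chain is selected on the update handler depends on the Solr version, so check the example solrconfig.xml that ships with yours.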

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 29. juni 2010, at 12.18, solr@rodland.no wrote:

> (sorry if this message ends up being sent twice)
> 
> We have a use-case where we'd like to fill a field from multiple sources, i.e.
> 
> <copyField source="title" dest="text" />
> <copyField source="body" dest="text" />
> … (other source-fields are copied into text as well)
> 
> and then analyze the resulting text-field in a number of ways, each requiring its own field.
> 
> Is it possible to somehow copy the text-field from above to these new fields - i.e.
> 
> <copyField source="text" dest="textanayzemethod2" />
> <copyField source="text" dest="textanayzemethod1" />
> 
> Is this at all possible, or do we have to duplicate the first set of copyFields for each textanayzemethodN?
> 
> If possible, is the order of the statements in schema.xml important here?
> 
> Any tips or hints are highly appreciated.
> 
> 
> regards,
> 
> 
> Fredrik Rodland
> 
> 
> --
> Fredrik Rødland               Mail:  fredrik@rodland.no
>                             Cell:  +47 99 21 98 17
> Open Ad Exchange              MSN:   msn@rodland.no
> Maisen Pedersens vei 1        AIM:   Fredrik Rodland
> NO-1363 Høvik, NORWAY         Web:   http://rodland.no