Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2007/11/01 02:08:48 UTC

Re: sorting on dynamic fields - good, bad, neither?

: So far this seems acceptable. Query performance seems fine when using
: the dynamic fields to sort result sets; indexing performance also
: seems fine*. That said, there are only 400K documents in the
: collection I'm working with, and few external rating sources at the
: moment (there are about a dozen, and most documents have no external
: ratings data associated with them). But as these fields will be
: created from user-generated data, there's nothing to stop those
: numbers from ballooning.

the biggest factor to worry about is the number of "sources" ... the key 
to understanding the performance risks is to understand that:
  1) no matter how many documents do or don't have a value for a given 
field, when you sort on that field, a (cached) array containing one 
element for *every* doc in your index is used for that field.
  2) sorting dynamic fields is no different than sorting regular fields.

...so if you've got three sources, and from each source you get a 
"userRatingAvg" and a "userRatingSum" and you want to sort on them, it 
doesn't matter if you create 6 distinct fields, or two dynamic fields; and 
it doesn't matter if only 5 of your 400K docs have values for any one of 
those fields -- an array of 400K entries is going to be created for each 
of those fields the first time you sort on it with each "newSearcher".

as long as you've got the ram ... add as many 
sources/dynamicFields/documents as you want :)
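
To put rough numbers on "as long as you've got the ram", here is a back-of-the-envelope sketch. It assumes 4 bytes per array entry (an int or float sort field); string fields cost more, and the source names and the `field_cache_bytes` helper are made up for illustration.

```python
def field_cache_bytes(num_docs, num_sort_fields, bytes_per_entry=4, searchers=1):
    """Rough lower bound on memory held by per-field sort arrays.

    Assumes one array entry per document per sorted field, regardless of
    how many documents actually have a value for that field.
    """
    return num_docs * num_sort_fields * bytes_per_entry * searchers


NUM_DOCS = 400_000
sources = ["sourceA", "sourceB", "sourceC"]   # hypothetical rating sources
metrics = ["userRatingAvg", "userRatingSum"]  # the two values per source

# Six concrete field names, whether declared one by one or via dynamic fields.
sort_fields = [f"{metric}_{source}" for source in sources for metric in metrics]

per_field = field_cache_bytes(NUM_DOCS, 1)
total = field_cache_bytes(NUM_DOCS, len(sort_fields))

print(f"{len(sort_fields)} fields, {per_field / 1e6:.1f} MB each, {total / 1e6:.1f} MB total")
```

Under these assumptions the cost scales linearly with the number of distinct fields you sort on, which is why the number of sources matters more than how many docs have ratings.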


-Hoss


Re: sorting on dynamic fields - good, bad, neither?

Posted by Chris Hostetter <ho...@fucit.org>.
: Each element of the cached array is a ... what? The ID of the

the elements of the array are the values, the indexes into the array are 
the document IDs ... essentially it's an inverted inverted index (a 
docid -> value lookup, instead of the usual term -> docs).
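
A tiny sketch of that layout, not Lucene's actual code: a dense list indexed by docid, with a default of 0.0 for docs that have no value (an assumption made here for illustration). The names `values` and `sort_hits` are hypothetical.

```python
NUM_DOCS = 8

# Only a few documents actually carry a rating.
ratings_by_doc = {1: 4.5, 4: 2.0, 6: 3.5}

# One slot per document in the index; the list index is the docid.
# Docs without a value keep the default 0.0 (assumed for this sketch).
values = [0.0] * NUM_DOCS
for doc_id, rating in ratings_by_doc.items():
    values[doc_id] = rating


def sort_hits(hits, values, descending=True):
    """Order matching docids by looking each one up in the dense array."""
    return sorted(hits, key=lambda doc_id: values[doc_id], reverse=descending)


hits = [0, 1, 4, 6, 7]          # docids matched by some query
ranked = sort_hits(hits, values)
print(ranked)
```

The lookup per hit is a plain array access, which is what makes sorting cheap once the array exists; the price is one slot for every doc, whether it has a value or not.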

: document? (I'll be happy to answer this myself by reading the source
: code, but I'm not quite sure where to start looking.)

It's the FieldCacheImpl class in Lucene.

: What happens if there are more sort operations on those fields than
: there is memory to hold the cached arrays? OOM exceptions? Failed
: searches? Or simply cache evictions and degraded performance?
: Something else?

The "cache" is very simplistic -- one array per field for the life of the 
index reader involved ... so yes if you sort on enough unique fields, you 
get an OOM.
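
A toy model of that behaviour, assuming nothing about Lucene's internals beyond what is described above: arrays are built on first use, kept per (reader, field), and never evicted while the reader is alive. `ToyFieldCache` and its methods are invented for this sketch.

```python
class ToyFieldCache:
    """Illustrative only: one array per (reader, field), no eviction."""

    def __init__(self):
        self._arrays = {}   # (reader_id, field) -> list of values
        self.builds = 0     # how many arrays had to be built from scratch

    def get(self, reader_id, field, num_docs):
        key = (reader_id, field)
        if key not in self._arrays:
            # First sort on this field for this reader: build the full array.
            self._arrays[key] = [0.0] * num_docs
            self.builds += 1
        return self._arrays[key]

    def drop_reader(self, reader_id):
        # Arrays only go away when the reader they belong to goes away.
        for key in [k for k in self._arrays if k[0] == reader_id]:
            del self._arrays[key]

    def resident_entries(self):
        return sum(len(array) for array in self._arrays.values())


cache = ToyFieldCache()
for i in range(50):
    cache.get("reader-1", f"rating_{i}", num_docs=1000)
cache.get("reader-1", "rating_0", num_docs=1000)   # reused, not rebuilt

print(cache.builds, cache.resident_entries())
```

Nothing in this model pushes old arrays out, so memory grows with every new field you sort on until the reader is closed or the heap runs out.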

: > those fields -- an array of 400K entries is going to be created for each
: > of those fields the first time you sort on it with each "newSearcher"
: 
: Is the (max? min?) number of newSearchers something you control in
: solrconfig.xml?

typically there are never more than 2 searchers in Solr at any one time ... 
the one being used, and maybe one being "warmed" because a commit just 
happened. (i was referring to an event called "newSearcher" that can have 
configured actions in the solrconfig.xml -- it's a good place to put some 
seed queries that sort on fields you know will be sorted on, so that the 
first user after the new searcher is created doesn't spend a lot of time 
waiting for the FieldCache to be built.)
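
As a rough illustration of what such seed queries might contain, the sketch below generates one query per sort field using Solr's standard `q` and `sort` parameters. The field names and the `warming_queries` helper are hypothetical, `*:*` is just a placeholder query, and in practice the queries would be listed in solrconfig.xml rather than generated by a script.

```python
from urllib.parse import parse_qs, urlencode


def warming_queries(sort_fields, direction="desc"):
    """Build one parameter set per field so each sort array gets populated."""
    return [{"q": "*:*", "sort": f"{field} {direction}"} for field in sort_fields]


# Hypothetical fields that users are known to sort on.
known_sort_fields = ["userRatingAvg_sourceA", "userRatingSum_sourceA"]

queries = warming_queries(known_sort_fields)
query_strings = [urlencode(params) for params in queries]

for query_string in query_strings:
    print(query_string)
```

One query per field is enough for this purpose, since the cost being moved off the first user is the array build, not the query itself.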

: Also, it seems a bit inefficient to bother allocating an array
: containing an entry for each document when only some small percentage
: of the documents actually contain values for the field. Would it be
: worth investigating whether you could somehow avoid this to save some
: RAM?

as i said, it's sized one per doc because the docid is the index ... there 
have been some other patches in the Lucene Jira that have suggested 
alternate ways of doing sorting ... if some of those get 
tested/supported/committed we might be able to add config options for using 
them in Solr (for users who know they've got sparse fields, for example).



-Hoss


Re: sorting on dynamic fields - good, bad, neither?

Posted by Mike Klaas <mi...@gmail.com>.
On 5-Nov-07, at 2:22 PM, Charles Hornberger wrote:

> On 11/5/07, Charles Hornberger <ch...@gmail.com> wrote:
>> Also, it seems a bit inefficient to bother allocating an array
>> containing an entry for each document when only some small percentage
>> of the documents actually contain values for the field. Would it be
>> worth investigating whether you could somehow avoid this to save some
>> RAM?
>
> To clarify: I'm suggesting that *I* would do the investigating here,
> not someone else :-) ... I'm just wondering if it's even worth trying
> ....

Perhaps something to bring up on java-dev (lucene)?  I think someone  
once implemented a solution using hashmaps for sorting, but I can't  
recall the issue #.
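
For the sake of discussion, a minimal sketch of what a hashmap-based approach could look like. This is not the Lucene patch mentioned above, only an illustration of the idea; `sort_hits_sparse` and the sample data are made up.

```python
def sort_hits_sparse(hits, values_by_doc, missing=0.0, descending=True):
    """Sort docids using a map that stores only docs that have a value."""
    return sorted(
        hits,
        key=lambda doc_id: values_by_doc.get(doc_id, missing),
        reverse=descending,
    )


NUM_DOCS = 400_000

# Only 5 of the 400K docs have a rating from this source.
values_by_doc = {17: 4.5, 42: 5.0, 1000: 1.0, 250_000: 2.0, 399_999: 3.5}

dense_slots = NUM_DOCS             # slots a one-entry-per-doc array would need
sparse_slots = len(values_by_doc)  # entries the map needs

hits = [17, 42, 99, 1000, 250_000]
ranked = sort_hits_sparse(hits, values_by_doc)
print(ranked, dense_slots, sparse_slots)
```

The trade-off is that each map entry costs more than an array slot and each lookup is a hash probe rather than an index, so this only pays off when the field really is sparse.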

-Mike

Re: sorting on dynamic fields - good, bad, neither?

Posted by Charles Hornberger <ch...@gmail.com>.
On 11/5/07, Charles Hornberger <ch...@gmail.com> wrote:
> Also, it seems a bit inefficient to bother allocating an array
> containing an entry for each document when only some small percentage
> of the documents actually contain values for the field. Would it be
> worth investigating whether you could somehow avoid this to save some
> RAM?

To clarify: I'm suggesting that *I* would do the investigating here,
not someone else :-) ... I'm just wondering if it's even worth trying
....

Re: sorting on dynamic fields - good, bad, neither?

Posted by Charles Hornberger <ch...@gmail.com>.
On 10/31/07, Chris Hostetter <ho...@fucit.org> wrote:
> the biggest factor to worry about is the number of "sources" ... the key
> to understanding the performance risks is to understand that:
>   1) no matter how many documents do or don't have a value for a given
> field, when you sort on that field, a (cached) array containing one
> element for *every* doc in your index is used for that field.

Thanks very much for your helpful reply ... a few follow-up questions:

Each element of the cached array is a ... what? The ID of the
document? (I'll be happy to answer this myself by reading the source
code, but I'm not quite sure where to start looking.)

What happens if there are more sort operations on those fields than
there is memory to hold the cached arrays? OOM exceptions? Failed
searches? Or simply cache evictions and degraded performance?
Something else?

>   2) sorting dynamic fields is no different than sorting regular fields.

Good to know, thanks.

> ...so if you've got three sources, and from each source you get a
> "userRatingAvg" and a "userRatingSum" and you want to sort on them, it
> doesn't matter if you create 6 distinct fields, or two dynamic fields; and
> it doesn't matter if only 5 of your 400K docs have values for any one of
> those fields -- an array of 400K entries is going to be created for each
> of those fields the first time you sort on it with each "newSearcher"

Is the (max? min?) number of newSearchers something you control in
solrconfig.xml?

Also, it seems a bit inefficient to bother allocating an array
containing an entry for each document when only some small percentage
of the documents actually contain values for the field. Would it be
worth investigating whether you could somehow avoid this to save some
RAM?

Thanks again,
Charlie