You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Robin Palotai <m....@gmail.com> on 2011/04/06 10:09:00 UTC
solr faceted search performance reason
Hello List,
Please see my question at
http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search,
I would be interested to know some details.
Thank you,
Robin
Re: solr faceted search performance reason
Posted by Jonathan Rochkind <ro...@jhu.edu>.
PS: If you want to see how Solr actually computes facetting (the
facetting code lives in the 'Solr' codebase, not in the lower level
lucene codebase), here's the file to look at, this web snapshot is from
1.4.1 dont' know if it's been changed more recently, but I don't think
majorly:
http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/request/SimpleFacets.java?format=ok
It's kind of confusing, precisely because it takes several different
approaches depending on the nature of the result set and schema, trying
to pick the most performant approach for the context. I still haven't
wrapped my head around it entirely (I am not a Solr/lucene developer,
just a user).
On 4/6/2011 2:06 PM, Jonathan Rochkind wrote:
> On 4/6/2011 10:55 AM, Robin Palotai wrote:
>> Therefore, Lucene supposedly has some advanced technique for multi-field
>> queries other than just taking the intersection of matching documents based
>> on the inverted index.
> I don't think so, neccesarily. It's just that Lucene's algorithms to
> doing this is very fast, with some additional optimizations to make it
> even faster. There may be some edge cases where the optimizations take
> some shortcuts on top of this -- ie, if you ask for only the first ten
> facet values ordered by number of hits, in some cases solr/lucene won't
> even calculate the hit counts for facet values it already knows aren't
> going to be in the top 10. The facetting code in 1.4+ is actually kind
> of tangled, in that several different calculation approaches can be
> taken depending on the nature of the result set and schema.
>
>
> But anyway, I think you're right that you could set up an rdbms schema
> to _conceptually_ allow very similar operations to a lucene index. It
> would be unlikely to perform as well, because the devil is in the
> details of the storage formats and algorithms, and lucene has been
> optimized for these particular cases (at the expense of not covering a
> great many cases that an rdbms can cover).
>
> In fact, while I can't find it now on Google, I think someone HAS in the
> past written an extension to lucene to have it store it's indexes in an
> rdbms using a schema much like you describe, instead of in the file
> system. I'm not sure why they would want to do this instead of just
> using the rdbms -- either lucene's access algorithms still provide a
> performance benefit even when using an rdbms as the underlying 'file
> system', or lucene provides convenient functions that you wouldn't want
> to have to re-implement yourself solely in terms of an rdbms, or both.
> Ah, here's a brief reference to that approach in the lucene FAQ:
> http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F
>
> Jonathan
>
>> So the question is, what is this technique/trick? More broadly: Why can
>> Lucene/Solr achieve better faceted search performance theoretically than
>> RDBMS could (if so)?
>>
>> *Note: My first guess would be that Lucene would use some space partitioning
>> method for partitioning a vector space built from the document fields as
>> dimensions, but as I understand Lucene is not purely vector space based.*
>> Thanks,
>> Robin
>>
>> On Wed, Apr 6, 2011 at 3:15 PM, Erick Erickson<er...@gmail.com>wrote:
>>
>>> Please re-post the question here so others can see
>>> the discussion without going to another list.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotai<m.palotai.robin@gmail.com
>>>> wrote:
>>>> Hello List,
>>>>
>>>> Please see my question at
>>>>
>>>>
>>> http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search
>>>> ,
>>>> I would be interested to know some details.
>>>>
>>>> Thank you,
>>>> Robin
>>>>
Re: solr faceted search performance reason
Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 4/6/2011 10:55 AM, Robin Palotai wrote:
> Therefore, Lucene supposedly has some advanced technique for multi-field
> queries other than just taking the intersection of matching documents based
> on the inverted index.
I don't think so, neccesarily. It's just that Lucene's algorithms to
doing this is very fast, with some additional optimizations to make it
even faster. There may be some edge cases where the optimizations take
some shortcuts on top of this -- ie, if you ask for only the first ten
facet values ordered by number of hits, in some cases solr/lucene won't
even calculate the hit counts for facet values it already knows aren't
going to be in the top 10. The facetting code in 1.4+ is actually kind
of tangled, in that several different calculation approaches can be
taken depending on the nature of the result set and schema.
But anyway, I think you're right that you could set up an rdbms schema
to _conceptually_ allow very similar operations to a lucene index. It
would be unlikely to perform as well, because the devil is in the
details of the storage formats and algorithms, and lucene has been
optimized for these particular cases (at the expense of not covering a
great many cases that an rdbms can cover).
In fact, while I can't find it now on Google, I think someone HAS in the
past written an extension to lucene to have it store it's indexes in an
rdbms using a schema much like you describe, instead of in the file
system. I'm not sure why they would want to do this instead of just
using the rdbms -- either lucene's access algorithms still provide a
performance benefit even when using an rdbms as the underlying 'file
system', or lucene provides convenient functions that you wouldn't want
to have to re-implement yourself solely in terms of an rdbms, or both.
Ah, here's a brief reference to that approach in the lucene FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F
Jonathan
> So the question is, what is this technique/trick? More broadly: Why can
> Lucene/Solr achieve better faceted search performance theoretically than
> RDBMS could (if so)?
>
> *Note: My first guess would be that Lucene would use some space partitioning
> method for partitioning a vector space built from the document fields as
> dimensions, but as I understand Lucene is not purely vector space based.*
> Thanks,
> Robin
>
> On Wed, Apr 6, 2011 at 3:15 PM, Erick Erickson<er...@gmail.com>wrote:
>
>> Please re-post the question here so others can see
>> the discussion without going to another list.
>>
>> Best
>> Erick
>>
>> On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotai<m.palotai.robin@gmail.com
>>> wrote:
>>> Hello List,
>>>
>>> Please see my question at
>>>
>>>
>> http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search
>>> ,
>>> I would be interested to know some details.
>>>
>>> Thank you,
>>> Robin
>>>
Re: solr faceted search performance reason
Posted by Robin Palotai <m....@gmail.com>.
Carbon copied:
*Context*
This is a question mainly about Lucene (or possibly Solr) internals. The
main topic is *faceted search*, in which search can happen along multiple
independent dimensions (facets) of objects (for example size, speed, price
of a car).
When implemented with relational database, for a large number of facets
multi-field indices are not useful, since facets can be searched in any
order, so a specific ordered multi-index is used with low chance, and
creating all possible orderings of indices is unbearable.
Solr is advertised to cope well with the faceted search task, which if I
think correctly has to be connected with Lucene (supposedly) performing well
on multi-field queries (where fields of a document relate to facets of an
object).
*Question*
The *inverted index* of Lucene can be stored in a relational database, and
naturally taking the intersections of the matching documents can also be
trivially achieved with RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field
queries other than just taking the intersection of matching documents based
on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can
Lucene/Solr achieve better faceted search performance theoretically than
RDBMS could (if so)?
*Note: My first guess would be that Lucene would use some space partitioning
method for partitioning a vector space built from the document fields as
dimensions, but as I understand Lucene is not purely vector space based.*
Thanks,
Robin
On Wed, Apr 6, 2011 at 3:15 PM, Erick Erickson <er...@gmail.com>wrote:
> Please re-post the question here so others can see
> the discussion without going to another list.
>
> Best
> Erick
>
> On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotai <m.palotai.robin@gmail.com
> >wrote:
>
> > Hello List,
> >
> > Please see my question at
> >
> >
> http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search
> > ,
> > I would be interested to know some details.
> >
> > Thank you,
> > Robin
> >
>
Re: solr faceted search performance reason
Posted by Erick Erickson <er...@gmail.com>.
Please re-post the question here so others can see
the discussion without going to another list.
Best
Erick
On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotai <m....@gmail.com>wrote:
> Hello List,
>
> Please see my question at
>
> http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search
> ,
> I would be interested to know some details.
>
> Thank you,
> Robin
>