Posted to java-user@lucene.apache.org by Yonghui Zhao <zh...@gmail.com> on 2014/09/22 19:48:58 UTC

sortedset vs taxonomy

If we want to implement a simple facet counting feature, it seems we can do
it with either SortedSet facets or the taxonomy writer/reader.

SortedSet seems simpler, but it doesn't support hierarchical facet counts
such as A/B/C.

What are the advantages and disadvantages of SortedSet versus the taxonomy?

Is there any trouble with the taxonomy when the index is optimized (merged)?

Re: sortedset vs taxonomy

Posted by Shai Erera <se...@gmail.com>.
Hi

The taxonomy faceting approach maintains a sidecar index where it keeps the
taxonomy and assigns an integer (ordinal) to each category. Those integers
are encoded in a BinaryDocValues field for each document. It supports
hierarchical faceting as well as assigning additional metadata to each
facet occurrence (called associations). At search time, faceting is done by
aggregating the category ordinals found in each document. Since those
ordinals are global to the index, merging and finding the top-K facets
across segments is relatively cheap.
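
To make the global-ordinal point concrete, here is a toy model in plain Java.
It is only a sketch: the class, the TAXONOMY list and the int[][] segments
are hypothetical stand-ins for the sidecar index and the per-document
ordinals, not Lucene APIs. Because every segment uses the same ordinals, one
counts array can be shared and no remapping is needed before picking the
top-K.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: plain Java collections stand in for the taxonomy index
// and the per-document ordinals. None of these names are Lucene APIs.
final class GlobalOrdinalCounting {

    // Hypothetical sidecar taxonomy: the list position is the global ordinal.
    static final List<String> TAXONOMY =
            Arrays.asList("Author/Bob", "Author/Lisa", "Author/Susan");

    // Adds one to counts[ord] for every ordinal of every document in a segment.
    static void aggregate(int[][] segmentDocs, int[] counts) {
        for (int[] docOrdinals : segmentDocs) {
            for (int ord : docOrdinals) {
                counts[ord]++;
            }
        }
    }

    // Returns the k most frequent categories as "label (count)" strings.
    static List<String> topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        // Highest count first; ties broken by lower ordinal.
        ords.sort((a, b) -> counts[b] != counts[a]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        List<String> top = new ArrayList<>();
        for (int ord : ords.subList(0, Math.min(k, ords.size()))) {
            top.add(TAXONOMY.get(ord) + " (" + counts[ord] + ")");
        }
        return top;
    }

    public static void main(String[] args) {
        // Each inner array holds the category ordinals of one document.
        int[][] segment1 = {{0}, {1}, {1}};
        int[][] segment2 = {{1}, {2}, {0}};
        int[] counts = new int[TAXONOMY.size()];
        aggregate(segment1, counts); // the same array is reused for every
        aggregate(segment2, counts); // segment, since ordinals are global
        System.out.println(topK(counts, 2)); // [Author/Lisa (3), Author/Bob (2)]
    }
}
```

Labels are only looked up for the k winners at the very end; everything
before that works on integers.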

The SortedSet faceting approach does not need a sidecar index and relies on
the SortedSet fields. Here too each term/category is assigned an ordinal
and at search time the facets are aggregated using those ordinals. However,
the ordinals of the same category are not the same across segments, and
therefore finding the top-K facets is a bit more expensive (roughly 20%
slower if I remember correctly).
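
A toy model of why per-segment ordinals cost more, again in plain Java. All
names here are hypothetical, and merging on the label string is an
assumption of this sketch chosen for brevity, not a description of how
Lucene reconciles ordinals internally. The point is only that counts
collected per segment cannot be summed by ordinal; an extra mapping step is
needed.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: each segment is a String[][] (documents -> labels).
// None of these names are Lucene APIs.
final class PerSegmentOrdinalCounting {

    // A segment's ordinal table: its distinct labels in sorted order,
    // so the array index is the segment-local ordinal.
    static String[] segmentDictionary(String[][] docs) {
        TreeSet<String> labels = new TreeSet<>();
        for (String[] doc : docs) {
            labels.addAll(Arrays.asList(doc));
        }
        return labels.toArray(new String[0]);
    }

    // Counts by local ordinal inside each segment, then merges the
    // per-segment counts through the label (the extra step).
    static Map<String, Integer> countAcrossSegments(String[][][] segments) {
        Map<String, Integer> merged = new TreeMap<>();
        for (String[][] docs : segments) {
            String[] dict = segmentDictionary(docs);
            int[] local = new int[dict.length];
            for (String[] doc : docs) {
                for (String label : doc) {
                    local[Arrays.binarySearch(dict, label)]++;
                }
            }
            for (int ord = 0; ord < dict.length; ord++) {
                merged.merge(dict[ord], local[ord], Integer::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String[][][] segments = {
            {{"Lisa"}, {"Susan"}},        // here "Lisa" has local ordinal 0
            {{"Bob"}, {"Lisa"}, {"Lisa"}} // here "Lisa" has local ordinal 1
        };
        System.out.println(countAcrossSegments(segments)); // {Bob=1, Lisa=3, Susan=1}
    }
}
```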

Another difference is that the SortedSet approach assigns ordinals in sorted
order, so e.g. the category A/B will always receive an ordinal that is
smaller than A/C. In the taxonomy approach though, whichever facet got
added first receives the lowest ordinal, except that a parent category
always receives a smaller ordinal than any of its children.
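
The two numbering schemes can be compared with a small plain-Java sketch.
This is a toy model with hypothetical names, not Lucene code, and it
assumes '/' as the path separator purely for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model only: shows two ways of numbering categories.
// None of these names are Lucene APIs.
final class OrdinalAssignment {

    // Taxonomy-style: ordinals follow insertion order, but a missing
    // parent is always added (and numbered) before its child.
    static Map<String, Integer> insertionOrderOrdinals(List<String> paths) {
        Map<String, Integer> ords = new LinkedHashMap<>();
        for (String path : paths) {
            addWithParents(path, ords);
        }
        return ords;
    }

    private static void addWithParents(String path, Map<String, Integer> ords) {
        if (ords.containsKey(path)) {
            return;
        }
        int slash = path.lastIndexOf('/');
        if (slash > 0) {
            addWithParents(path.substring(0, slash), ords);
        }
        ords.put(path, ords.size());
    }

    // SortedSet-style: ordinals follow the sorted order of the labels.
    static Map<String, Integer> sortedOrdinals(Collection<String> labels) {
        Map<String, Integer> ords = new LinkedHashMap<>();
        for (String label : new TreeSet<>(labels)) {
            ords.put(label, ords.size());
        }
        return ords;
    }

    public static void main(String[] args) {
        List<String> added = Arrays.asList("A/C", "A/B"); // A/C is added first
        System.out.println(insertionOrderOrdinals(added)); // {A=0, A/C=1, A/B=2}
        System.out.println(sortedOrdinals(added));         // {A/B=0, A/C=1}
    }
}
```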

Working with SortedSet facets is indeed simpler than the taxonomy, but the
taxonomy does not seriously complicate things. If you need a facet
hierarchy, you should use the taxonomy approach. Otherwise, I would just
try each and see which one works better for your use case.

As for optimizing an index, using taxonomy facets does not make any
difference in that case.

Shai
