Posted to java-user@lucene.apache.org by Nicola Buso <nb...@ebi.ac.uk> on 2013/04/26 16:44:11 UTC

Big number of values for facets

Hi all,

I'm encountering a problem indexing a document with a large number of
values for one facet.

Caused by: java.lang.IllegalArgumentException: DocValuesField "$facets" is too large, must be <= 32766
        at org.apache.lucene.index.BinaryDocValuesWriter.addValue(BinaryDocValuesWriter.java:57)
        at org.apache.lucene.index.DocValuesProcessor.addBinaryField(DocValuesProcessor.java:111)
        at org.apache.lucene.index.DocValuesProcessor.addField(DocValuesProcessor.java:57)
        at org.apache.lucene.index.TwoStoredFieldsConsumers.addField(TwoStoredFieldsConsumers.java:36)
        at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:242)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)


It's obviously hard to present such a big number of facet values to the
user, and it's also hard to decide which of these values to skip so that
the document can be stored in the index.

Do you have any suggestions on how to overcome this limit? Is it possible?
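
For context, the indexing side is the standard taxonomy facet setup; this is
a simplified sketch (not our exact code, the generated species names just
stand in for the real data) that reproduces the error:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.facet.index.FacetFields;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ManyFacetValues {
  public static void main(String[] args) throws Exception {
    Directory indexDir = new RAMDirectory();
    Directory taxoDir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(indexDir,
        new IndexWriterConfig(Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42)));
    DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
    FacetFields facetFields = new FacetFields(taxoWriter);

    Document doc = new Document();
    doc.add(new StringField("id", "P12345", Field.Store.YES));

    // a "generic" protein is categorized under thousands of species; all of
    // these ordinals get encoded into the single binary "$facets" DocValues
    // value of this one document
    List<CategoryPath> categories = new ArrayList<CategoryPath>();
    for (int i = 0; i < 10000; i++) {
      categories.add(new CategoryPath("Species", "species-" + i));
    }
    facetFields.addFields(doc, categories);

    writer.addDocument(doc); // throws: DocValuesField "$facets" is too large

    taxoWriter.close();
    writer.close();
  }
}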



Nicola




Re: Big number of values for facets

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Apr 26, 2013 at 12:48 PM, Shai Erera <se...@gmail.com> wrote:
> You can also try to use a different IntEncoder which compresses the values
> better. Try FourFlags and the like. Perhaps it will allow you to index more
> facets per document and it will be enough... though I should add "for the
> time being" b/c according to your scenario, you could easily hit more than
> 32K values...
>
> The fact is that DV limits us. Maybe that limitation can be alleviated by
> writing a Codec? Not sure... I'll have to dig into the code. If the
> limitation is on the Codec, it should be possible.

Unfortunately it's not just a Codec limitation: the core data
structures used to hold the binary values also have this limit.  We
should open an issue to explore increasing it (but I think a lot of
places assume this limit...).

> Maybe there's another solution to this. How do you use the facets? Have you
> considered using a different data structure like a Graph DB? Not sure if
> that's applicable to you at all.

I think another option is to use the new (coming in 4.3)
SortedSetDocValuesFacetFields/Accumulator?  I think it doesn't have a
hard limit?
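
Roughly, from memory against the 4.3 sorted-set API (the exact class and
constructor signatures may be slightly off, so treat this as a sketch and
check the javadocs): you index with SortedSetDocValuesFacetFields instead of
FacetFields (no taxonomy index needed) and count like this at search time:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.facet.params.FacetSearchParams;
import org.apache.lucene.facet.search.CountFacetRequest;
import org.apache.lucene.facet.search.FacetResult;
import org.apache.lucene.facet.search.FacetsCollector;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesAccumulator;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SortedSetFacetsSketch {
  // count the top-10 Species values for the documents matching `query`
  static List<FacetResult> countSpecies(IndexSearcher searcher, Query query) throws IOException {
    SortedSetDocValuesReaderState state =
        new SortedSetDocValuesReaderState(searcher.getIndexReader());
    FacetSearchParams fsp = new FacetSearchParams(
        new CountFacetRequest(new CategoryPath("Species"), 10));
    // constructor argument order is from memory -- verify against 4.3
    FacetsCollector fc = FacetsCollector.create(new SortedSetDocValuesAccumulator(state, fsp));
    searcher.search(query, fc);
    return fc.getFacetResults();
  }
}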

Mike McCandless

http://blog.mikemccandless.com



Re: Big number of values for facets

Posted by Shai Erera <se...@gmail.com>.
You can also try to use a different IntEncoder which compresses the values
better. Try FourFlags and the like. Perhaps it will allow you to index more
facets per document and it will be enough... though I should add "for the
time being" b/c according to your scenario, you could easily hit more than
32K values...
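
Something along these lines (writing this from memory against the 4.2/4.3
facet module, so double-check the package names; the point is just to swap
the encoder the category list uses):

import org.apache.lucene.facet.encoding.DGapIntEncoder;
import org.apache.lucene.facet.encoding.FourFlagsIntEncoder;
import org.apache.lucene.facet.encoding.IntEncoder;
import org.apache.lucene.facet.encoding.SortingIntEncoder;
import org.apache.lucene.facet.encoding.UniqueValuesIntEncoder;
import org.apache.lucene.facet.index.FacetFields;
import org.apache.lucene.facet.params.CategoryListParams;
import org.apache.lucene.facet.params.FacetIndexingParams;
import org.apache.lucene.facet.taxonomy.TaxonomyWriter;

public class FourFlagsFacetFields {
  // build FacetFields whose category list encodes ordinals with FourFlags
  // instead of the default VInt8-based chain
  static FacetFields create(TaxonomyWriter taxoWriter) {
    CategoryListParams clp = new CategoryListParams() {
      @Override
      public IntEncoder createEncoder() {
        // sort, dedupe and d-gap the ordinals, then pack them with FourFlags
        return new SortingIntEncoder(new UniqueValuesIntEncoder(
            new DGapIntEncoder(new FourFlagsIntEncoder())));
      }
    };
    FacetIndexingParams fip = new FacetIndexingParams(clp);
    return new FacetFields(taxoWriter, fip);
  }
}

Remember that the same FacetIndexingParams have to be used at search time
(when you build FacetSearchParams), otherwise the ordinals won't decode.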

The fact is that DV limits us. Maybe that limitation can be alleviated by
writing a Codec? Not sure... I'll have to dig into the code. If the
limitation is on the Codec, it should be possible.

Maybe there's another solution to this. How do you use the facets? Have you
considered using a different data structure like a Graph DB? Not sure if
that's applicable to you at all.

Shai
On Apr 26, 2013 7:31 PM, "Nicola Buso" <nb...@ebi.ac.uk> wrote:

> Hi,
>
> Mike: no, it's not an error in our application; I have some entries with
> these peculiarities :-) Probably these cases can be mapped in different
> ways?
>
> If I think of the ER world, it's not difficult to have an (n to m)
> relation between two tables where one of these tables is a categorization
> of some concepts; at this point I don't think it's impossible to find
> some thousands of relations between the two tables if we speak of big
> amounts of data (Lucene world :-) ).
>
> The specific case is that we are storing protein information and every
> protein is associated/categorized to species. If you consider a
> specialized protein, it's not associated with a big number of species,
> but the most generic proteins are associated with almost every species.
> Obviously there are thousands of species.
>
> Now, the user will never be interested in filtering a search result by
> thousands of species, but this is not a reason to completely discard
> a bunch of facet values; I imagine there will be queries that will
> match some species (let's say) within the 32766 saved values and some
> other queries that will match species not saved in the facets.
>
> We can try to save only the most relevant values for this facet, but again
> it's not easy to define "most relevant".
>
>
>
> Nicola.
>

Re: Big number of values for facets

Posted by Nicola Buso <nb...@ebi.ac.uk>.
Hi,

Mike: no, it's not an error in our application; I have some entries with
these peculiarities :-) Probably these cases can be mapped in different
ways?

If I think of the ER world, it's not difficult to have an (n to m)
relation between two tables where one of these tables is a categorization
of some concepts; at this point I don't think it's impossible to find
some thousands of relations between the two tables if we speak of big
amounts of data (Lucene world :-) ).

The specific case is that we are storing protein information and every
protein is associated/categorized to species. If you consider a
specialized protein, it's not associated with a big number of species,
but the most generic proteins are associated with almost every species.
Obviously there are thousands of species.

Now, the user will never be interested in filtering a search result by
thousands of species, but this is not a reason to completely discard
a bunch of facet values; I imagine there will be queries that will
match some species (let's say) within the 32766 saved values and some
other queries that will match species not saved in the facets.

We can try to save only the most relevant values for this facet, but again
it's not easy to define "most relevant".



Nicola.


On Fri, 2013-04-26 at 18:44 +0300, Shai Erera wrote:
> Unfortunately partitions are enabled globally and not per document. And you
> cannot activate them as you go. It's a setting you need to enable before
> you index. At least, that's how they currently work - we can think of
> better ways to do it.
> 
> Also, partitions were not designed to handle that limitation, but rather to
> reduce RAM consumption for large taxonomies. I.e. when facets were on
> payloads, we didn't have that limitation, and frankly, I didn't know DV
> limits you at all...
> 
> The problem is that even if you choose to enable partitions, you need to
> determine a safe partition size to use. E.g. if you have a total of 1M
> categories and you set partition size to 100K, 10 DV fields will be
> created. But there's no guarantee a single document's categories space
> won't fall entirely into one partition... In which case you'll want to set
> partition size to say 5K, but then you'll have 200 DV fields to process
> during search - bad performance!
> 
> I'm not near the code at the moment, but I think that partitions are
> enabled globally to all category lists. Perhaps we can modify the code to
> apply partitions per CLP. That way, you can index just the problematic
> dimension in a different category list so that only that dimension suffers
> during search but the rest are processed regularly?
> 
> Still, can you share some info about this dimension? What sort of
> categories does it cover that docs have thousands of values?
> 
> The reason I ask is that the only scenario I've seen where partitions came
> in handy was IMO an abuse of the facet module ... :-)
> 
> Shai


Re: Big number of values for facets

Posted by Shai Erera <se...@gmail.com>.
Unfortunately partitions are enabled globally and not per document. And you
cannot activate them as you go. It's a setting you need to enable before
you index. At least, that's how they currently work - we can think of
better ways to do it.

Also, partitions were not designed to handle that limitation, but rather to
reduce RAM consumption for large taxonomies. I.e. when facets were on
payloads, we didn't have that limitation, and frankly, I didn't know DV
limits you at all...

The problem is that even if you choose to enable partitions, you need to
determine a safe partition size to use. E.g. if you have a total of 1M
categories and you set partition size to 100K, 10 DV fields will be
created. But there's no guarantee a single document's categories space
won't fall entirely into one partition... In which case you'll want to set
partition size to say 5K, but then you'll have 200 DV fields to process
during search - bad performance!

I'm not near the code at the moment, but I think that partitions are
enabled globally to all category lists. Perhaps we can modify the code to
apply partitions per CLP. That way, you can index just the problematic
dimension in a different category list so that only that dimension suffers
during search but the rest are processed regularly?
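
If I remember correctly, the "different category list" part is already
possible today with PerDimensionIndexingParams (again, 4.2-era API from
memory, so treat it as a sketch); only the per-CLP partition size would
need a code change:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.facet.params.CategoryListParams;
import org.apache.lucene.facet.params.FacetIndexingParams;
import org.apache.lucene.facet.params.PerDimensionIndexingParams;
import org.apache.lucene.facet.taxonomy.CategoryPath;

public class SpeciesCategoryList {
  // route everything under Species/... into its own category list / DV
  // field, keeping all other dimensions in the default "$facets" list
  static FacetIndexingParams create() {
    Map<CategoryPath, CategoryListParams> perDim =
        new HashMap<CategoryPath, CategoryListParams>();
    perDim.put(new CategoryPath("Species"), new CategoryListParams("$species"));
    return new PerDimensionIndexingParams(perDim);
  }
}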

Still, can you share some info about this dimension? What sort of
categories does it cover that docs have thousands of values?

The reason I ask is that the only scenario I've seen where partitions came
in handy was IMO an abuse of the facet module ... :-)

Shai
On Apr 26, 2013 6:04 PM, "Shai Erera" <se...@gmail.com> wrote:

> Hi Nicola,
>
> I think this limit denotes the number of bytes you can write in a single
> DV value. So this actually means much less number of facets you index. Do
> you know how many categories are indexed for that one document?
>
> Also, do you expect to index large number of facets for most documents, or
> is this one extreme example?
>
> Basically I think you can achieve that by enabling partitions. Partitions
> let you split the categories space into smaller sets, so that each DV value
> contains less values, and also the RAM consumption during search is lower
> since FacetArrays is allocated the size of the partition and not the
> taxonomy. But you also incur search performance loss because counting a
> certain dimension requires traversing multiple DV fields.
>
> To enable partitions you need to override FacetIndexingParams partition
> size. You can try to play with it.
>
> In am intetested though to understand the general scenario. Perhaps this
> can be solved some other way...
>
> Shai

Re: Big number of values for facets

Posted by Michael McCandless <lu...@mikemccandless.com>.
This means a single document requires more than 32 KB to store all of
its ordinals ... so that document must have like at least 6K facets?
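
(Back-of-the-envelope: with the default encoding each ordinal is a
delta-coded VInt of up to ~5 bytes, so 32766 bytes is very roughly
32766 / 5 ≈ 6.5K ordinals in the worst case, more if the gaps compress well.)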

Are you sure this isn't a bug in your app?  That's an insanely high
number of facets for one document ...

Mike McCandless

http://blog.mikemccandless.com


On Fri, Apr 26, 2013 at 11:22 AM, Nicola Buso <nb...@ebi.ac.uk> wrote:
> Hi Shai,
>
> I can't say right now how many of these entries I have, I need to trace them,
> but I expect they are exceptions, something like 10 entries, no more.
>
> Can I enable partitions document by document? Should I activate
> partitions if I reach a threshold just for these exceptions?
>
>
> Nicola.
>


Re: Big number of values for facets

Posted by Nicola Buso <nb...@ebi.ac.uk>.
Hi Shai,

I can't say right now how many of these entries I have, I need to trace them,
but I expect they are exceptions, something like 10 entries, no more.

Can I enable partitions document by document? Should I activate
partitions if I reach a threshold just for these exceptions?


Nicola.

On Fri, 2013-04-26 at 18:04 +0300, Shai Erera wrote:
> Hi Nicola,
> 
> I think this limit denotes the number of bytes you can write in a single DV
> value. So the number of facets you can actually index is much smaller. Do you
> know how many categories are indexed for that one document?
> 
> Also, do you expect to index a large number of facets for most documents, or
> is this one extreme example?
> 
> Basically I think you can achieve that by enabling partitions. Partitions
> let you split the categories space into smaller sets, so that each DV value
> contains fewer values, and also the RAM consumption during search is lower
> since FacetArrays is allocated the size of the partition and not the
> taxonomy. But you also incur search performance loss because counting a
> certain dimension requires traversing multiple DV fields.
> 
> To enable partitions you need to override FacetIndexingParams partition
> size. You can try to play with it.
> 
> I am interested, though, to understand the general scenario. Perhaps this
> can be solved some other way...
> 
> Shai


Re: Big number of values for facets

Posted by Shai Erera <se...@gmail.com>.
Hi Nicola,

I think this limit denotes the number of bytes you can write in a single DV
value. So the number of facets you can actually index is much smaller. Do you
know how many categories are indexed for that one document?

Also, do you expect to index a large number of facets for most documents, or
is this one extreme example?

Basically I think you can achieve that by enabling partitions. Partitions
let you split the categories space into smaller sets, so that each DV value
contains fewer values, and also the RAM consumption during search is lower
since FacetArrays is allocated the size of the partition and not the
taxonomy. But you also incur search performance loss because counting a
certain dimension requires traversing multiple DV fields.

To enable partitions you need to override FacetIndexingParams partition
size. You can try to play with it.
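
From memory (4.2-era API, so double-check the method name), overriding the
partition size looks roughly like this:

import org.apache.lucene.facet.index.FacetFields;
import org.apache.lucene.facet.params.FacetIndexingParams;
import org.apache.lucene.facet.taxonomy.TaxonomyWriter;

public class PartitionedFacetFields {
  // FacetIndexingParams with a fixed partition size instead of the default
  // single huge partition
  static FacetFields create(TaxonomyWriter taxoWriter) {
    FacetIndexingParams fip = new FacetIndexingParams() {
      @Override
      public int getPartitionSize() {
        return 100000; // ordinals per partition; tune to your taxonomy size
      }
    };
    return new FacetFields(taxoWriter, fip);
  }
}

The same FacetIndexingParams must also be passed to FacetSearchParams when
you count, or the search side won't see all the partitions.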

I am interested, though, to understand the general scenario. Perhaps this
can be solved some other way...

Shai
On Apr 26, 2013 5:44 PM, "Nicola Buso" <nb...@ebi.ac.uk> wrote:

> Hi all,
>
> I'm encountering a problem indexing a document with a large number of
> values for one facet.
>
> Caused by: java.lang.IllegalArgumentException: DocValuesField "$facets" is too large, must be <= 32766
>         at org.apache.lucene.index.BinaryDocValuesWriter.addValue(BinaryDocValuesWriter.java:57)
>         at org.apache.lucene.index.DocValuesProcessor.addBinaryField(DocValuesProcessor.java:111)
>         at org.apache.lucene.index.DocValuesProcessor.addField(DocValuesProcessor.java:57)
>         at org.apache.lucene.index.TwoStoredFieldsConsumers.addField(TwoStoredFieldsConsumers.java:36)
>         at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:242)
>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
>         at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
>
>
> It's obviously hard to present such a big number of facet values to the
> user, and it's also hard to decide which of these values to skip so that
> the document can be stored in the index.
>
> Do you have any suggestions on how to overcome this limit? Is it possible?
>
>
>
> Nicola
>
>