Posted to solr-user@lucene.apache.org by Fuad Efendi <fu...@tokenizer.ca> on 2012/08/20 22:34:04 UTC

UnInvertedField limitations

Hi All,


I have a problem... (Yonik, please help me!) What are the term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - : org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field enrich_keywords_string_mv
        at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
        at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
        at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca




Re: UnInvertedField limitations

Posted by Yonik Seeley <yo...@lucidworks.com>.
The pointer to a document's term list in a byte[] is actually limited to
24 bits, but there are 256 different arrays, so the maximum capacity is
4B (2^32) bytes of un-inverted terms. Each array is limited to 4B/256 = 16M
bytes, though, so the real limit can come in a little lower, depending on
how the documents happen to be spread across the arrays.

From the comments:

 *   There is a single int[maxDoc()] which either contains a pointer into a byte[] for
 *   the termNumber lists, or directly contains the termNumber list if it fits in the 4
 *   bytes of an integer.  If the first byte in the integer is 1, the next 3 bytes
 *   are a pointer into a byte[] where the termNumber list starts.
 *
 *   There are actually 256 byte arrays, to compensate for the fact that the pointers
 *   into the byte arrays are only 3 bytes long.  The correct byte array for a document
 *   is a function of it's id.
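
The arithmetic behind those comments can be sketched in a few lines of Java.
This is only an illustration, not the Solr source: the class name, the choice
of the low byte as the marker, and the way an array is picked from the
document id are all assumptions made for the example. It shows why a 24-bit
pointer caps each array at 16M bytes and why 256 arrays give about 4B bytes
in total.

```java
// Illustrative sketch only; names and bit layout are assumptions, not Solr's code.
final class UnInvertedPointerSketch {
    static final int POINTER_BITS = 24;  // "the next 3 bytes are a pointer"
    static final int NUM_ARRAYS = 256;   // "there are actually 256 byte arrays"

    /** Number of bytes a single array can address with a 24-bit pointer. */
    static long bytesPerArray() {
        return 1L << POINTER_BITS;
    }

    /** Upper bound on un-inverted bytes across all arrays. */
    static long totalBytes() {
        return NUM_ARRAYS * bytesPerArray();
    }

    /** Packs an offset into an int slot; a low byte of 1 marks a pointer (assumed layout). */
    static int encodePointer(int offset) {
        if (offset < 0 || offset >= bytesPerArray()) {
            throw new IllegalStateException("offset " + offset + " does not fit in 24 bits");
        }
        return (offset << 8) | 1;
    }

    /** True if the slot carries the pointer marker. A real implementation must
     *  also make sure inline term lists never produce this marker. */
    static boolean isPointer(int slot) {
        return (slot & 0xFF) == 1;
    }

    /** Recovers the 24-bit offset from a pointer slot. */
    static int decodePointer(int slot) {
        return slot >>> 8;
    }

    /** Hypothetical way to choose one of the 256 arrays from a document id. */
    static int arrayFor(int docId) {
        return (docId >>> 16) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println("bytes per array: " + bytesPerArray());
        System.out.println("total bytes:     " + totalBytes());
        int slot = encodePointer(123_456);
        System.out.println(isPointer(slot) + " " + decodePointer(slot));
    }
}
```

Once the term lists of the documents mapped to one array need more than 16M
bytes, an offset no longer fits in the pointer, which is the kind of condition
the "Too many values" error reports.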


-Yonik
http://lucidworks.com


On Thu, Sep 6, 2012 at 6:33 PM, Fuad Efendi <fu...@efendi.ca> wrote:
> Hi Jack,
>
>
> 24 bits => 16M possibilities, that's clear. The rest is unclear, though:
> why would 4 bytes allow only 4 million cardinality? I thought it was 4
> billion...
>
>
> And, just to confirm: UnInvertedField allows 16M cardinality, correct?

Re: UnInvertedField limitations

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Lance,


The use case is "keyword extraction", and the keywords can be 2- and 3-grams
(2- and 3-word phrases), so theoretically we can have
10,000^3 = 1,000,000,000,000 3-grams for English alone... Of course, my
suggestion is to use statistics to build a dictionary of such 3-word
combinations (remove the top and the tail by frequency) and to hard-limit
this dictionary to 1,000,000 entries... That was a business requirement which
is technically impossible to implement (as real-time query results); we
don't even use word stemming, etc.
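
A minimal Java sketch of the dictionary idea described above: count n-grams,
drop those that are too frequent or too rare, and cap the result. The class
name, method names and thresholds are hypothetical and chosen only for
illustration; they are not part of any existing pipeline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and thresholds are assumptions.
final class NGramDictionarySketch {

    /** Counts every run of n consecutive tokens, joined with single spaces. */
    static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps n-grams with minCount <= count <= maxCount, most frequent first
     *  (ties broken alphabetically), and returns at most limit of them. */
    static List<String> prune(Map<String, Integer> counts, int minCount, int maxCount, int limit) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount && e.getValue() <= maxCount) {
                kept.add(e);
            }
        }
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> dictionary = new ArrayList<>();
        for (int i = 0; i < kept.size() && i < limit; i++) {
            dictionary.add(kept.get(i).getKey());
        }
        return dictionary;
    }
}
```

Indexing only the n-grams that survive the pruning would bound the number of
distinct terms in the facet field by the chosen limit.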




-Fuad




On 12-08-20 7:22 PM, "Lance Norskog" <go...@gmail.com> wrote:

>Is this required by your application? Is there any way to reduce the
>number of terms?
>
>A workaround is to use shards. If your terms follow Zipf's Law, each
>shard will have fewer than the complete number of terms. For N shards,
>each shard will have ~1/N of the singleton terms. For 2-count terms,
>1 or 2 of the N shards will have that term.
>
>Now I'm interested but not mathematically capable: what is the general
>probabilistic formula for splitting Zipf's Law across shards?



Re: UnInvertedField limitations

Posted by Lance Norskog <go...@gmail.com>.
Is this required by your application? Is there any way to reduce the
number of terms?

A workaround is to use shards. If your terms follow Zipf's Law, each
shard will have fewer than the complete number of terms. For N shards,
each shard will have ~1/N of the singleton terms. For 2-count terms,
1 or 2 of the N shards will have that term.

Now I'm interested but not mathematically capable: what is the general
probabilistic formula for splitting Zipf's Law across shards?
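
One way to approach the question, as a rough sketch rather than a closed-form
Zipf result: assume every occurrence of a term sits in a different document
and that documents are routed to the N shards uniformly and independently.
A term occurring in c documents then appears on a given shard with probability
1 - (1 - 1/N)^c, and summing that over all terms gives the expected number of
distinct terms per shard. The class and method names below are hypothetical.

```java
// Illustrative sketch only; assumes uniform, independent routing of documents to shards.
final class ShardTermEstimateSketch {

    /** Probability that a term found in docCount documents shows up on one given shard. */
    static double probabilityOnShard(long docCount, int shards) {
        return 1.0 - Math.pow(1.0 - 1.0 / shards, docCount);
    }

    /** Expected number of distinct terms on one shard, given each term's document count. */
    static double expectedDistinctTermsPerShard(long[] termDocCounts, int shards) {
        double expected = 0.0;
        for (long docCount : termDocCounts) {
            expected += probabilityOnShard(docCount, shards);
        }
        return expected;
    }

    public static void main(String[] args) {
        // Two singleton terms and one 2-count term, split across 2 shards.
        System.out.println(expectedDistinctTermsPerShard(new long[] {1, 1, 2}, 2));
    }
}
```

For singleton terms the probability reduces to 1/N, matching the estimate
above; frequent terms approach probability 1 and are present on every shard,
so sharding mainly thins out the long tail.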

On Mon, Aug 20, 2012 at 3:51 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> It appears that there is a hard limit of 24 bits, or 16M, on the number of
> bytes used to reference the terms in a single field of a single document. It
> takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes, that
> would allow 16M/4, or 4 million, unique terms per document. Do you have such
> large documents? This appears to be a hard limit based on 24 bits in a Java
> int.
>
> You can try facet.method=enum, but that may be too slow.
>
> What release of Solr are you running?
>
> -- Jack Krupansky



-- 
Lance Norskog
goksron@gmail.com

Re: UnInvertedField limitations

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Jack,


24 bits => 16M possibilities, that's clear. The rest is unclear, though:
why would 4 bytes allow only 4 million cardinality? I thought it was 4
billion...


And, just to confirm: UnInvertedField allows 16M cardinality, correct?




On 12-08-20 6:51 PM, "Jack Krupansky" <ja...@basetechnology.com> wrote:

>It appears that there is a hard limit of 24 bits, or 16M, on the number of
>bytes used to reference the terms in a single field of a single document. It
>takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes, that
>would allow 16M/4, or 4 million, unique terms per document. Do you have such
>large documents? This appears to be a hard limit based on 24 bits in a Java
>int.
>
>You can try facet.method=enum, but that may be too slow.
>
>What release of Solr are you running?
>
>-- Jack Krupansky



Re: UnInvertedField limitations

Posted by Jack Krupansky <ja...@basetechnology.com>.
It appears that there is a hard limit of 24 bits, or 16M, on the number of
bytes used to reference the terms in a single field of a single document. It
takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes, that
would allow 16M/4, or 4 million, unique terms per document. Do you have such
large documents? This appears to be a hard limit based on 24 bits in a Java
int.

You can try facet.method=enum, but that may be too slow.
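
As a rough illustration of what such a request could look like, the Java
sketch below builds a select URL that facets on the field from the error
message with facet.method=enum. The base URL (host, port, no core name) and
the class and method names are assumptions; adjust them to the actual
deployment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative sketch only; the base URL and helper names are assumptions.
final class FacetEnumRequestSketch {

    /** Builds a facet-only query (rows=0) that asks Solr to use the enum facet method. */
    static String buildFacetUrl(String baseUrl, String field) {
        return baseUrl + "/select"
                + "?q=" + encode("*:*")
                + "&rows=0"
                + "&facet=true"
                + "&facet.field=" + encode(field)
                + "&facet.method=enum";
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(buildFacetUrl("http://localhost:8983/solr", "enrich_keywords_string_mv"));
    }
}
```

With facet.method=enum the field is not un-inverted, so the 24-bit pointer
limit does not apply, at the cost of walking every term in the field.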

What release of Solr are you running?

-- Jack Krupansky





UnInvertedField limitations

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi All,


I have a problem... (Yonik, please help me!) What are the term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000? Can I
temporarily disable this feature?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - : org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field enrich_keywords_string_mv
        at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
        at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
        at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca