Posted to solr-user@lucene.apache.org by Kydryavtsev Andrey <we...@yandex.ru> on 2014/10/13 23:10:52 UTC

UnInvertedField implementation details

Hi,

I have a question about UnInvertedField internals. I use it for faceting on a multi-term field, and at some point I got the exception "Too many values for UnInvertedField faceting on field ...". I googled it a bit (http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/68000) and looked into the code. According to the UnInvertedField (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/request/UnInvertedField.java) JavaDoc:

There is a single int[maxDoc()] which either contains a pointer into a byte[] for the termNumber lists, or directly contains the termNumber list if it fits in the 4 bytes of an integer. If the first byte in the integer is 1, the next 3 bytes are a pointer into a byte[] where the termNumber list starts. There are actually 256 byte arrays, to compensate for the fact that the pointers into the byte arrays are only 3 bytes long. The correct byte array for a document is a function of its id.

This means that UnInvertedField holds a single int[maxDoc()] (one entry per document) plus a two-dimensional byte[256][] for the termNumber lists. To map a docId to one of the 256 arrays, the following formula is used:

int whichArray = (doc >>> 16) & 0xff;

So it is not a "mod" that would spread documents uniformly: each array covers a contiguous block of 65536 docIds, and the number of arrays actually in use grows with max_doc. The exception happens when one of these 256 arrays becomes too big.
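As far as I understand it, the lookup works roughly like this (my own sketch based on the JavaDoc wording, not the verbatim Solr source; "index" stands for the int[maxDoc()] and "tnums" for the byte[256][]):

    int code = index[doc];
    if ((code & 0xff) == 1) {
        // low byte == 1: the remaining 3 bytes are an offset into one of the 256 byte arrays
        int whichArray = (doc >>> 16) & 0xff;
        byte[] arr = tnums[whichArray];
        int offset = code >>> 8;
        // the delta-encoded termNumber list starts at arr[offset]
    } else {
        // the termNumber list itself is packed directly into the int's 4 bytes
    }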

I am fairly sure that my documents don't have > 4 million unique terms. But I do index "updates" in the following cycle (a SolrJ sketch follows the list):
     - I have only one segment, with max_doc=N.
     - Every 15 minutes I load a new batch of documents (new document versions) whose unique IDs already exist in the index.
     - I index them. Now I have two segments: the old one with "deleted" documents and a new one with the new "versions" of the deleted documents.
     - I merge the segments: I run an index optimize to get back to a single segment with the same max_doc=N.
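In SolrJ 4.x terms, the cycle is roughly this (the core URL is a placeholder and "newBatch" stands for my fresh batch of document versions):

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
    server.add(newBatch);   // same uniqueKey values as existing docs, so the old versions are delete-marked
    server.commit();        // a second segment appears, holding the new "versions"
    server.optimize();      // merge back to a single segment with the same max_doc=N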

I started checking the UnInvertedField state during the daily updates and found a strange thing. My max_doc is about 700000, so according to the formula above the pointers are stored in the first eleven arrays (indices 0 through 10). During the daily updates the last array (index 10) grew bigger and bigger, even though max_doc after each merge stays constant, and finally it exceeded the limit.
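Double-checking that with the formula:

    int maxDoc = 700000;                  // approximate size of my index
    int lastArray = (maxDoc - 1) >>> 16;  // == 10, so arrays 0..10 are in use
    // each array covers a block of 1 << 16 = 65536 consecutive docIds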

How could that happen? Could it be a problem with my "updates" - am I somehow increasing the number of terms per document instead of replacing them? Or is it a detail of the UnInvertedField implementation? Also, what would happen if I changed that formula to a uniform distribution? What is the point of using a few big arrays instead of many relatively small ones?
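For concreteness, the hypothetical change I have in mind (not anything that exists in Solr) is:

    int whichArray = doc % 256;               // neighbouring docIds spread across all 256 arrays
    // instead of the current block assignment:
    // int whichArray = (doc >>> 16) & 0xff;  // docIds 0..65535 all land in array 0, and so on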

Thanks for any help.