Posted to solr-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2013/05/22 00:23:24 UTC

Solr 4.x replacement for termsIndexDivisor

Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms
 ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
 ).

In Solr 3.6 and earlier we needed to reduce the memory used for storing
the in-memory representation of the tii file. We originally used the
termInfosIndexDivisor, which affects the sampling of the tii file when it
is read into memory. Later we used the termIndexInterval.
Please see
http://lucene.472066.n3.nabble.com/Solr-4-0-Beta-termIndexInterval-vs-termIndexDivisor-vs-termInfosIndexDivisor-tt4006182.html
for more background.
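
To make the effect of those two settings concrete, here is a rough sketch of the arithmetic. It assumes the Lucene 3.x behaviour of writing every termIndexInterval-th term to the tii file (default 128) and loading every divisor-th of those entries into memory. The class and method names are made up for illustration and are not part of Lucene or Solr, and the divisor of 12 is simply the value from the solrconfig.xml excerpt at the end of this message.

```java
// Illustrative only: approximates how many terms end up in the
// in-memory term index for a given interval and divisor.
public final class TermIndexSampling {

    private TermIndexSampling() {}

    /** Approximate number of entries written to the tii file (every interval-th term). */
    static long indexedTerms(long uniqueTerms, int termIndexInterval) {
        return (uniqueTerms + termIndexInterval - 1) / termIndexInterval;
    }

    /** Approximate number of tii entries held in memory (every divisor-th entry). */
    static long termsInMemory(long uniqueTerms, int termIndexInterval, int divisor) {
        long indexed = indexedTerms(uniqueTerms, termIndexInterval);
        return (indexed + divisor - 1) / divisor;
    }

    public static void main(String[] args) {
        long uniqueTerms = 2_000_000_000L; // "over 2 billion unique terms"
        int interval = 128;                // assumed Lucene 3.x default

        System.out.println("divisor 1:  " + termsInMemory(uniqueTerms, interval, 1));
        System.out.println("divisor 12: " + termsInMemory(uniqueTerms, interval, 12));
    }
}
```

Raising the interval has the same multiplicative effect as raising the divisor, but it takes effect at index time rather than when the reader is opened.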

Neither of these works with the default postings format in Solr 4.x. However,
in the latest Solr 4.x example/solrconfig.xml file there is commented-out
text that implies that you can still use setTermIndexDivisor (appended
below). That should probably be removed from the example if it does not
work in Solr 4.x.

At the Lucene level there are parameters that affect the size of the
in-memory representation of the index to the terms dictionary (tip file).
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html

In the Javadoc for IndexWriterConfig.setTermIndexInterval, there is the
following statement:

*"This parameter does not apply to all PostingsFormat implementations,
including the default one in this release. It only makes sense for term
indexes that are implemented as a fixed gap between terms. For example,
Lucene41PostingsFormat<http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html> implements
the term index instead based upon how terms share prefixes. To
configure its parameters (the minimum and maximum size for a block), you
would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int,
int)<http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29>.
which can also be configured on a per-field basis"*
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29

This is followed by an example of how to set the min and max block size in
Lucene.

Is the ability to set the min and max block size available in Solr?

If not, should I open a JIRA?


Tom
----------
Excerpt from the latest rev of the Solr 4.3 example/solrconfig.xml file:

http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/example/solr/collection1/conf/solrconfig.xml?revision=1470617&view=co

<!-- By explicitly declaring the Factory, the termIndexDivisor can
       be specified.
    --><!--
     <indexReaderFactory name="IndexReaderFactory"
                         class="solr.StandardIndexReaderFactory">
       <int name="setTermIndexDivisor">12</int>
     </indexReaderFactory >
    -->