
Posted to solr-user@lucene.apache.org by Robust Links <pe...@robustlinks.com> on 2014/10/24 19:51:03 UTC

phrase query in solr 4

Hi

We are trying to upgrade our index from 3.6.1 to 4.9.1, and I wanted to make
sure our existing indexing strategy is still valid. The statistics
of the raw corpus are:

- 4.8 billion tokens in the entire corpus

- 13 million documents


We have 3 requirements


1) we want to index and search all tokens in a document (i.e. we do not
rely on external stores)

2) we need search time to be fast and are willing to pay for it with longer
indexing time and a larger index

3) we need to search ngrams of 3 tokens or fewer (i.e., unigrams, bigrams
and trigrams) as fast as possible


To satisfy (1) we used the default
<maxFieldLength>2147483647</maxFieldLength> in solrconfig.xml of the 3.6.1
index to specify the total number of tokens to index in an article. In Solr 4
we are specifying it via the tokenizer in the analyzer chain:


 <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>


To satisfy (2) and (3) in our 3.6.1 index we indexed using the following
ShingleFilterFactory in the analyzer chain:


<filter class="solr.ShingleFilterFactory" outputUnigrams="true" maxShingleSize="3"/>
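
To make concrete what this filter configuration puts in the index, here is a
rough Python sketch of the terms produced for a short token stream. It only
approximates the output of a shingle filter with unigrams enabled and a
maximum shingle size of 3; the function name, its parameters and the sample
tokens are made up for illustration, and this is not the Lucene
implementation.

```python
def shingles(tokens, max_size=3, output_unigrams=True, sep=" "):
    """Return the word n-grams (shingles) starting at each token position.

    Illustrative approximation only: hypothetical helper, not Lucene code.
    """
    min_size = 1 if output_unigrams else 2
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            # Stop once the shingle would run past the end of the stream.
            if start + size > len(tokens):
                break
            out.append(sep.join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    for term in shingles(["phrase", "query", "in", "solr"]):
        print(term)
```

With terms like "phrase query in" stored as single index terms, a phrase of up
to three words can be looked up as one term instead of being evaluated as a
positional phrase query, which is the index-size-for-speed trade described in
requirement (2).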


This was based on this thread:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200808.mbox/%3C856ac15f0808161539p54417df2ga5a6fdfa35889851@mail.gmail.com%3E


The open questions we are trying to understand now are:


1) is shingling still the best strategy for phrase (ngram) search given our
requirements above?

2) if not, what would be a better strategy?


Thank you in advance for your help.


Peyman