Posted to solr-user@lucene.apache.org by Manuel Le Normand <ma...@gmail.com> on 2014/03/10 15:02:14 UTC

Indexing useful N-grams and adding payloads

Hi,
I have a performance problem and a scoring problem with phrase queries:

   1. Performance - phrase queries involving frequent terms are very slow
   because large position posting lists have to be read.
   2. Scoring - I want to control the boost of phrase matches and of entity
   matches (entities come from gazetteers).

Indexing all terms as both unigrams and bi-grams is out of the question in my
use case, so I plan to index only the useful bi-grams. Part of this will be
achieved with the CommonGrams filter, to which I will feed the frequent words,
but I am thinking of going one step further and also indexing every phrase
query extracted from my query log and every entity from my gazetteers. To
these (which are N-grams) I will also add a payload to control the boost.

An example MappingCharFilter.txt would be:

#phrase-query
"term1 term2 term3" => "term1_term2_term3|1"
#entity
"firstName lastName" => "firstName_lastName|2"

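To make the intended result concrete, here is a minimal Python sketch of what I expect to happen to the rewritten text downstream: it is split on whitespace, and whatever follows the `|` is read as a numeric payload. This is only an illustration of the idea, not Solr code; in Solr this role would presumably be played by a delimited-payload token filter. The function names and the default payload of 1.0 for unmapped tokens are assumptions made for the example.

```python
def split_payload(token, delimiter="|", default=1.0):
    """Split 'text|payload' into (text, payload as float).

    Tokens without a delimiter get the default payload (an assumption
    of this sketch, not something Solr prescribes).
    """
    text, sep, raw = token.rpartition(delimiter)
    if not sep:  # delimiter not present in the token
        return token, default
    return text, float(raw)


def tokens_with_payloads(mapped_text):
    """Whitespace-tokenize already-mapped text and attach payloads."""
    return [split_payload(tok) for tok in mapped_text.split()]


example = tokens_with_payloads(
    "see term1_term2_term3|1 and firstName_lastName|2"
)
for term, payload in example:
    print(term, payload)
```

The payload values 1 and 2 only tag the match type (phrase vs. entity); how each tag translates into a boost would still have to be decided at scoring time.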
One issue is that I have 100k-1M such phrases/entities (depending on the
frequency threshold). I saw that MappingCharFilter is implemented as an FST,
but I'm still concerned that iterating over the charBuffer of long documents
might cause problems.
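To illustrate the kind of scan I am worried about, here is a rough Python sketch of a greedy longest-match rewrite over a character trie. It is not the Lucene implementation (which uses an FST); the names `build_trie` and `rewrite` and the greedy longest-match policy are assumptions made for the example. In this sketch the work per input position is bounded by the length of the longest rule and does not grow with the number of rules, which is the behaviour I am hoping for.

```python
def build_trie(rules):
    """Build a character trie from a dict of source phrase -> replacement."""
    root = {}
    for source, target in rules.items():
        node = root
        for ch in source:
            node = node.setdefault(ch, {})
        node[None] = target  # the None key marks the end of a rule
    return root


def rewrite(text, trie):
    """Replace the longest rule matching at each position, left to right."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = trie, i
        best_end, best_target = -1, None
        # Walk the trie as far as the input allows, remembering the
        # longest complete rule seen so far.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best_end, best_target = j, node[None]
        if best_target is not None:
            out.append(best_target)
            i = best_end
        else:
            out.append(text[i])  # no rule starts here; copy one character
            i += 1
    return "".join(out)


rules = {
    "term1 term2 term3": "term1_term2_term3|1",
    "firstName lastName": "firstName_lastName|2",
}
trie = build_trie(rules)
print(rewrite("a term1 term2 term3 b firstName lastName", trie))
```

Note that this sketch matches on raw characters and ignores word boundaries, so a rule can also fire inside a longer word.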

Has anyone faced a similar issue? Is this mapping implementation reasonable
performance-wise at query time?

Thanks in advance,
Manuel