Posted to solr-user@lucene.apache.org by mdz-munich <se...@bsb-muenchen.de> on 2010/12/13 14:23:46 UTC

Query-Expansion, copyFields, flexibility and size of Index (Solr-3.1-SNAPSHOT)

Hi all,

we want to do query expansion with synonyms and word forms at query time.

Assuming we want to query all fields (text & synonyms/word forms) with
different boosts using dismax, we need the following setup (simplified):

<fieldType name="text" class="solr.TextField" positionIncrementGap="0"
sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
  <analyzer type="query"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
</fieldType>

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="0"
sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
  <analyzer type="query"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
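
To make the query-side analyzer of text_syn concrete: with expand="true", a query token that appears in a comma-separated line of syn.txt is replaced by every term on that line, while the index side stays plain whitespace tokens. The following is only a rough Python sketch of that behaviour, not Solr's implementation. The function names and the sample synonym line are made up for illustration, and "=>" mappings and multi-word synonyms are deliberately not handled.

```python
def load_synonyms(lines, ignore_case=True):
    """Build a term -> synonym-group table from comma-separated lines.

    Hypothetical helper: mimics only the simple "a, b, c" form of a
    synonyms file with expand=true; lines containing "=>" are skipped.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" in line:
            continue
        group = [term.strip() for term in line.split(",") if term.strip()]
        if ignore_case:
            group = [term.lower() for term in group]
        for term in group:
            entry = table.setdefault(term, [])
            for other in group:
                if other not in entry:
                    entry.append(other)
    return table


def expand_query(query, table, ignore_case=True):
    """Split on whitespace and expand each token to its synonym group."""
    expanded = []
    for token in query.split():
        key = token.lower() if ignore_case else token
        # Tokens without a synonym entry pass through unchanged.
        expanded.append(table.get(key, [token]))
    return expanded


# Made-up synonym line, for illustration only.
synonyms = load_synonyms(["book, volume, tome"])
result = expand_query("old Book", synonyms)
```

Each inner list holds the alternatives that would be searched at one token position.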

Furthermore, we need two fields:

<field name="fulltext" type="text" indexed="true" stored="true"
multiValued="true" />
<field name="fulltext_syn" type="text_syn" indexed="true" stored="false"
multiValued="true" />

Last but not least we have to copy the fulltext-field into our
fulltext_syn-field:

<copyField source="fulltext" dest="fulltext_syn" />

Now we can query both fields with "qt=dismax&q=searchterms&qf=fulltext^2.0
fulltext_syn^1.0" etc.
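
For reference, here is how a client might assemble that request. The qf value contains a space and carets, so it has to be URL-encoded. This is a minimal sketch: the host, port and /select path are assumptions about a default local Solr install, and build_dismax_url is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import urlencode

# Assumed location of the Solr select handler; adjust to your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_dismax_url(terms, boosts, base_url=SOLR_SELECT_URL):
    """Return a dismax request URL querying each field with its boost."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    params = {"qt": "dismax", "q": terms, "qf": qf}
    return base_url + "?" + urlencode(params)


url = build_dismax_url("searchterms", {"fulltext": 2.0, "fulltext_syn": 1.0})
```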

That seems to work out very well. But now comes the dark side of the Force:
We quickly realized that every copyField instruction results in a full copy
of that field, even if the index-time analyzers of both field types
(text & text_syn) are exactly identical. The result is a 10% larger
index and, furthermore, a less flexible application, because for every
retrieval feature that relies on query expansion or other special
query-time analysis, we have to copy that field into a new field and
re-index all the data.

We are thinking about something like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="0"
sortMissingLast="true">
  <analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
  <analyzer type="query"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
  <analyzer type="query_syn"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

With a request like:
"qt=dismax&q=searchterms&qf=fulltext.ana.query^2.0
fulltext.ana.query_syn^1.0" etc.

That would be much more flexible, precisely because we wouldn't have to
re-index all the data for every copyField instruction. Furthermore,
it would reduce storage consumption by about 10%.

Any ideas on that? Any other solutions? 


Best regards,

Sebastian from Munich, Bavaria, Germany
   





-- 
View this message in context: http://lucene.472066.n3.nabble.com/Query-Expansion-copyFields-flexibility-and-size-of-Index-Solr-3-1-SNAPSHOT-tp2078573p2078573.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Query-Expansion, copyFields, flexibility and size of Index (Solr-3.1-SNAPSHOT)

Posted by mdz-munich <se...@bsb-muenchen.de>.

Okay, I'll start guessing:

- Do we have to write a customized QueryParserPlugin?
- At which point does the RequestHandler/QueryParser/whatever decide which
query analyzer to use?

10% for every copied field is a lot for us; we're facing terabytes of
digitized book data. So we want to keep the index simple, small and flexible
and just add IR functionality at query time.

Greetings & thank you,

Sebastian