Posted to solr-user@lucene.apache.org by Brian Whitman <br...@variogr.am> on 2007/05/22 20:35:04 UTC

slow MLT, how to inject top tf-idf terms on indexing

We're seeing MLT queries that take 10-60 seconds on average to
return, using the latest SOLR-69 patch (as of this morning). Our data
dir is 8.5G with 300K docs, and almost all of those have on average
50-200KB of stored text in thousands of fragments (a multivalued
field, one chunk per sentence). The MLT query uses the stored content
field as its mlt.fl. We have that field storing term vectors via this
schema line:

<field name="content" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

And the Luke request handler says:
<lst name="content">
<str name="type">text</str>
<str name="schema">ITSMVop------</str>
<str name="index">ITS-Vop------</str>
<int name="docs">292398</int>
<int name="distinct">5334989</int>

Some questions:

- While the query shows "waiting" in Firefox, I am watching top and
the Resin java process never breaks 10% of CPU. Is this IO bound?

- Is the multivalued / split by sentence nature of the main text  
field causing any issues? If we reindexed with all the fragments  
going in one large chunk, would it be any faster?

- The speed of MLT did not seem to get any better after we turned on  
termVectors="true" and re-indexed. Shouldn't it have?

- Assuming we're not doing anything wrong and MLT is just a slow
process, one next step we've identified is to index the "top terms"
of the content field as a separate field at index time. We can get
the top N tf-idf tokens after indexing happens, copy them to a field
"content_top", and give that to mlt.fl. Do you think this is wise,
and will it solve the speed issue? One downside is that we'd have to
re-index periodically to make sure the df gets updated properly after
adding more docs, but in our case the Solr index in question is a
"transient" one built from a few other Solr indexes, and we can
rebuild it every day.
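
To make the "top terms" idea concrete, here is a rough sketch of the
selection step in Python. It is only an illustration under stated
assumptions: the function name top_tfidf_terms and the input shape
(doc id mapped to a list of already-analyzed tokens) are hypothetical,
and the weighting is plain textbook tf * (ln(N/df) + 1), which is not
necessarily what Lucene or the MLT handler computes internally. The
returned terms are what would go into "content_top".

```python
import math
from collections import Counter


def top_tfidf_terms(docs, n=10):
    """Return the n highest tf-idf terms for each document.

    docs: dict mapping a doc id to a list of tokens (assumed to be
          already analyzed: lowercased, stopped, stemmed as desired).
    Weighting is an assumption: raw term count * (ln(N / df) + 1).
    """
    num_docs = len(docs)

    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    top_terms = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores = {
            term: count * (math.log(num_docs / df[term]) + 1.0)
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically so the
        # output is deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top_terms[doc_id] = ranked[:n]
    return top_terms
```

Because df is computed over the whole collection, the chosen terms
drift as documents are added, which is the reason for the periodic
rebuild mentioned above.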