Posted to users@jackrabbit.apache.org by "Dolan, Kelly" <kd...@inmedius.com> on 2011/05/02 19:58:09 UTC

how to provide full text search without negatively impacting performance???

Our application supports an import feature that adds (or updates)
multiple documents in the repository at one time.  Our customers use
this feature often, with very large numbers of documents.  We have
learned that when full text search is enabled, import performance is
unacceptable.  With full text search disabled, performance becomes
acceptable.

 

We need to support full text searching, but we cannot afford to incur
the performance impact at the time an import is performed.  I have been
searching for a solution but have not found much so far, so I am now
looking for advice.

 

Questions:

1. Is there a way to have full text search enabled but perform indexing
as a background task in a separate, low-priority thread?

 

2. Is there a way to enable/disable full text search (to temporarily
disable indexing) at run-time?

 

3. If so, and we disable indexing while we import many documents and
then re-enable it:

   a. does Jackrabbit automatically start to index these new / updated
documents?

   b. is it possible to know which documents have been indexed and which
have not?

   c. is there a way to tell Jackrabbit to index these new / updated
documents at run-time? or

   d. does the entire index need to be rebuilt, and if so, can this be
triggered at run-time?

 

4. Is any of this possible with Jackrabbit out-of-the-box?  Or would we
need to make custom modifications to Jackrabbit (which is not
desirable)?

 

5. Is there a different way I should be thinking about this, or
something I haven't thought of?

 

Any and all comments are appreciated!  Thanks in advance...

 

Kelly

 

p.s. Below is the <SearchIndex> element in our repository.xml.
Currently, if we want to re-enable full text search, we uncomment the
textFilterClasses parameter, delete the search indices (I think) and
re-start JBoss.

 

Snippet from repository.xml:

        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
          <param name="path" value="${wsp.home}/index"/>
          <param name="useCompoundFile" value="true"/>
          <param name="minMergeDocs" value="100"/>
          <param name="volatileIdleTime" value="3"/>
          <param name="maxMergeDocs" value="100000"/>
          <param name="mergeFactor" value="10"/>
          <param name="maxFieldLength" value="10000"/>
          <param name="bufferSize" value="10"/>
          <param name="cacheSize" value="1000"/>
          <param name="forceConsistencyCheck" value="false"/>
          <param name="autoRepair" value="true"/>
          <param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
          <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl"/>
          <param name="respectDocumentOrder" value="true"/>
          <param name="resultFetchSize" value="2147483647"/>
          <param name="extractorPoolSize" value="5"/>
          <param name="extractorTimeout" value="5000"/>
          <param name="extractorBackLogSize" value="100"/>
          <!-- Uncomment this version to use full text indexing.
               BEWARE: This will have major impact on the performance of imports
          <param name="textFilterClasses" value="org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.PlainTextExtractor" />
          -->
          <param name="textFilterClasses" value="" />
        </SearchIndex>
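
For reference, the "delete the search indices" step from the p.s. could be scripted roughly as follows.  This is only a sketch under assumptions: it takes the workspace home directory as an argument and assumes the index lives in an `index` subdirectory beneath it, mirroring the `path` value of `${wsp.home}/index` in the snippet.  The function name `delete_search_index` is hypothetical, not part of Jackrabbit, and the script should only be run while the repository (and JBoss) is stopped.  Whether Jackrabbit then rebuilds the index on the next start should be verified for the version in use.

```python
import shutil
from pathlib import Path


def delete_search_index(workspace_home: str) -> bool:
    """Remove <workspace_home>/index (hypothetical helper, not a Jackrabbit API).

    Returns True if an index directory was found and removed,
    False if there was nothing to delete.
    """
    # Assumption: mirrors <param name="path" value="${wsp.home}/index"/>
    index_dir = Path(workspace_home) / "index"
    if not index_dir.is_dir():
        return False
    shutil.rmtree(index_dir)  # recursively deletes the index files
    return True
```

The boolean return value is there so a wrapper script can log whether an index was actually present before the restart.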