Posted to commits@jackrabbit.apache.org by ch...@apache.org on 2015/07/15 08:34:49 UTC
svn commit: r1691130 - in /jackrabbit/site/live/oak/docs/query: lucene.html pre-extracted-text-osgi.png

Author: chetanm
Date: Wed Jul 15 06:34:49 2015
New Revision: 1691130

URL: http://svn.apache.org/r1691130
Log:
OAK-2892 - Speed up lucene indexing post migration by pre extracting the text content from binaries

Publishing the docs

Added:
    jackrabbit/site/live/oak/docs/query/pre-extracted-text-osgi.png   (with props)
Modified:
    jackrabbit/site/live/oak/docs/query/lucene.html

Modified: jackrabbit/site/live/oak/docs/query/lucene.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/lucene.html?rev=1691130&r1=1691129&r2=1691130&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/query/lucene.html (original)
+++ jackrabbit/site/live/oak/docs/query/lucene.html Wed Jul 15 06:34:49 2015
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia at 2015-07-06
+ | Generated by Apache Maven Doxia at 2015-07-15
  | Rendered using Apache Maven Fluido Skin 1.3.0
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20150706" />
+    <meta name="Date-Revision-yyyymmdd" content="20150715" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak - Lucene Index</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
@@ -210,7 +210,7 @@
         <ul class="breadcrumb">
                 
                     
-                  <li id="publishDate">Last Published: 2015-07-06</li>
+                  <li id="publishDate">Last Published: 2015-07-15</li>
                   <li class="divider">|</li> <li id="projectVersion">Version: 1.4-SNAPSHOT</li>
                       
                 
@@ -779,7 +779,7 @@
 <p><a name="include-exclude"></a></p></div>
 <div class="section">
 <h5>Include and Exclude paths from indexing<a name="Include_and_Exclude_paths_from_indexing"></a></h5>
-<p><i>Since 1.0.14+ and 1.2.3+</i></p>
+<p><tt>@since Oak 1.0.14, 1.2.3</tt></p>
 <p>By default the indexer indexes all nodes under the subtree where the index definition is defined, as per the indexingRule. In some cases it is required to index only nodes under certain paths. For example, if the index is a global fulltext index covering the complete repository, you might want to exclude certain paths which contain transient system data. </p>
 <p>For example, if your application stores certain logs under <tt>/var/log</tt> and they are not supposed to be indexed as part of the fulltext index, then that path can be excluded</p>
 
@@ -893,8 +893,8 @@
         - relativeNode = true
 </pre></div></div>
 <div class="section">
-<h4>Analyzers (1.1.6)<a name="Analyzers_1.1.6"></a></h4>
-<p><i>This feature is currently not part of 1.0 branch and is only present in unstable 1.x releases</i></p>
+<h4>Analyzers<a name="Analyzers"></a></h4>
+<p><tt>@since Oak 1.2.0</tt></p>
 <p>Analyzers can be configured as part of index definition via <tt>analyzers</tt> node. The default analyzer can be configured via <tt>analyzers/default</tt> node</p>
 
 <div class="source">
@@ -994,12 +994,15 @@
 <dd>Enable copying of Lucene index to local file system to improve indexing performance. See <a href="#copy-on-write">Copy Indexes On Write</a></dd>
 <dt>localIndexDir</dt>
 <dd>Directory to be used when copying index files to the local file system. To be specified when <tt>enableCopyOnReadSupport</tt> is enabled</dd>
+<dt>prefetchIndexFiles</dt>
+<dd>Prefetch the index files when CopyOnRead is enabled. When enabled, all new Lucene index files are copied locally before the index is made available to the QueryEngine (since 1.0.17, 1.2.3)</dd>
 <dt>debug</dt>
 <dd>Boolean value. Defaults to <tt>false</tt></dd>
 <dd>If enabled then Lucene logging would be integrated with Slf4j</dd>
 </dl></div>
 <div class="section">
-<h3>Tika Config (1.0.12)<a name="Tika_Config_1.0.12"></a></h3>
+<h3>Tika Config<a name="Tika_Config"></a></h3>
+<p><tt>@since Oak 1.0.12, 1.2.3</tt></p>
 <p>Oak Lucene uses <a class="externalLink" href="http://tika.apache.org/">Apache Tika</a> to extract the text from binary content</p>
 
 <div class="source">
@@ -1080,7 +1083,7 @@
 <p><a name="copy-on-write"></a></p></div>
 <div class="section">
 <h3>CopyOnWrite<a name="CopyOnWrite"></a></h3>
-<p><i>Since 1.0.15 and 1.2.3</i></p>
+<p><tt>@since Oak 1.0.15, 1.2.3</tt></p>
 <p>Similar to <i>CopyOnRead</i> feature Oak Lucene also supports <i>CopyOnWrite</i> to enable faster indexing by first buffering the writes to local filesystem and transferring them to remote storage asynchronously as the indexing proceeds. This should provide better performance and hence faster indexing times.</p>
 <p><b>indexPath</b></p>
 <p>To speed up the indexing with CopyOnWrite you would also need to set <tt>indexPath</tt> in index definition to the path of index in the repository. For e.g. if your index is defined at <tt>/oak:index/lucene</tt> then value of <tt>indexPath</tt> should be set to <tt>/oak:index/lucene</tt>. This would enable the indexer to perform any read during the indexing process locally and thus avoid costly read from remote</p>
@@ -1135,12 +1138,57 @@ Copied 8.5 MB in 218.7 ms
 </pre></div></li>
 </ol>
 <p>From the Luke UI shown you can access various details.</p>
+<p><a name="text-extraction"></a></p></div>
 <div class="section">
-<h4>Advanced search features<a name="Advanced_search_features"></a></h4>
+<h3>Pre-Extracting Text from Binaries<a name="Pre-Extracting_Text_from_Binaries"></a></h3>
+<p><tt>@since Oak 1.0.18, 1.2.3</tt></p>
+<p>Lucene indexing is performed in a single-threaded mode. Extracting text from binaries is an expensive operation and slows down the indexing rate considerably. For incremental indexing this mostly works fine, but when performing a reindex or creating the index for the first time after a migration it increases the indexing time considerably. </p>
+<p>To speed up Lucene indexing in such cases, i.e. reindexing, the text extraction can be decoupled from the actual indexing. </p>
+
+<ol style="list-style-type: decimal">
+  
+<li>Extract the text from binaries and store it via the <a class="externalLink" href="https://github.com/apache/jackrabbit-oak/tree/trunk/oak-run#tika">oak-run tool</a></li>
+  
+<li>Configure a <tt>PreExtractedTextProvider</tt> which can look up the extracted text and thus avoid text extraction at the time of actual indexing</li>
+</ol>
+<p>Below are the steps required to make use of this feature</p>
+
+<ol style="list-style-type: decimal">
+  
+<li>
+<p>Generate the CSV file containing binary file details</p>
+  
+<div class="source">
+<pre>java -cp tika-app-1.8.jar:oak-run.jar \
+org.apache.jackrabbit.oak.run.Main tika \
+--fds-path /path/to/datastore \
+--nodestore /path/to/segmentstore --data-file binary-stats.csv generate
+</pre></div></li>
+  
+<li>
+<p>Extract the text </p>
+  
+<div class="source">
+<pre>java -cp tika-app-1.8.jar:oak-run.jar \
+org.apache.jackrabbit.oak.run.Main tika \
+--data-file binary-stats.csv \
+--store-path ./store \
+--fds-path /path/to/datastore  extract
+</pre></div></li>
+  
+<li>
+<p>Configure the <tt>PreExtractedTextProvider</tt> - Once the extraction is performed, configure a <tt>PreExtractedTextProvider</tt> within the application so that the Lucene indexer can use it to look up the extracted text. </p>
+<p>To do this, look for the OSGi config for <tt>Apache Jackrabbit Oak DataStore PreExtractedTextProvider</tt></p>
+<p><img src="pre-extracted-text-osgi.png" alt="OSGi Configuration" /> </p></li>
+</ol>
+<p>Once the <tt>PreExtractedTextProvider</tt> is configured, the Lucene indexer uses it upon reindexing to check whether text needs to be extracted or not. Check <tt>TextExtractionStatsMBean</tt> for various statistics around text extraction and also to validate that the <tt>PreExtractedTextProvider</tt> is being used.</p>
+<p>For more details on this feature refer to <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-2892">OAK-2892</a></p></div>
+<div class="section">
+<h3>Advanced search features<a name="Advanced_search_features"></a></h3>
 <div class="section">
-<h5>Suggestions<a name="Suggestions"></a></h5>
+<h4>Suggestions<a name="Suggestions"></a></h4>
 <p><tt>@since Oak 1.1.17, 1.0.15</tt></p>
-<p>In order to use Lucene index to perform search suggestions, the index definition node (the one of type <tt>oak:QueryIndexDefinition</tt>)  needs to have the <tt>compatVersion</tt> set to <tt>2</tt>, then one or more property nodes, depending on use case, need to have the  property <tt>useForSuggest</tt> set to <tt>true</tt>, such setting controls from which properties terms to be used for suggestions will be taken.</p>
+<p>In order to use a Lucene index to perform search suggestions, the index definition node (the one of type <tt>oak:QueryIndexDefinition</tt>) needs to have the <tt>compatVersion</tt> set to <tt>2</tt>. Then one or more property nodes, depending on the use case, need to have the property <tt>useForSuggest</tt> set to <tt>true</tt>; this setting controls which properties supply the terms used for suggestions.</p>
 <p>Once the above configuration has been done, by default, the Lucene suggester is updated every 10 minutes but that can be changed by setting the property <tt>suggestUpdateFrequencyMinutes</tt> in the index definition node to a different value.</p>
 <p>Sample configuration for suggestions based on terms contained in <tt>jcr:description</tt> property.</p>
 
@@ -1162,9 +1210,9 @@ Copied 8.5 MB in 218.7 ms
           - useForSuggest = true
 </pre></div></div>
 <div class="section">
-<h5>Spellchecking<a name="Spellchecking"></a></h5>
+<h4>Spellchecking<a name="Spellchecking"></a></h4>
 <p><tt>@since Oak 1.1.17, 1.0.13</tt></p>
-<p>In order to use Lucene index to perform spellchecking, the index definition node (the one of type <tt>oak:QueryIndexDefinition</tt>)  needs to have the <tt>compatVersion</tt> set to <tt>2</tt>, then one or more property nodes, depending on use case, need to have the  property <tt>useForSpellcheck</tt> set to <tt>true</tt>, such setting controls from which properties terms to be used for spellcheck  corrections will be taken.</p>
+<p>In order to use a Lucene index to perform spellchecking, the index definition node (the one of type <tt>oak:QueryIndexDefinition</tt>) needs to have the <tt>compatVersion</tt> set to <tt>2</tt>. Then one or more property nodes, depending on the use case, need to have the property <tt>useForSpellcheck</tt> set to <tt>true</tt>; this setting controls which properties supply the terms used for spellcheck corrections.</p>
 <p>Sample configuration for spellchecking based on terms contained in <tt>jcr:title</tt> property.</p>
 
 <div class="source">
@@ -1182,7 +1230,7 @@ Copied 8.5 MB in 218.7 ms
           - propertyIndex = true
           - analyzed = true
           - useForSpellcheck = true
-</pre></div></div></div></div>
+</pre></div></div></div>
 <div class="section">
 <h3>Design Considerations<a name="Design_Considerations"></a></h3>
 <p>Lucene index provides quite a few features to meet various query requirements. While defining the index definition do consider the following aspects</p>

Added: jackrabbit/site/live/oak/docs/query/pre-extracted-text-osgi.png
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/pre-extracted-text-osgi.png?rev=1691130&view=auto
==============================================================================
Binary file - no diff available.

Propchange: jackrabbit/site/live/oak/docs/query/pre-extracted-text-osgi.png
------------------------------------------------------------------------------
    svn:mime-type = image/png