Posted to commits@jackrabbit.apache.org by ch...@apache.org on 2017/07/18 05:15:11 UTC
svn commit: r1802238 - in /jackrabbit/site/live/oak/docs/query: indexing.html lucene.html

Author: chetanm
Date: Tue Jul 18 05:15:10 2017
New Revision: 1802238

URL: http://svn.apache.org/viewvc?rev=1802238&view=rev
Log:
Updated to refer to new pre-extraction links

Modified:
    jackrabbit/site/live/oak/docs/query/indexing.html
    jackrabbit/site/live/oak/docs/query/lucene.html

Modified: jackrabbit/site/live/oak/docs/query/indexing.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/indexing.html?rev=1802238&r1=1802237&r2=1802238&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/query/indexing.html (original)
+++ jackrabbit/site/live/oak/docs/query/indexing.html Tue Jul 18 05:15:10 2017
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-06-23 
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-17 
  | Rendered using Apache Maven Fluido Skin 1.6
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20170623" />
+    <meta name="Date-Revision-yyyymmdd" content="20170717" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak &#x2013; Indexing</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
@@ -131,7 +131,7 @@
 
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2017-06-23<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2017-07-17<span class="divider">|</span>
 </li>
           <li id="projectVersion">Version: 1.8-SNAPSHOT</li>
         </ul>
@@ -301,7 +301,14 @@
       </ul></li>
     </ul></li>
     
-<li><a href="#reindexing">Reindexing</a></li>
+<li><a href="#reindexing">Reindexing</a>
+    
+<ul>
+      
+<li><a href="#reduce-reindexing-times">Reducing reindexing times</a></li>
+      
+<li><a href="#abort-reindex">How to Abort Reindexing</a></li>
+    </ul></li>
   </ul></li>
 </ul>
 <div class="section">
@@ -649,7 +656,10 @@ Removing corrupt flag from index [/oak:i
 </pre></div></div>
 <p>Once reindexing is complete, the <tt>reindex</tt> flag is set to <tt>false</tt> automatically.</p>
 <div class="section">
-<h3><a name="How_to_Abort_Reindexing"></a>How to Abort Reindexing</h3>
+<h3><a name="Reducing_reindexing_times"></a><a name="reduce-reindexing-times"></a> Reducing reindexing times</h3>
+<p>If the index being reindexed has full text extraction configured, reindexing can take a long time because most of the time is spent in text extraction. For such cases it is recommended to use the text <a href="pre-extract-text.html">pre-extraction support</a>. Text pre-extraction can be done before starting the actual reindexing. This ensures that no time is spent on text extraction during reindexing, so the time taken to reindex such an index is reduced considerably.</p></div>
+<div class="section">
+<h3><a name="How_to_Abort_Reindexing"></a><a name="abort-reindex"></a> How to Abort Reindexing</h3>
 <p>Building an index can be slow. It can be aborted (stopped before it is finished), for example if you detect there is an error in the index definition. Reindexing and building a new index can be aborted when using asynchronous indexes. For synchronous indexes, it can be stopped if it was started using the <tt>PropertyIndexAsyncReindexMBean</tt>. To do this, use the respective <tt>IndexStats</tt> JMX bean (for example, <tt>async</tt>, <tt>fulltext-async</tt>, or <tt>async-reindex</tt>), and call the operation <tt>abortAndPause()</tt>. Then, either set the <tt>reindex</tt> flag to <tt>false</tt> (for an existing index), remove the index definition (for a new index), or change the index type to <tt>disabled</tt>. Store the change. Finally, call the operation <tt>resume()</tt> so that regular indexing operations can continue.</p></div></div>
         </div>
       </div>
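
Note on the "How to Abort Reindexing" text above: the abort sequence it describes (call abortAndPause(), store the index definition change, call resume()) can be driven over JMX. The following is a minimal sketch, not Oak's own code. It assumes a hypothetical stub bean (StubIndexStats) and a made-up object name (example.oak:type=IndexStats,name=async) so that it runs without a repository; only the operation names abortAndPause and resume come from the documentation. Against a real instance, the actual IndexStats bean name would need to be looked up in a JMX console.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class AbortReindexSketch {

    /** Hypothetical stand-in exposing only the two operations named in the docs. */
    public interface IndexStatsOps {
        void abortAndPause();
        void resume();
    }

    /** Stub that records its state so the sketch runs without an Oak repository. */
    public static class StubIndexStats implements IndexStatsOps {
        private volatile boolean paused;
        @Override public void abortAndPause() { paused = true; }
        @Override public void resume() { paused = false; }
        public boolean isPaused() { return paused; }
    }

    /** Invokes a no-argument JMX operation by name. */
    static void call(MBeanServer server, ObjectName name, String op) throws Exception {
        server.invoke(name, op, new Object[0], new String[0]);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Assumed object name for illustration only; not taken from Oak.
        ObjectName name = new ObjectName("example.oak:type=IndexStats,name=async");
        StubIndexStats stub = new StubIndexStats();
        server.registerMBean(new StandardMBean(stub, IndexStatsOps.class), name);

        // Step 1: stop the running indexing cycle.
        call(server, name, "abortAndPause");
        System.out.println("paused after abortAndPause: " + stub.isPaused());

        // Step 2 (not shown): set reindex=false, remove the index definition,
        // or set the index type to disabled, then store the change.

        // Step 3: let regular indexing continue.
        call(server, name, "resume");
        System.out.println("paused after resume: " + stub.isPaused());

        server.unregisterMBean(name);
    }
}
```

The stub is wrapped in a StandardMBean so that the interface does not have to follow the *MBean naming convention; a remote connection (JMXConnector) would use the same invoke call on its MBeanServerConnection.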

Modified: jackrabbit/site/live/oak/docs/query/lucene.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/lucene.html?rev=1802238&r1=1802237&r2=1802238&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/query/lucene.html (original)
+++ jackrabbit/site/live/oak/docs/query/lucene.html Tue Jul 18 05:15:10 2017
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-03 
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-17 
  | Rendered using Apache Maven Fluido Skin 1.6
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20170703" />
+    <meta name="Date-Revision-yyyymmdd" content="20170717" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak &#x2013; Lucene Index</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
@@ -131,7 +131,7 @@
 
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2017-07-03<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2017-07-17<span class="divider">|</span>
 </li>
           <li id="projectVersion">Version: 1.8-SNAPSHOT</li>
         </ul>
@@ -568,6 +568,7 @@
   - notNullCheckEnabled (boolean) = false
   - nullCheckEnabled (boolean) = false
   - excludeFromAggregation (boolean) = false
+  - weight (long) = -1
 </pre></div></div>
 <p>Following are the details about the above mentioned config options which can be defined at the property definition level</p>
 
@@ -655,7 +656,13 @@
 <dt>excludeFromAggregation</dt>
 <dd>Since 1.0.27, 1.2.11</dd>
 <dd>if set to true the property would be excluded from aggregation <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-3981">OAK-3981</a></dd>
+<dt><a name="weight"></a></dt>
+<dt>weight</dt>
+<dd>Since 1.6.3</dd>
+<dd>At times, property definitions are added to support dense results right out of the index (e.g. <tt>contains(*, 'foo') AND [bar]='baz'</tt>). In such cases, the added property definition &#x201c;might&#x201d; not be the best one to answer queries which have only the property restriction (e.g. only <tt>[bar]='baz'</tt>). This can happen when that index specifies some exclude paths and hence does not index all <tt>bar</tt> properties.</dd>
 </dl>
+<p>In such cases, set <tt>weight</tt> to <tt>0</tt> for those properties. The IndexPlanner then does not use those property definitions to determine whether that index can answer the query, but it still uses them if some other index entry causes that index to be selected for evaluating such a query.</p>
+<p>Refer to <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-5899">OAK-5899</a> for more details.</p>
 <p><a name="property-names"></a><b>Property Names</b></p>
 <p>Property name can be one of following</p>
 
@@ -1178,48 +1185,7 @@ Copied 8.5 MB in 218.7 ms
 <p>From the Luke UI shown you can access various details.</p></div>
 <div class="section">
 <h3><a name="Pre-Extracting_Text_from_Binaries"></a><a name="text-extraction"></a>Pre-Extracting Text from Binaries</h3>
-<p><tt>@since Oak 1.0.18, 1.2.3</tt></p>
-<p>Lucene indexing is performed in a single threaded mode. Extracting text from binaries is an expensive operation and slows down the indexing rate considerably. For incremental indexing this mostly works fine but if performing a reindex or creating the index for the first time after migration then it increases the indexing time considerably. </p>
-<p>To speed up the Lucene indexing for such cases i.e. reindexing, we can decouple the text extraction from actual indexing. </p>
-
-<ol style="list-style-type: decimal">
-  
-<li>Extract and store the extracted text from binaries via <a class="externalLink" href="https://github.com/apache/jackrabbit-oak/tree/trunk/oak-run#tika">oak-run tool</a></li>
-  
-<li>Configure a <tt>PreExtractedTextProvider</tt> which can lookup extracted text and  thus avoid text extraction at time of actual indexing</li>
-</ol>
-<p>Below are the steps required to make use of this feature</p>
-
-<ol style="list-style-type: decimal">
-  
-<li>
-<p>Generate the csv file containing binary file details</p>
-  
-<div class="source">
-<div class="source"><pre class="prettyprint">java -cp tika-app-1.8.jar:oak-run.jar \
-org.apache.jackrabbit.oak.run.Main tika \
---fds-path /path/to/datastore \
---nodestore /path/to/segmentstore --data-file dump.csv generate
-</pre></div></div></li>
-  
-<li>
-<p>Extract the text </p>
-  
-<div class="source">
-<div class="source"><pre class="prettyprint">java -cp tika-app-1.8.jar:oak-run.jar \
-org.apache.jackrabbit.oak.run.Main tika \
---data-file binary-stats.csv \
---store-path ./store \
---fds-path /path/to/datastore  extract
-</pre></div></div></li>
-  
-<li>
-<p>Configure the <tt>PreExtractedTextProvider</tt> - Once the extraction is performed configure a <tt>PreExtractedTextProvider</tt> within the application such that Lucene indexer can make use of that to lookup extracted text. </p>
-<p>For this look for OSGi config for <tt>Apache Jackrabbit Oak DataStore PreExtractedTextProvider</tt></p>
-<p><img src="pre-extracted-text-osgi.png" alt="OSGi Configuration" /> </p></li>
-</ol>
-<p>Once <tt>PreExtractedTextProvider</tt> is configured then upon reindexing Lucene indexer would make use of it to check if text needs to be extracted or not. Check <tt>TextExtractionStatsMBean</tt> for various statistics around text extraction and also to validate if <tt>PreExtractedTextProvider</tt> is being used.</p>
-<p>For more details on this feature refer to <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-2892">OAK-2892</a></p></div>
+<p>Refer to <a href="pre-extract-text.html">pre-extraction via oak-run</a>.</p></div>
 <div class="section">
 <h3><a name="Advanced_search_features"></a><a name="advanced-search-features"></a>Advanced search features</h3>
 <div class="section">