Posted to commits@jackrabbit.apache.org by ca...@apache.org on 2018/05/25 10:55:36 UTC

svn commit: r1832233 - /jackrabbit/site/live/oak/docs/query/pre-extract-text.html

Author: catholicon
Date: Fri May 25 10:55:36 2018
New Revision: 1832233

URL: http://svn.apache.org/viewvc?rev=1832233&view=rev
Log:
OAK-301: Oak Documentation

Publish documentation for OAK-7353


Modified:
    jackrabbit/site/live/oak/docs/query/pre-extract-text.html

Modified: jackrabbit/site/live/oak/docs/query/pre-extract-text.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/pre-extract-text.html?rev=1832233&r1=1832232&r2=1832233&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/query/pre-extract-text.html (original)
+++ jackrabbit/site/live/oak/docs/query/pre-extract-text.html Fri May 25 10:55:36 2018
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2018-05-24 
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2018-05-25 
  | Rendered using Apache Maven Fluido Skin 1.6
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20180524" />
+    <meta name="Date-Revision-yyyymmdd" content="20180525" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak &#x2013; <a name="pre-extract-text"></a>Pre-Extracting Text from Binaries</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
@@ -136,7 +136,7 @@
 
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2018-05-24<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2018-05-25<span class="divider">|</span>
 </li>
           <li id="projectVersion">Version: 1.10-SNAPSHOT</li>
         </ul>
@@ -256,7 +256,14 @@
       
 <li><a href="#a-generate-csv">Step 2 - Generate the csv file</a></li>
       
-<li><a href="#a-perform-text-extraction">Step 3 - Perform the text extraction</a></li>
+<li><a href="#a-perform-text-extraction">Step 3 - Perform the text extraction</a>
+      
+<ul>
+        
+<li><a href="#a-tika-text-extraction">1. using tika</a></li>
+        
+<li><a href="#a-index-text-extraction">2. using dumped indexed data</a></li>
+      </ul></li>
     </ul></li>
     
 <li><a href="#b-pre-extracted-text-provider">B - PreExtractedTextProvider</a>
@@ -321,7 +328,27 @@
 <p>By default it scans whole repository. If you need to restrict it to look up under certain path then specify the path via <tt>--path</tt> option.</p></div>
 <div class="section">
 <h3><a name="Step_3_-_Perform_the_text_extraction"></a><a name="a-perform-text-extraction"></a>Step 3 - Perform the text extraction</h3>
-<p>Once the csv file is generated we need to perform the text extraction. To do that we would need to download the <a class="externalLink" href="https://tika.apache.org/download.html">tika-app</a> jar from Tika downloads. You should be able to use 1.15 version with Oak 1.7.4 jar.</p>
+<p>Once the csv file is generated we need to perform the text extraction.</p>
+<p>Currently extracted text files are stored as files per blob in a format which is same one used with <tt>FileDataStore</tt> In addition to that it creates 2 files</p>
+
+<ul>
+  
+<li>blobs_error.txt - File containing blobIds for which text extraction ended in error</li>
+  
+<li>blobs_empty.txt - File containing blobIds for which no text was extracted</li>
+</ul>
+<p>This phase is incremental i.e. if run multiple times and same <tt>--store-path</tt> is specified then it would avoid extracting text from previously processed binaries.</p>
+<p>There are 2 ways of doing this:</p>
+
+<ol style="list-style-type: decimal">
+  
+<li>Do text extraction using tika</li>
+  
+<li>Use a suitable lucene index to get text extraction data from index itself which would have been generated earlier</li>
+</ol>
+<div class="section">
+<h4><a name="Step_3.1_-_Text_extraction_using_tika"></a><a name="a-tika-text-extraction"></a>Step 3.1 - Text extraction using tika</h4>
+<p>To do that we would need to download the <a class="externalLink" href="https://tika.apache.org/download.html">tika-app</a> jar from Tika downloads. You should be able to use 1.15 version with Oak 1.7.4 jar.</p>
 <p>To perform the text extraction use the <tt>--extract</tt> action</p>
 
 <div class="source">
@@ -332,17 +359,94 @@
     --fds-path /path/to/datastore  extract
 </pre></div></div>
 <p>This command does not require access to NodeStore and only requires access to the BlobStore. So configure the BlobStore which is in use like FileDataStore or S3DataStore. Above command would do text extraction using multiple threads and store the extracted text in directory specified by <tt>--store-path</tt>. </p>
-<p>Currently extracted text files are stored as files per blob in a format which is same one used with <tt>FileDataStore</tt> In addition to that it creates 2 files</p>
+<p>Consequently, this can be run from a different machine (possibly more powerful to allow use of multiple cores) to speed up text extraction. One can also split the csv into multiple chunks and process them on different machines and then merge the stores later. Just ensure that at merge time blobs*.txt files are also merged</p>
+<p>Note that we need to launch the command with <tt>-cp</tt> instead of <tt>-jar</tt> as we need to include classes outside of oak-run jar like tika-app. Also ensure that oak-run comes before in classpath. This is required due to some old classes being packaged in tika-app </p></div>
+<div class="section">
+<h4><a name="a3.2_-_Populate_text_extraction_store_using_already_indexed_data"></a><a name="a-index-text-extraction"></a> 3.2 - Populate text extraction store using already indexed data</h4>
+<p><tt>@since Oak 1.9.3</tt></p>
+<p>This approach has some prerequisites to be consistent and useful:</p>
+<div class="section">
+<h5><a name="Consistency_between_indexed_data_and_csv_generated_in_Step_2_above"></a>Consistency between indexed data and csv generated in <a href="#a-generate-csv">Step 2</a> above</h5>
+<p><b>NOTE</b>: This is <b><i>very</i></b> important and not making sure of this can lead to incorrectly populating text extraction store.</p>
+<p>Make sure that no useful binaries are added to the repository between the step that dumped indexed data and the one used for <a href="#a-generate-csv">generating binary stats csv</a></p></div>
+<div class="section">
+<h5><a name="Suitability_of_index_used_for_populating_extracted_text_store"></a>Suitability of index used for populating extracted text store</h5>
+<p>Indexes which index binaries are obvious candidates to be consumed in this way. But there are few more constraints that the definition needs to adhere to:</p>
 
 <ul>
   
-<li>blobs_error.txt - File containing blobIds for which text extraction ended in error</li>
+<li>it should index binary on the same path where binary exists (binary must not be on a relative path)</li>
   
-<li>blobs_empty.txt - File containing blobIds for which no text was extracted</li>
+<li>it should not index multiple binaries on the indexed path
+  
+<ul>
+    
+<li>IOW, multiple non-relative property definitions don&#x2019;t match and index binaries</li>
+  </ul></li>
 </ul>
-<p>This phase is incremental i.e. if run multiple times and same <tt>--store-path</tt> is specified then it would avoid extracting text from previously processed binaries.</p>
-<p>Further the <tt>extract</tt> phase only needs access to <tt>BlobStore</tt> and does not require access to NodeStore. So this can be run from a different machine (possibly more powerful to allow use of multiple cores) to speed up text extraction. One can also split the csv into multiple chunks and process them on different machines and then merge the stores later. Just ensure that at merge time blobs*.txt files are also merged</p>
-<p>Note that we need to launch the command with <tt>-cp</tt> instead of <tt>-jar</tt> as we need to include classes outside of oak-run jar like tika-app. Also ensure that oak-run comes before in classpath. This is required due to some old classes being packaged in tika-app </p></div></div>
+<p>Example of usable index definitions</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">    + /oak:index/usableIndex1
+      ...
+      + indexRules
+        ...
+        + nt:resource
+          + properties
+            ...
+            + binary
+              - name=&quot;jcr:data&quot;
+              - nodeScopeIndex=true
+
+    + /oak:index/usableIndex2
+      ...
+      + indexRules
+        ...
+        + nt:resource
+          + properties
+            ...
+            + binary
+              - name=&quot;^[^\/]*$&quot;
+              - isRegexp=true
+              - nodeScopeIndex=true
+</pre></div></div>
+<p>Examples of unusable index definitions</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">    + /oak:index/unUsableIndex1
+      ...
+      + indexRules
+        ...
+        + nt:file
+          + properties
+            ...
+            + binary
+              - name=&quot;jcr:content/jcr:data&quot;
+              - nodeScopeIndex=true
+
+    + /oak:index/unUsableIndex2
+      ...
+      + aggregates
+        ...
+        + nt:file
+          ...
+          + include0
+            - path=&quot;jcr:content&quot;
+</pre></div></div>
+<p>With those pre-requisites mentioned, let&#x2019;s dive into how to use this.</p>
+<p>We&#x2019;d first need to dump index data from a suitable index (say <tt>/oak:index/suitableIndexDef</tt>) using <a href="oak-run-indexing.html#async-index-data">dump index</a> method at say <tt>/path/to/index/dump</tt></p>
+<p>Then use <tt>--populate</tt> action to populate extracted text store using a dump of usable indexed data. The command would look something like:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">    java -jar oak-run.jar tika \
+    --data-file binary-stats.csv \
+    --store-path ./store  \
+    --index-dir /path/to/index/dump/index-dumps/suitableIndexDef/data populate
+</pre></div></div>
+<p>This command doesn&#x2019;t need to connect to either node store or blob store, so we don&#x2019;t need to configure it in the execution.</p>
+<p>This command would update <tt>blobs_empty.txt</tt> if indexed data for a given path is empty.</p>
+<p>It would also update <tt>blobs_error.txt</tt> if indexed data for a given path has indexed special value <tt>TextExtractionError</tt>.</p>
+<p>For other cases (multiple or none stored <tt>:fulltext</tt> fields for a given path) output of the command would report them as errors but they won&#x2019;t be recorded in <tt>blobs_error.txt</tt>.</p></div></div></div></div>
 <div class="section">
 <h2><a name="B_-_PreExtractedTextProvider"></a><a name="b-pre-extracted-text-provider"></a>B - PreExtractedTextProvider</h2>
 <p>In this step we would configure Oak to make use of the pre extracted text for the indexing. Depending on how indexing is being performed you would configure the <tt>PreExtractedTextProvider</tt> either in OSGi or in oak-run index command</p>