Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2017/07/25 08:55:00 UTC

svn commit: r1802901 - in /jackrabbit/oak/trunk/oak-doc/src/site/markdown/query: oak-run-indexing.md pre-extract-text.md

Author: chetanm
Date: Tue Jul 25 08:55:00 2017
New Revision: 1802901

URL: http://svn.apache.org/viewvc?rev=1802901&view=rev
Log:
OAK-6370 - Improve documentation for text pre-extraction

Added toc and minor updates

Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md?rev=1802901&r1=1802900&r2=1802901&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md Tue Jul 25 08:55:00 2017
@@ -305,12 +305,12 @@ PATH
 
 URI
 : Prefix the value with `uri:`
-: _"serverURI": "uri:http://foo"_  
+: _"serverURI": "uri:http\://foo.example.com"_  
 
 BINARY
 : By default the binary values are encoded as Base64 string if the binary is less than 1 MB size. The encoded value is 
   prefixed with `:blobId:`
-: _"jcr:data": ":blobId:axygz="_  
+: _"jcr:data": ":blobId:axygz"_  
 
 
 ### <a name="tika-setup"></a> Tika Setup

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md?rev=1802901&r1=1802900&r2=1802901&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md Tue Jul 25 08:55:00 2017
@@ -14,7 +14,16 @@
    See the License for the specific language governing permissions and
    limitations under the License.
   -->
-# Pre-Extracting Text from Binaries
+# <a name="pre-extract-text"></a>Pre-Extracting Text from Binaries
+
+* [Pre-Extracting Text from Binaries](#pre-extract-text)
+    * [A - Oak Run Pre-Extraction Command](#a-oak-run-command)
+        * [Step 1 - oak-run Setup](#a-setup)
+        * [Step 2 - Generate the csv file](#a-generate-csv)
+        * [Step 3 - Perform the text extraction](#a-perform-text-extraction)
+    * [B - PreExtractedTextProvider](#b-pre-extracted-text-provider)
+        * [Oak application](#b-oak-app)
+        * [Oak Run Indexing](#b-oak-run)
 
 `@since Oak 1.0.18, 1.2.3`
 
@@ -30,16 +39,16 @@ This feature consist of 2 broad steps
 
 For more details on this feature refer to [OAK-2892][OAK-2892]
 
-## A - Oak Run Pre-Extraction Command
+## <a name="a-oak-run-command"></a>A - Oak Run Pre-Extraction Command
 
 Oak run tool provides a `tika` command which supports traversing the repository and then extracting text from the 
 binary properties. 
 
-### Step 1 - oak-run Setup
+### <a name="a-setup"></a>Step 1 - oak-run Setup
 
 Download following jars
 
-* oak-run 1.7.4 
+* oak-run 1.7.4 [link][1]
 
 Refer to [oak-run setup](../features/oak-run-nodestore-connection-options.html) for details about connecting to different 
 types of NodeStore. Example below assume a setup consisting of SegmentNodeStore and FileDataStore. Depending on setup
@@ -51,7 +60,11 @@ repository in read only mode and hence s
 
 The generated extracted text dir can then be used with older setup.
 
-### Step 2 - Generate the csv file
+Of the following steps, Step 2 (generation of the csv file) scans the whole repository, so it should be run
+when the system is not in active use. Step 3 only requires access to the BlobStore and hence can be run while the Oak
+application is in use.
+
+### <a name="a-generate-csv"></a>Step 2 - Generate the csv file
 
 As the first step you would need to generate a csv file which would contain details about the binary property.
 This file would be generated by using the `tika` command from oak-run. In this step oak-run would connect to 
@@ -81,7 +94,7 @@ This would generate a csv file with cont
 By default it scans whole repository. If you need to restrict it to look up under certain path then specify the path via 
 `--path` option.
 
-### Step 3 - Perform the text extraction
+### <a name="a-perform-text-extraction"></a>Step 3 - Perform the text extraction
 
 Once the csv file is generated we need to perform the text extraction. To do that we would need to download the 
 [tika-app](https://tika.apache.org/download.html) jar from Tika downloads. You should be able to use 1.15 version
@@ -117,29 +130,29 @@ Note that we need to launch the command
 like tika-app. Also ensure that oak-run comes before in classpath. This is required due to some old classes being packaged 
 in tika-app 
 
-## B - PreExtractedTextProvider
+## <a name="b-pre-extracted-text-provider"></a>B - PreExtractedTextProvider
 
 In this step we would configure Oak to make use of the pre extracted text for the indexing. Depending on how 
 indexing is being performed you would configure the `PreExtractedTextProvider` either in OSGi or in oak-run index command
 
-### Oak application
+### <a name="b-oak-app"></a>Oak application
 
 `@since Oak 1.0.18, 1.2.3`
 
 For this look for OSGi config for `Apache Jackrabbit Oak DataStore PreExtractedTextProvider`
         
-    ![OSGi Configuration](pre-extracted-text-osgi.png)   
+![OSGi Configuration](pre-extracted-text-osgi.png)   
    
 Once `PreExtractedTextProvider` is configured then upon reindexing Lucene
 indexer would make use of it to check if text needs to be extracted or not. Check
 `TextExtractionStatsMBean` for various statistics around text extraction and also
 to validate if `PreExtractedTextProvider` is being used.
 
-### Oak Run Indexing
+### <a name="b-oak-run"></a>Oak Run Indexing
 
 Configure the directory storing pre extracted text via `--pre-extracted-text-dir` option in `index` command.
 See [oak run indexing](oak-run-indexing.html)
 
 
-[oak-run-1.7.1]: https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.7.1/oak-run-1.7.1.jar
 [OAK-2892]: https://issues.apache.org/jira/browse/OAK-2892
+[1]: https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.7.4/oak-run-1.7.4.jar