Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2015/07/15 08:31:55 UTC

svn commit: r1691129 - in /jackrabbit/oak/trunk/oak-doc/src/site/markdown/query: lucene.md pre-extracted-text-osgi.png

Author: chetanm
Date: Wed Jul 15 06:31:54 2015
New Revision: 1691129

URL: http://svn.apache.org/r1691129
Log:
OAK-2892 - Speed up lucene indexing post migration by pre extracting the text content from binaries

Update the docs

Added:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extracted-text-osgi.png   (with props)
Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md?rev=1691129&r1=1691128&r2=1691129&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/lucene.md Wed Jul 15 06:31:54 2015
@@ -375,7 +375,7 @@ size. Refer to [OAK-2306][OAK-2306] for
 <a name="include-exclude"></a>
 ##### Include and Exclude paths from indexing
 
-_Since 1.0.14+ and 1.2.3+_
+`@since Oak 1.0.14, 1.2.3`
 
 By default the indexer would index all the nodes under the subtree where the 
 index  definition is defined as per the indexingRule. In some cases its required
@@ -503,10 +503,9 @@ defaults to 5
             - path = "renditions/original"
             - relativeNode = true
 
-#### Analyzers (1.1.6)
+#### Analyzers
 
-_This feature is currently not part of 1.0 branch and is only present in unstable
-1.x releases_
+`@since Oak 1.2.0`
 
 Analyzers can be configured as part of index definition via `analyzers` node.
 The default analyzer can be configured via `analyzers/default` node
@@ -620,7 +619,9 @@ debug
 : Boolean value. Defaults to `false`
 : If enabled then Lucene logging would be integrated with Slf4j
 
-### Tika Config (1.0.12)
+### Tika Config
+
+`@since Oak 1.0.12, 1.2.3`
 
 Oak Lucene uses [Apache Tika][tika] to extract the text from binary content
 
@@ -727,7 +728,7 @@ _With Oak 1.0.13 this feature is now ena
 <a name="copy-on-write"></a>
 ### CopyOnWrite
 
-_Since 1.0.15 and 1.2.3_
+`@since Oak 1.0.15, 1.2.3`
 
 Similar to _CopyOnRead_ feature Oak Lucene also supports _CopyOnWrite_ to enable
 faster indexing by first buffering the writes to local filesystem and transferring
@@ -797,6 +798,56 @@ mentioned steps
         
 From the Luke UI shown you can access various details.
 
+<a name="text-extraction"></a>
+### Pre-Extracting Text from Binaries
+
+`@since Oak 1.0.18, 1.2.3`
+
+Lucene indexing is performed in a single-threaded mode. Extracting text from
+binaries is an expensive operation and slows down the indexing rate considerably.
+For incremental indexing this mostly works fine, but when performing a reindex
+or creating the index for the first time after migration it increases the
+indexing time considerably.
+
+To speed up the Lucene indexing for such cases i.e. reindexing, we can decouple 
+the text extraction from actual indexing. 
+
+1. Extract and store the extracted text from binaries via [oak-run tool][oak-run-tika]
+2. Configure a `PreExtractedTextProvider` which can look up extracted text and
+   thus avoid text extraction at time of actual indexing
+   
+Below are the details of the steps required to make use of this feature
+
+1. Generate the csv file containing binary file details
+
+        java -cp tika-app-1.8.jar:oak-run.jar \
+        org.apache.jackrabbit.oak.run.Main tika \
+        --fds-path /path/to/datastore \
+        --nodestore /path/to/segmentstore --data-file dump.csv generate
+
+2. Extract the text 
+
+        java -cp tika-app-1.8.jar:oak-run.jar \
+        org.apache.jackrabbit.oak.run.Main tika \
+        --data-file dump.csv \
+        --store-path ./store \
+        --fds-path /path/to/datastore  extract
+
+3.  Configure the `PreExtractedTextProvider` - Once the extraction is performed,
+    configure a `PreExtractedTextProvider` within the application so that the Lucene
+    indexer can use it to look up extracted text.
+
+    For this, look for the OSGi config `Apache Jackrabbit Oak DataStore PreExtractedTextProvider`
+        
+    ![OSGi Configuration](pre-extracted-text-osgi.png)   
+   
+Once `PreExtractedTextProvider` is configured, the Lucene indexer uses it upon
+reindexing to check whether text needs to be extracted. Check
+`TextExtractionStatsMBean` for various statistics around text extraction and also
+to validate that `PreExtractedTextProvider` is being used.
+
+For more details on this feature refer to [OAK-2892][OAK-2892]
+
 ### Advanced search features
 
 #### Suggestions
@@ -1276,10 +1327,13 @@ such fields
 [OAK-2599]: https://issues.apache.org/jira/browse/OAK-2599
 [OAK-2247]: https://issues.apache.org/jira/browse/OAK-2247
 [OAK-2853]: https://issues.apache.org/jira/browse/OAK-2853
+[OAK-2892]: https://issues.apache.org/jira/browse/OAK-2892
 [luke]: https://code.google.com/p/luke/
 [tika]: http://tika.apache.org/
 [oak-console]: https://github.com/apache/jackrabbit-oak/tree/trunk/oak-run#console
 [JCR-2989]: https://issues.apache.org/jira/browse/JCR-2989?focusedCommentId=13051101
 [solr-analyzer]: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
 [default-config]: https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/resources/org/apache/jackrabbit/oak/plugins/index/lucene/tika-config.xml
-[lucene-codec]: https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/codecs/Codec.html
\ No newline at end of file
+[lucene-codec]: https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/codecs/Codec.html
+[tika-download]: https://tika.apache.org/download.html
+[oak-run-tika]: https://github.com/apache/jackrabbit-oak/tree/trunk/oak-run#tika
\ No newline at end of file

Added: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extracted-text-osgi.png
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extracted-text-osgi.png?rev=1691129&view=auto
==============================================================================
Binary file - no diff available.

Propchange: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extracted-text-osgi.png
------------------------------------------------------------------------------
    svn:mime-type = image/png
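
Note on the two oak-run steps documented in the lucene.md change: the `extract` step has to read the CSV that the `generate` step wrote. A minimal sketch of a wrapper that keeps both invocations on the same file is given here. The function names (`generate_cmd`, `extract_cmd`) and the environment variables are hypothetical and not part of oak-run; the jar names, main class, flags and placeholder paths are taken as-is from the documented commands. The script only prints the commands and runs nothing.

```shell
#!/bin/sh
# Hypothetical helper (not part of oak-run): prints the two "tika" commands
# so that both steps share one CSV file. Nothing is executed here.

CP="tika-app-1.8.jar:oak-run.jar"           # jar names from the documented commands
MAIN="org.apache.jackrabbit.oak.run.Main"   # main class from the documented commands
DATA_FILE="${DATA_FILE:-dump.csv}"          # assumption: one CSV shared by both steps
FDS_PATH="${FDS_PATH:-/path/to/datastore}"  # placeholder path, adjust to your setup
NODESTORE="${NODESTORE:-/path/to/segmentstore}"
STORE_PATH="${STORE_PATH:-./store}"

# Step 1: generate the CSV file containing the binary file details.
generate_cmd() {
    echo "java -cp $CP $MAIN tika --fds-path $FDS_PATH --nodestore $NODESTORE --data-file $DATA_FILE generate"
}

# Step 2: extract the text for the binaries listed in that same CSV.
extract_cmd() {
    echo "java -cp $CP $MAIN tika --data-file $DATA_FILE --store-path $STORE_PATH --fds-path $FDS_PATH extract"
}

generate_cmd
extract_cmd
```

Review the printed commands first, then run them by hand in order; overriding `DATA_FILE`, `FDS_PATH`, `NODESTORE` or `STORE_PATH` in the environment changes both commands consistently.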