Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2017/06/19 11:26:54 UTC
svn commit: r1799181 -
/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
Author: chetanm
Date: Mon Jun 19 11:26:54 2017
New Revision: 1799181
URL: http://svn.apache.org/viewvc?rev=1799181&view=rev
Log:
OAK-301 : oak docu
Document pre extraction process in more details (wip)
Added:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
Added: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md?rev=1799181&view=auto
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md (added)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md Mon Jun 19 11:26:54 2017
@@ -0,0 +1,112 @@
+# Pre-Extracting Text from Binaries
+
+`@since Oak 1.0.18, 1.2.3`
+
+Lucene indexing is performed in a single-threaded mode.
+Extracting text from binaries is an expensive operation and slows down the indexing rate considerably.
+For incremental indexing this mostly works fine, but when reindexing or creating an index for the first time after
+a migration it increases the indexing time considerably.
+To speed up such cases, Oak supports pre-extracting text from binaries so that no text extraction is needed at indexing time.
+This feature consists of two broad steps:
+
+1. Extract text from binaries and store it using the oak-run tooling.
+2. Configure the Oak runtime to use the extracted text at indexing time via `PreExtractedTextProvider`.
+
+For more details on this feature, refer to [OAK-2892][OAK-2892].
+
+## A - Oak Run Pre-Extraction Command
+
+The oak-run tool provides a `tika` command which traverses the repository and extracts text from
+binary properties.
+
+### Step 1 - oak-run Setup
+
+Download the following jar:
+
+* oak-run 1.7.2
+
+Refer to [oak-run setup](../features/oak-run-nodestore-connection-options.md) for details about connecting to different
+types of NodeStore. The examples below assume a setup consisting of SegmentNodeStore and FileDataStore. Depending on your setup,
+use the appropriate connection options.
+
+You can use the current oak-run version to perform text extraction for older Oak setups, i.e. it's fine to use oak-run
+from the 1.7.x branch to connect to Oak repositories of version 1.0.x or later. The oak-run tooling connects to the
+repository in read-only mode and is hence safe to use with older versions.
+
+The generated extracted text directory can then be used with the older setup.
+
+### Step 2 - Generate the csv file
+
+As the first step, generate a csv file which contains details about the binary properties.
+This file is generated using the `tika` command from oak-run. In this step oak-run connects to the
+repository in read-only mode.
+
+To generate the csv file, use the `--generate` action:
+
+    java -jar oak-run.jar tika \
+        --fds-path /path/to/datastore \
+        --nodestore /path/to/segmentstore --data-file oak-binary-stats.csv --generate
+
+If connecting to S3 this command can take a long time, because checking the binary id currently triggers a download of the
+actual binary content, which is not required here. To speed this up, use the Fake DataStore support of oak-run:
+
+    java -jar oak-run.jar tika \
+        --fake-ds-path=temp \
+        --nodestore /path/to/segmentstore --data-file oak-binary-stats.csv --generate
+
+This generates a csv file with content like the following:
+
+```
+43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/activities/jcr:content/folderThumbnail/jcr:content"
+43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/snowboarding/jcr:content/folderThumbnail/jcr:content"
+...
+```
+
+### Step 3 - Perform the text extraction
+
+Once the csv file is generated, perform the text extraction. To do that, download the
+[tika-app](https://tika.apache.org/download.html) jar from the Tika downloads page. Version 1.15 should work
+with the Oak 1.7.2 jar.
+
+To perform the text extraction, use the `--extract` action:
+
+    java -cp tika-app-1.15.jar:oak-run.jar \
+        org.apache.jackrabbit.oak.run.Main tika \
+        --data-file oak-binary-stats.csv \
+        --store-path ./store \
+        --fds-path /path/to/datastore --extract
+
+This command does not require access to the NodeStore and only requires access to the BlobStore. So configure
+the BlobStore which is in use, like FileDataStore or S3DataStore. The above command performs text extraction
+using multiple threads and stores the extracted text in the directory specified by `--store-path`.
+
+Currently extracted text is stored as one file per blob, in the same format as the one used by `FileDataStore`.
+
+Note that we need to launch the command with `-cp` instead of `-jar` as we need to include classes from outside the oak-run jar,
+such as tika-app.
+
+## B - PreExtractedTextProvider
+
+In this step we configure Oak to make use of the pre-extracted text for indexing. Depending on how
+indexing is performed, configure the `PreExtractedTextProvider` either in OSGi or in the oak-run index command.
+
+### Oak application
+
+`@since Oak 1.0.18, 1.2.3`
+
+For this, look for the OSGi config named `Apache Jackrabbit Oak DataStore PreExtractedTextProvider`.
+
+ ![OSGi Configuration](pre-extracted-text-osgi.png)
+
+Once `PreExtractedTextProvider` is configured, upon reindexing the Lucene
+indexer uses it to check whether text needs to be extracted or not. Check
+`TextExtractionStatsMBean` for various statistics around text extraction and also
+to validate that `PreExtractedTextProvider` is being used.
+
+### Oak Run Indexing
+
+<<TBD>>
+
+
+[oak-run-1.7.1]: https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.7.1/oak-run-1.7.1.jar
+[OAK-2892]: https://issues.apache.org/jira/browse/OAK-2892