Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2017/07/25 08:55:00 UTC
svn commit: r1802901 - in
/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query: oak-run-indexing.md
pre-extract-text.md
Author: chetanm
Date: Tue Jul 25 08:55:00 2017
New Revision: 1802901
URL: http://svn.apache.org/viewvc?rev=1802901&view=rev
Log:
OAK-6370 - Improve documentation for text pre-extraction
Added toc and minor updates
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md?rev=1802901&r1=1802900&r2=1802901&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/oak-run-indexing.md Tue Jul 25 08:55:00 2017
@@ -305,12 +305,12 @@ PATH
URI
: Prefix the value with `uri:`
-: _"serverURI": "uri:http://foo"_
+: _"serverURI": "uri:http\://foo.example.com"_
BINARY
: By default the binary values are encoded as Base64 string if the binary is less than 1 MB size. The encoded value is
prefixed with `:blobId:`
-: _"jcr:data": ":blobId:axygz="_
+: _"jcr:data": ":blobId:axygz"_
### <a name="tika-setup"></a> Tika Setup
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md?rev=1802901&r1=1802900&r2=1802901&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md Tue Jul 25 08:55:00 2017
@@ -14,7 +14,16 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
-# Pre-Extracting Text from Binaries
+# <a name="pre-extract-text"></a>Pre-Extracting Text from Binaries
+
+* [Pre-Extracting Text from Binaries](#pre-extract-text)
+ * [A - Oak Run Pre-Extraction Command](#a-oak-run-command)
+ * [Step 1 - oak-run Setup](#a-setup)
+ * [Step 2 - Generate the csv file](#a-generate-csv)
+ * [Step 3 - Perform the text extraction](#a-perform-text-extraction)
+ * [B - PreExtractedTextProvider](#b-pre-extracted-text-provider)
+ * [Oak application](#b-oak-app)
+ * [Oak Run Indexing](#b-oak-run)
`@since Oak 1.0.18, 1.2.3`
@@ -30,16 +39,16 @@ This feature consist of 2 broad steps
For more details on this feature refer to [OAK-2892][OAK-2892]
-## A - Oak Run Pre-Extraction Command
+## <a name="a-oak-run-command"></a>A - Oak Run Pre-Extraction Command
Oak run tool provides a `tika` command which supports traversing the repository and then extracting text from the
binary properties.
-### Step 1 - oak-run Setup
+### <a name="a-setup"></a>Step 1 - oak-run Setup
Download following jars
-* oak-run 1.7.4
+* oak-run 1.7.4 [link][1]
Refer to [oak-run setup](../features/oak-run-nodestore-connection-options.html) for details about connecting to different
types of NodeStore. Example below assume a setup consisting of SegmentNodeStore and FileDataStore. Depending on setup
@@ -51,7 +60,11 @@ repository in read only mode and hence s
The generated extracted text dir can then be used with older setup.
-### Step 2 - Generate the csv file
+Of the following steps #2 i.e. generation of csv file scans the whole repository. Hence this step should be run
+when system is not in active use. Step #3 only requires access to BlobStore and hence can be run while Oak application
+is in use.
+
+### <a name="a-generate-csv"></a>Step 2 - Generate the csv file
As the first step you would need to generate a csv file which would contain details about the binary property.
This file would be generated by using the `tika` command from oak-run. In this step oak-run would connect to
@@ -81,7 +94,7 @@ This would generate a csv file with cont
By default it scans whole repository. If you need to restrict it to look up under certain path then specify the path via
`--path` option.
-### Step 3 - Perform the text extraction
+### <a name="a-perform-text-extraction"></a>Step 3 - Perform the text extraction
Once the csv file is generated we need to perform the text extraction. To do that we would need to download the
[tika-app](https://tika.apache.org/download.html) jar from Tika downloads. You should be able to use 1.15 version
@@ -117,29 +130,29 @@ Note that we need to launch the command
like tika-app. Also ensure that oak-run comes before in classpath. This is required due to some old classes being packaged
in tika-app
-## B - PreExtractedTextProvider
+## <a name="b-pre-extracted-text-provider"></a>B - PreExtractedTextProvider
In this step we would configure Oak to make use of the pre extracted text for the indexing. Depending on how
indexing is being performed you would configure the `PreExtractedTextProvider` either in OSGi or in oak-run index command
-### Oak application
+### <a name="b-oak-app"></a>Oak application
`@since Oak 1.0.18, 1.2.3`
For this look for OSGi config for `Apache Jackrabbit Oak DataStore PreExtractedTextProvider`
- ![OSGi Configuration](pre-extracted-text-osgi.png)
+![OSGi Configuration](pre-extracted-text-osgi.png)
Once `PreExtractedTextProvider` is configured then upon reindexing Lucene
indexer would make use of it to check if text needs to be extracted or not. Check
`TextExtractionStatsMBean` for various statistics around text extraction and also
to validate if `PreExtractedTextProvider` is being used.
-### Oak Run Indexing
+### <a name="b-oak-run"></a>Oak Run Indexing
Configure the directory storing pre extracted text via `--pre-extracted-text-dir` option in `index` command.
See [oak run indexing](oak-run-indexing.html)
-[oak-run-1.7.1]: https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.7.1/oak-run-1.7.1.jar
[OAK-2892]: https://issues.apache.org/jira/browse/OAK-2892
+[1]: https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.7.4/oak-run-1.7.4.jar