Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2017/06/20 05:42:09 UTC
svn commit: r1799301 -
/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
Author: chetanm
Date: Tue Jun 20 05:42:09 2017
New Revision: 1799301
URL: http://svn.apache.org/viewvc?rev=1799301&view=rev
Log:
OAK-6370 - Improve documentation for text pre-extraction
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md?rev=1799301&r1=1799300&r2=1799301&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/pre-extract-text.md Tue Jun 20 05:42:09 2017
@@ -62,6 +62,9 @@ This would generate a csv file with cont
...
```
+By default it scans the whole repository. To restrict the scan to a certain path, specify that path via the
+`--path` option.
+
### Step 3 - Perform the text extraction
Once the csv file is generated we need to perform the text extraction. To do that we would need to download the
@@ -81,6 +84,18 @@ the BlobStore which is in use like FileD
using multiple threads and store the extracted text in the directory specified by `--store-path`.
Currently extracted text files are stored as one file per blob, in the same format as the one used by `FileDataStore`.
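As an illustration of that layout, here is a minimal shell sketch. Note the three-level, two-character directory nesting is an assumption about the `FileDataStore` scheme, not something stated in this document:

```shell
# Hypothetical sketch: map a blobId to a FileDataStore-style nested path.
# The 2-char / 3-level nesting is an assumed layout, for illustration only.
id="c0ffee1234deadbeef"
path="${id:0:2}/${id:2:2}/${id:4:2}/${id}"
echo "$path"   # → c0/ff/ee/c0ffee1234deadbeef
```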
+In addition to that, it creates two files:
+
+* blobs_error.txt - File containing blobIds for which text extraction ended in error
+* blobs_empty.txt - File containing blobIds for which no text was extracted
+
+This phase is incremental: if it is run multiple times with the same `--store-path`, it skips binaries whose
+text was already extracted in a previous run.
+
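The incremental behaviour can be pictured with a small shell sketch. The file names and the flat store layout here are hypothetical, for illustration only:

```shell
# Hypothetical sketch of incremental extraction: blobs whose extracted-text
# file already exists under the store path are skipped on a re-run.
store=./extracted-text-store
mkdir -p "$store"
touch "$store/blob-abc"      # pretend blob-abc was processed in an earlier run
for id in blob-abc blob-def; do
  if [ -e "$store/$id" ]; then
    echo "skip    $id (already extracted)"
  else
    echo "extract $id"
    touch "$store/$id"       # record it as processed
  fi
done
```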
+Further, the `extract` phase only needs access to the `BlobStore` and does not require access to the NodeStore, so it
+can be run from a different machine (possibly a more powerful one, to make use of multiple cores) to speed up text
+extraction. One can also split the csv file into multiple chunks, process them on different machines, and then merge
+the stores later. Just ensure that the blobs*.txt files are also merged at merge time.
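Merging the per-chunk `blobs*.txt` files can be done with standard tools; a minimal sketch (the directory names and blobIds are made up):

```shell
# Hypothetical sketch: merge blobs_error.txt from two per-chunk stores,
# de-duplicating blobIds that appear in both chunks.
mkdir -p chunk1 chunk2 merged
printf 'blob-a\nblob-b\n' > chunk1/blobs_error.txt
printf 'blob-b\nblob-c\n' > chunk2/blobs_error.txt
sort -u chunk1/blobs_error.txt chunk2/blobs_error.txt > merged/blobs_error.txt
cat merged/blobs_error.txt   # → blob-a, blob-b, blob-c (one per line)
```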
Note that we need to launch the command with `-cp` instead of `-jar`, as we need to include classes outside of the
oak-run jar, such as tika-app.