Posted to oak-commits@jackrabbit.apache.org by ch...@apache.org on 2015/07/15 08:31:45 UTC

svn commit: r1691128 - /jackrabbit/oak/trunk/oak-run/README.md

Author: chetanm
Date: Wed Jul 15 06:31:44 2015
New Revision: 1691128

URL: http://svn.apache.org/r1691128
Log:
OAK-2953 - Implement text extractor as part of oak-run

Update the readme

Modified:
    jackrabbit/oak/trunk/oak-run/README.md

Modified: jackrabbit/oak/trunk/oak-run/README.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-run/README.md?rev=1691128&r1=1691127&r2=1691128&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-run/README.md (original)
+++ jackrabbit/oak/trunk/oak-run/README.md Wed Jul 15 06:31:44 2015
@@ -20,6 +20,7 @@ The following runmodes are currently ava
     * scalability : Run scalability tests against different Oak repository fixtures.
     * recovery    : Run a _lastRev recovery on a MongoMK repository
     * checkpoints : Manage checkpoints
+    * tika        : Performs text extraction
     * help        : Print a list of available runmodes
     
 
@@ -181,6 +182,118 @@ The 'rm-all' option will wipe clean the
 The 'rm-unreferenced' option will remove all checkpoints except the one referenced from the async indexer (/:async@async).
 The 'rm <checkpoint>' option will remove a specific checkpoint from the repository.
 
+<a name="tika"></a>
+Tika
+----
+
+The 'tika' mode performs text extraction, generates summary reports, and
+generates the csv file that text extraction requires.
+
+
+    Apache Jackrabbit Oak 1.4-SNAPSHOT
+    Non-option arguments:                                                         
+    tika [extract|report|generate]                                                
+    report   : Generates a summary report related to binary data                  
+    extract  : Performs the text extraction                                       
+    generate : Generates the csv data file based on configured NodeStore/BlobStore
+    
+    Option                 Description                            
+    ------                 -----------                            
+    -?, -h, --help         show help                              
+    --data-file <File>     Data file in csv format containing the 
+                             binary metadata                      
+    --fds-path <File>      Path of directory used by FileDataStore
+    --nodestore            NodeStore detail                       
+                             /path/to/oak/repository | mongodb:   
+                             //host:port/database                 
+    --path                 Path in repository under which the     
+                             binaries would be searched           
+    --pool-size <Integer>  Size of the thread pool used to        
+                             perform text extraction. Defaults to 
+                             number of cores on the system        
+    --store-path <File>    Path of directory used to store        
+                             extracted text content               
+    --tika-config <File>   Tika config file path   
+
+<a name="tika-csv"></a>
+### CSV File Format
+
+The text extraction tool reads a csv file describing the binary files from
+which text needs to be extracted. Entries in the csv file look like the
+following:
+
+```
+43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/activities/jcr:content/folderThumbnail/jcr:content"
+43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/snowboarding/jcr:content/folderThumbnail/jcr:content"
+...
+```
+
+The columns are in the following order:
+
+1. BlobId - Value of [Jackrabbit ContentIdentity](http://jackrabbit.apache.org/api/2.0/org/apache/jackrabbit/api/JackrabbitValue.html)
+2. Length
+3. jcr:mimeType
+4. jcr:encoding
+5. path of parent node    
+
+The csv file can be generated programmatically. For Oak-based repositories
+it can be generated via the `generate` command.
+
+### Generate
+
+The csv file required by `extract` and `report` can be generated via the
+`generate` command:
+
+    java -jar oak-run.jar tika \
+    --fds-path /path/to/datastore \
+    --nodestore /path/to/segmentstore --data-file dump.csv generate
+
+The above command scans the NodeStore and creates the csv file. This file can
+then be passed to the `extract` command.
+    
+### Report
+
+The tool can generate a summary report from a [csv](#tika-csv) file:
+
+    java -jar oak-run.jar tika \
+        --data-file /path/to/binary-stats.csv report
+
+The report provides a summary like the following:
+
+```
+14:39:05.402 [main] INFO  o.a.j.o.p.tika.TextExtractorMain - MimeType Stats
+        Total size         : 89.3 MB
+        Total indexed size : 3.4 MB
+        Total count        : 1048
+
+               Type                 Indexed   Supported    Count       Size   
+___________________________________________________________________________________
+application/epub+zip              |      true|      true|  1       |    3.4 MB
+image/png                         |     false|      true|  544     |   40.2 MB
+image/jpeg                        |     false|      true|  444     |   34.0 MB
+image/tiff                        |     false|      true|  11      |    6.1 MB
+application/x-indesign            |     false|     false|  1       |    3.7 MB
+application/octet-stream          |     false|     false|  39      |    1.2 MB
+application/x-shockwave-flash     |     false|     false|  4       |  372.2 kB
+application/pdf                   |     false|     false|  3       |  168.3 kB
+video/quicktime                   |     false|     false|  1       |   95.9 kB
+```
+
+### Extract
+
+Extraction can be performed via the following command:
+
+    java -cp oak-run.jar:tika-app-1.8.jar \
+    org.apache.jackrabbit.oak.run.Main tika \
+    --data-file binary-stats.csv \
+    --store-path ./store \
+    --fds-path /path/to/datastore  extract
+    
+You need to provide the tika-app jar, which contains all the parsers.
+It can be downloaded from [here](https://tika.apache.org/download.html).
+Extraction is then performed in multi-threaded mode. Extracted text
+is stored in the directory given by `--store-path`.
+
 Upgrade
 -------