Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2017/05/15 01:45:06 UTC

[Tika Wiki] Update of "TikaAndVisionDL4J" by ThammeGowda

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "TikaAndVisionDL4J" page has been changed by ThammeGowda:
https://wiki.apache.org/tika/TikaAndVisionDL4J?action=diff&rev1=11&rev2=12

Comment:
Added wiki page for DL4J

- ## page was copied from TikaAndVision
- = Tika and Computer Vision - DL4J =
+ = Tika and Computer Vision powered by Deeplearning4j (DL4J) =
+ 
  
  <<TableOfContents(4)>>
  
+ This page describes a way to make Tika perform image recognition. Tika has several image recognition parser implementations; this page covers the one powered by [[https://deeplearning4j.org/|Deeplearning4j]], using an Inception-V3 model pre-trained on the ImageNet dataset.
+ This model can recognize a thousand different object classes in images.
- This page describes how to make use of Object (Visual) Recognition capability of Apache Tika.
- TIKA-1993 introduced a new parser to perform object recognition on images.
- Visit [[https://issues.apache.org/jira/browse/TIKA-1993 | TIKA-1993 issue on Jira ]] or [[ https://github.com/apache/tika/pull/125 | pull request on Github]] to read the related conversation. The model was updated from Inception V3 to Inception V4 with [[https://issues.apache.org/jira/browse/TIKA-2306|TIKA-2306]] ([[https://github.com/apache/tika/pull/163|Pull request on Github]]). Continue reading to get Tika up and running for object recognition.
  
- Currently, Tika utilises [[https://arxiv.org/abs/1602.07261|Inception-V4]] model from [[https://www.tensorflow.org/|Tensorflow]] for recognizing objects in the JPEG images.
+ The advantage of this setup is that the implementation runs entirely inside the Java Virtual Machine (JVM), without depending on any external services.
+ This makes it well suited for users who want to run image recognition on a distributed setup such as Apache Hadoop or Apache Spark.
  
  
+ Note:
+  1. This is a work in progress. This feature was added in Tika 1.15.
+  2. At the time of writing, Tika 1.15 was not released. You have to [[https://github.com/apache/tika|clone the Tika repository]] and run '''mvn clean install''' (see the commands after this list).
+  3. The rest of this page uses version '''1.15-SNAPSHOT'''; if you are reading this after the release, please use '''1.15''' or a newer version.
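+ 
+ The build goes roughly along these lines (a sketch; adjust paths for your environment, and note that the full build can take a while):
+ 
+ {{{#!highlight bash
+ # clone the Tika source and install all modules into the local Maven repository
+ git clone https://github.com/apache/tika.git
+ cd tika
+ mvn clean install    # optionally add -DskipTests to skip the long-running test suite
+ }}}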
- == Tika and Tensorflow Image Recognition ==
- 
- Tika has two different ways of bindings to Tensorflow:
-  1. Using Commandline Invocation -- Recommended for quick testing, not for production use
-  2. Using REST API -- Recommended for production use
- 
- === 1. Tensorflow Using Commandline Invocation ===
- '''Pros of this approach:'''
-   This parser is easy to setup and test
- '''Cons:'''
-   Very inefficient/slow as it loads and unloads model for every parse call
  
  
+ = Java/Groovy/Scala example =
- ==== Step 1. Install the dependencies ====
- To install tensorflow, follow the instructions on [[https://www.tensorflow.org/install/|the official site here]] for your environment.
- Unless you know what you are doing, you are recommended to follow pip installation. 
  
- Then clone the repository [[https://github.com/tensorflow/models|tensorflow/models]] or download the [[https://github.com/tensorflow/models/archive/master.zip|zip file]].
-   {{{git clone https://github.com/tensorflow/models.git}}}
+ Add the '''tika-parsers''' and '''tika-dl''' modules to your project.
  
- Add 'models/slim' folder to the environment variable, PYTHONPATH.
+ '''Here is an example for Apache Maven users:'''
  
-    {{{$ export PYTHONPATH="$PYTHONPATH:/path/to/models/slim"}}}
+ {{{#!highlight xml
  
- To test the readiness of your environment :
+     <dependencies>
+         <dependency>
+             <groupId>org.apache.tika</groupId>
+             <artifactId>tika-parsers</artifactId>
+             <version>1.15-SNAPSHOT</version>
+         </dependency>
+         <dependency>
+             <groupId>org.apache.tika</groupId>
+             <artifactId>tika-dl</artifactId>
+             <version>1.15-SNAPSHOT</version>
+         </dependency>
+     </dependencies>
+ }}}
  
-   {{{$ python -c 'import tensorflow, numpy, datasets; print("OK")'}}}
+ '''A configuration file, tika-config.xml, to activate the image recognition parser:'''
  
- If the above command prints the message "OK", then the requirements are satisfied.
- 
- ==== Step 2. Create a Tika-Config XML to enable Tensorflow parser. ====
-  A sample config can be found in Tika source code at [[https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml|tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml]]
- 
- '''Here is an example:'''
  {{{#!highlight xml
  <properties>
-     <parsers>
+   <parsers>
-         <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
+     <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
-             <mime>image/jpeg</mime>
+       <mime>image/jpeg</mime>
-             <params>
+       <params>
-                 <param name="topN" type="int">2</param>
+         <param name="topN" type="int">10</param>
-                 <param name="minConfidence" type="double">0.015</param>
+         <param name="minConfidence" type="double">0.015</param>
-                 <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowImageRecParser</param>
+         <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JInceptionV3Net</param>
-             </params>
+       </params>
-         </parser>
-     </parsers>
+     </parser>
+   </parsers>
  </properties>
  }}}
+ Note: Refer to the ''Configuration options'' section below for customizing this config.
  
- '''Description of parameters :'''
- {{{#!csv
-   Param Name, Type, Meaning, Range, Example
-   topN, int, Number of object names to output, a non-zero positive integer, 1 to receive top 1 object name
-   minConfidence, double, Minimum confidence required to output the name of detected objects, [0.0 to 1.0] inclusive, 0.9 for outputting object names iff at least 90% confident
-   class, string, Class that implements object recognition functionality, constant string, org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
+ 
+ '''Sample Java code:'''
+ 
+ {{{#!highlight java
+     //load the config that registers the DL4J object recognition parser
+     TikaConfig config;
+     try (InputStream stream = ImageRecLocal.class.getClassLoader()
+             .getResourceAsStream("tika-config.xml")){
+         config = new TikaConfig(stream);
+     }
+     Tika parser = new Tika(config);
+     //sample image file
+     File imageFile = new File("data/gun.jpg");
+     Metadata meta = new Metadata();
+     //parse() returns a Reader over the extracted text; reading it to the end
+     //ensures that parsing has finished and the metadata is fully populated
+     try (Reader reader = parser.parse(imageFile, meta)) {
+         while (reader.read() != -1) { /* drain */ }
+     }
+     //the recognised objects are stored in the "OBJECT" metadata field
+     System.out.println(Arrays.toString(meta.getValues("OBJECT")));
+     // This should print: [assault_rifle (0.78214), rifle (0.18048), revolver (0.02780)]
  }}}
  
+ = For Python and Tika Server users =
  
- ==== Step 3: Demo ====
- To use the vision capability via Tensorflow, just supply the above configuration to Tika.
+  1. Create a file named '''tika-config.xml''' with the content shown above.
+  2. Clone the Tika repository and build the project (see the build commands earlier on this page).
+  3. Add '''tika-dl''' and its dependencies to the classpath of Tika server, as shown below.
  
  
- For example, to use in Tika App (Assuming you have ''tika-app'' JAR and it is ready to run):
- 
- {{{#!bash 
-       $ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar \
-          --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml \
-          https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg
-  }}}
- 
- The input image is:
- 
- {{https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg|Germal Shepherd with Military}}
- 
- And, the top 2 detections are:
- {{{#!highlight xml
+ {{{#!highlight bash
+ alias tika="java -cp tika-server/target/tika-server-1.15-SNAPSHOT.jar:tika-dl/target/tika-dl-1.15-SNAPSHOT-jar-with-dependencies.jar org.apache.tika.server.TikaServerCli "
+ tika --config=tika-config.xml # Starts the server on port 9998
- ...
- <meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.78435)"/>
- <meta name="OBJECT" content="military uniform (0.06694)"/>
- ...
  }}}
  
+ Refer to [[https://github.com/chrismattmann/tika-python|Tika-python]] for example usage from Python.
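+ 
+ Once the server is running, any Tika Server client can talk to it over HTTP. Here is a minimal sketch using '''curl''' (it assumes the default port 9998 and a local JPEG named ''gun.jpg''); the recognised objects appear under the ''OBJECT'' key of the returned metadata:
+ 
+ {{{#!highlight bash
+ # PUT the image to the /meta endpoint and request the metadata as JSON
+ curl -T gun.jpg -H "Accept: application/json" http://localhost:9998/meta
+ }}}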
- 
- === 2. Tensorflow Using REST Server ===
- This is the recommended way for utilizing visual recognition capability of Tika.
- This approach uses Tensorflow over REST API. 
- To get this working, we are going to start a python flask based REST API server and tell tika to connect to it. 
- All these dependencies and setup complexities are isolated in docker image.
  
  
- Requirements :
-   Docker --  Visit [[https://www.docker.com/| Docker.com]] and install latest version of Docker. (Note: tested on docker v17.03.1)
+ = Large Scale Image Recognition In Spark =
+  
+ Coming soon! It is being tested at https://github.com/thammegowda/tika-dl4j-spark-imgrec
  
- ==== Step 1. Setup REST Server ====
- You can either start the REST server in an isolated docker container or natively on the host that runs tensorflow.
  
+ = Configuration options =
- ===== a. Using docker (Recommended) =====
- {{{#!highlight bash
  
- cd tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/ 
- # alternatively, if you do not have tika's source code, you may simply wget the 'InceptionRestDockerfile' from github link 
- docker build -f InceptionRestDockerfile -t inception-rest-tika .
- docker run -p 8764:8764 -it inception-rest-tika
+ So far, we have used three parameters for this parser:
+ {{{#!highlight xml
+ <params>
+     <param name="topN" type="int">10</param>
+     <param name="minConfidence" type="double">0.015</param>
+     <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JInceptionV3Net</param>
+ </params>
  }}}
  
+ The other important parameters are:
+ {{{#!highlight xml
+     <param name="modelWeightsPath" type="string">VALUE</param>
+     <param name="modelJsonPath" type="string">VALUE</param>
- Once it is done, test the setup by visiting [[http://localhost:8764/inception/v4/classify?topk=2&url=https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg]] in your web browser.
- 
- '''Sample output from API:'''
- {{{#!json
- {
-    "confidence":[
-       0.7843596339225769,
-       0.06694009155035019
-    ],
-    "classnames":[
-       "German shepherd, German shepherd dog, German police dog, alsatian",
-       "military uniform"
-    ],
-    "classids":[
-       236,
-       653
-    ],
-    "time":{
-       "read":7403,
-       "units":"ms",
-       "classification":470
-    }
- }
  }}}
  
- Note: MAC USERS:
-       If you are using an older version, say, 'Docker toolbox' instead of the newer 'Docker for Mac',
- you need to add port forwarding rules in your Virtual Box default machine. 
+ The VALUE string can be:
+  1. A file path relative to the class loader (note that such files can also be inside jar files)
+  2. An absolute file path anywhere on the local filesystem
+  3. A remote URL, over HTTP or HTTPS
  
-   1. Open the Virtual Box Manager.
-   2. Select your Docker Machine Virtual Box image.
-   3. Open Settings -> Network -> Advanced -> Port Forwarding.
-   4. Add an appname,Host IP 127.0.0.1 and set both ports to 8764.
+ For example:
+  1. {{{<param name="modelWeightsPath" type="string">inception-model-weights.h5</param>}}}
+  2. {{{<param name="modelWeightsPath" type="string">/usr/share/apache-tika/models/tikainception-model-weights.h5</param>}}}
+  3. {{{<param name="modelWeightsPath" type="string">https://myserver.com/files/apache-tika/models/tikainception-model-weights.h5</param>}}}
+ -----
  
- ===== b. Without Using docker =====
- If you chose to setup REST server without a docker container, you are free to manually install all the required tools specified in the [[ https://github.com/apache/tika/blob/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/InceptionRestDockerfile | docker file]].
- 
- Note: docker file has setup instructions for Ubuntu, you will have to transform those commands for your environment.
- 
- {{{#!highlight bash
-    python tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py  --port 8764
- }}}
- 
- ==== Step 2. Create a Tika-Config XML to enable Tensorflow parser. ====
-  A sample config can be found in Tika source code at [[https://github.com/apache/tika/blob/da82df5e9def9698fd32f85fe706660641d7c31f/tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml|tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml]]
- 
- '''Here is an example:'''
- {{{#!xml
- <properties>
-     <parsers>
-         <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
-             <mime>image/jpeg</mime>
-             <params>
-                 <param name="topN" type="int">2</param>
-                 <param name="minConfidence" type="double">0.015</param>
-                 <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
-             </params>
-         </parser>
-     </parsers>
- </properties>
- }}}
- 
- '''Description of parameters :'''
- {{{#!csv
-   Param Name, Type, Meaning, Range, Example
-   topN, int, Number of object names to output, a non-zero positive integer, 1 to receive top 1 object name
-   minConfidence, double, Minimum confidence required to output the name of detected objects, [0.0 to 1.0] inclusive, 0.9 for outputting object names iff at least 90% confident
-   class, string, Name of class that Implements Object recognition Contract, constant string, org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
-   healthUri, URI, HTTP URL to check availability of API service, any HTTP URL that gets 200 status code when available, http://localhost:8764/inception/v4/ping 
-   apiUri, URI, HTTP URL to POST image data, any HTTP URL that returns data in the JSON format as shown in the sample API output, http://localhost:8764/inception/v4/classify?topk=10 
- }}}
- 
- 
- ==== Step 3. Demo ====
- This demo is same as the Commandline Invocation approach, but this is faster and efficient
- 
- {{{#!bash
-    $ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar \
-     --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml \
-     https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg
- }}}
- 
- The input image is:
- 
- {{https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg|Germal Shepherd with Military}}
- 
- And, the top 2 detections are:
- {{{#!highlight xml
- ...
- <meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.78435)"/>
- <meta name="OBJECT" content="military uniform (0.06694)"/>
- ...
- }}}
- 
- ==== Changing the default topN, API port or URL ====
- To change the defaults, update the parameters in config XML file accordingly
- 
- '''Here is an example scenario:'''
- 
- Run REST API on port 3030, and get top 4 object names if the confidence is above 10%. You may also change host to something else than 'localhost' if required.
- 
- '''Example Config File'''
- {{{#!xml
- <properties>
-     <parsers>
-         <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
-             <mime>image/jpeg</mime>
-             <params>
-                 <param name="topN" type="int">4</param>
-                 <param name="minConfidence" type="double">0.1</param>
-                 <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
-        		<param name="healthUri" type="uri">http://localhost:3030/inception/v4/ping</param>
-        		<param name="apiUri" type="uri">http://localhost:3030/inception/v4/classify?topk=4</param>
-             </params>
-         </parser>
-     </parsers>
- </properties>
- }}}
- 
- '''To Start the service on port 3030:'''
- 
- Using Docker:
- 
-   {{{docker run -it -p 3030:8764 inception-rest-tika}}}
- 
- 
- Without Using Docker:
- 
- {{{python tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py  --port 3030}}}
- 
- 
- ----
- 
- === Questions / Suggestions / Improvements / Feedback ? ===
+ = Questions / Suggestions / Improvements / Feedback ? =
  
    1. If it was useful, let us know on twitter by mentioning [[https://twitter.com/ApacheTika|@ApacheTika]]
    2. If you have questions, let us know by [[https://tika.apache.org/mail-lists.html | using Mailing Lists]]