Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2017/05/15 01:45:06 UTC
[Tika Wiki] Update of "TikaAndVisionDL4J" by ThammeGowda
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaAndVisionDL4J" page has been changed by ThammeGowda:
https://wiki.apache.org/tika/TikaAndVisionDL4J?action=diff&rev1=11&rev2=12
Comment:
Added wiki page for DL4J
- ## page was copied from TikaAndVision
- = Tika and Computer Vision - DL4J =
+ = Tika and Computer Vision powered by Deeplearning4j (DL4J) =
+
<<TableOfContents(4)>>
+ This page describes how to make Tika perform image recognition. Tika has several implementations of image recognition parsers; this page covers the one powered by [[https://deeplearning4j.org/|Deeplearning4j]] (DL4J), which uses an InceptionNet-V3 model pre-trained on the ImageNet dataset.
+ This model can recognize a thousand different object categories in images.
- This page describes how to make use of Object (Visual) Recognition capability of Apache Tika.
- TIKA-1993 introduced a new parser to perform object recognition on images.
- Visit [[https://issues.apache.org/jira/browse/TIKA-1993 | TIKA-1993 issue on Jira ]] or [[ https://github.com/apache/tika/pull/125 | pull request on Github]] to read the related conversation. The model was updated from Inception V3 to Inception V4 with [[https://issues.apache.org/jira/browse/TIKA-2306|TIKA-2306]] ([[https://github.com/apache/tika/pull/163|Pull request on Github]]). Continue reading to get Tika up and running for object recognition.
- Currently, Tika utilises [[https://arxiv.org/abs/1602.07261|Inception-V4]] model from [[https://www.tensorflow.org/|Tensorflow]] for recognizing objects in the JPEG images.
+ The advantage of this setup is that the implementation runs entirely inside a Java Virtual Machine (JVM), without depending on any external services.
+ This makes it well suited to running image recognition on a distributed platform such as Apache Hadoop or Apache Spark.
+ Note:
+ 1. This is a work in progress. The feature was added in Tika 1.15.
+ 2. At the time of writing, Tika 1.15 had not been released, so you have to [[https://github.com/apache/tika|clone the Tika repository]] and run '''mvn clean install'''.
+ 3. The rest of this page uses version '''1.15-SNAPSHOT'''. If you are reading this after the release, use '''1.15''' or a newer version.
- == Tika and Tensorflow Image Recognition ==
-
- Tika has two different ways of bindings to Tensorflow:
- 1. Using Commandline Invocation -- Recommended for quick testing, not for production use
- 2. Using REST API -- Recommended for production use
-
- === 1. Tensorflow Using Commandline Invocation ===
- '''Pros of this approach:'''
- This parser is easy to setup and test
- '''Cons:'''
- Very inefficient/slow as it loads and unloads model for every parse call
+ = Java/Groovy/Scala example =
- ==== Step 1. Install the dependencies ====
- To install tensorflow, follow the instructions on [[https://www.tensorflow.org/install/|the official site here]] for your environment.
- Unless you know what you are doing, you are recommended to follow pip installation.
- Then clone the repository [[https://github.com/tensorflow/models|tensorflow/models]] or download the [[https://github.com/tensorflow/models/archive/master.zip|zip file]].
- {{{git clone https://github.com/tensorflow/models.git}}}
+ Add the '''tika-parsers''' and '''tika-dl''' artifacts to your project.
- Add 'models/slim' folder to the environment variable, PYTHONPATH.
+ '''Here is an example for Apache Maven users:'''
- {{{$ export PYTHONPATH="$PYTHONPATH:/path/to/models/slim"}}}
+ {{{#!highlight xml
- To test the readiness of your environment :
+ <dependencies>
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parsers</artifactId>
+ <version>1.15-SNAPSHOT</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-dl</artifactId>
+ <version>1.15-SNAPSHOT</version>
+ </dependency>
+ </dependencies>
+ }}}
- {{{$ python -c 'import tensorflow, numpy, datasets; print("OK")'}}}
+ '''A configuration file, tika-config.xml, that activates the image recognition model:'''
- If the above command prints the message "OK", then the requirements are satisfied.
-
- ==== Step 2. Create a Tika-Config XML to enable Tensorflow parser. ====
- A sample config can be found in Tika source code at [[https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml|tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml]]
-
- '''Here is an example:'''
{{{#!highlight xml
<properties>
- <parsers>
+ <parsers>
- <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
+ <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- <mime>image/jpeg</mime>
+ <mime>image/jpeg</mime>
- <params>
+ <params>
- <param name="topN" type="int">2</param>
+ <param name="topN" type="int">10</param>
- <param name="minConfidence" type="double">0.015</param>
+ <param name="minConfidence" type="double">0.015</param>
- <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowImageRecParser</param>
+ <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JInceptionV3Net</param>
- </params>
+ </params>
- </parser>
- </parsers>
+ </parser>
+ </parsers>
</properties>
}}}
+ Note: See the "Configuration options" section below for customizing the config.
- '''Description of parameters :'''
- {{{#!csv
- Param Name, Type, Meaning, Range, Example
- topN, int, Number of object names to output, a non-zero positive integer, 1 to receive top 1 object name
- minConfidence, double, Minimum confidence required to output the name of detected objects, [0.0 to 1.0] inclusive, 0.9 for outputting object names iff at least 90% confident
- class, string, Class that implements object recognition functionality, constant string, org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
+
+ '''Sample Java code:'''
+
+ {{{#!highlight java
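+ import java.io.File;
+ import java.io.InputStream;
+ import java.util.Arrays;
+ import org.apache.tika.Tika;
+ import org.apache.tika.config.TikaConfig;
+ import org.apache.tika.metadata.Metadata;
+
+ public class ImageRecLocal {
+   public static void main(String[] args) throws Exception {
+     // Load the Tika configuration that enables the DL4J recogniser
+     TikaConfig config;
+     try (InputStream stream = ImageRecLocal.class.getClassLoader()
+         .getResourceAsStream("tika-config.xml")) {
+       config = new TikaConfig(stream);
+     }
+     Tika parser = new Tika(config);
+     // Sample file
+     File imageFile = new File("data/gun.jpg");
+     Metadata meta = new Metadata();
+     parser.parse(imageFile, meta);
+     // Retrieve the recognised objects from the metadata
+     System.out.println(Arrays.toString(meta.getValues("OBJECT")));
+     // This should print: [assault_rifle (0.78214), rifle (0.18048), revolver (0.02780)]
+   }
+ }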
}}}
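The sample output above suggests that each "OBJECT" value has the form "label (confidence)". The following is a minimal sketch, assuming that layout holds for every value, of how the strings could be split into a label and a numeric score for further processing. The class `RecognisedObjects` and its `Entry` type are hypothetical helpers for illustration and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Tika): splits "label (confidence)" strings. */
final class RecognisedObjects {

    /** One parsed value: the object label and its confidence score. */
    static final class Entry {
        final String label;
        final double confidence;

        Entry(String label, double confidence) {
            this.label = label;
            this.confidence = confidence;
        }
    }

    // Assumed layout, inferred from the sample output: "<label> (<decimal score>)".
    private static final Pattern VALUE =
            Pattern.compile("^(.*)\\s+\\(([0-9]*\\.?[0-9]+)\\)$");

    /** Parses every value that matches the assumed layout; other values are skipped. */
    static List<Entry> parse(String[] values) {
        List<Entry> entries = new ArrayList<>();
        for (String value : values) {
            Matcher m = VALUE.matcher(value.trim());
            if (m.matches()) {
                entries.add(new Entry(m.group(1).trim(), Double.parseDouble(m.group(2))));
            }
        }
        return entries;
    }
}
```

It could be called as `RecognisedObjects.parse(meta.getValues("OBJECT"))`. For the sample output above, the first entry would carry the label `assault_rifle` and the confidence `0.78214`. Scores written in any other notation would be skipped by this sketch.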
+ = For Python and Tika Server users =
- ==== Step 3: Demo ====
- To use the vision capability via Tensorflow, just supply the above configuration to Tika.
+ 1. Create a file named '''tika-config.xml''' with the content shown above.
+ 2. Clone the Tika repository and build the project.
+ 3. Add '''tika-dl''' to the classpath of the Tika server:
- For example, to use in Tika App (Assuming you have ''tika-app'' JAR and it is ready to run):
-
- {{{#!bash
- $ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar \
- --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml \
- https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg
- }}}
-
- The input image is:
-
- {{https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg|Germal Shepherd with Military}}
-
- And, the top 2 detections are:
- {{{#!highlight xml
+ {{{#!highlight bash
+ alias tika="java -cp tika-server/target/tika-server-1.15-SNAPSHOT.jar:tika-dl/target/tika-dl-1.15-SNAPSHOT-jar-with-dependencies.jar org.apache.tika.server.TikaServerCli "
+ tika --config=tika-config.xml # Starts the server on port 9998
- ...
- <meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.78435)"/>
- <meta name="OBJECT" content="military uniform (0.06694)"/>
- ...
}}}
+ Refer to [[https://github.com/chrismattmann/tika-python|Tika-python]] for example usage from Python.
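JVM clients can talk to the same server over HTTP. The following is a minimal sketch, assuming Java 11 or newer, a server started as above on port 9998, and the Tika Server metadata endpoint `/meta` accepting the image via HTTP PUT. The class `TikaServerClient` and its method names are hypothetical and only illustrate the request shape.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical client sketch (not part of Tika) for the Tika Server metadata endpoint. */
final class TikaServerClient {

    /** Builds a PUT request that uploads the image bytes and asks for JSON metadata. */
    static HttpRequest metadataRequest(String serverBase, byte[] imageBytes) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/meta"))
                .header("Accept", "application/json")
                .header("Content-Type", "image/jpeg")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(imageBytes))
                .build();
    }

    /** Sends the image to a running server and returns the raw response body. */
    static String fetchMetadata(String serverBase, Path image) throws Exception {
        HttpRequest request = metadataRequest(serverBase, Files.readAllBytes(image));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the server running, `TikaServerClient.fetchMetadata("http://localhost:9998", Paths.get("data/gun.jpg"))` should return the metadata as JSON, in which the recognised objects are expected under the "OBJECT" key.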
-
- === 2. Tensorflow Using REST Server ===
- This is the recommended way for utilizing visual recognition capability of Tika.
- This approach uses Tensorflow over REST API.
- To get this working, we are going to start a python flask based REST API server and tell tika to connect to it.
- All these dependencies and setup complexities are isolated in docker image.
- Requirements :
- Docker -- Visit [[https://www.docker.com/| Docker.com]] and install latest version of Docker. (Note: tested on docker v17.03.1)
+ = Large Scale Image Recognition In Spark =
+
+ Coming soon! It is being tested at https://github.com/thammegowda/tika-dl4j-spark-imgrec
- ==== Step 1. Setup REST Server ====
- You can either start the REST server in an isolated docker container or natively on the host that runs tensorflow.
+ = Configuration options =
- ===== a. Using docker (Recommended) =====
- {{{#!highlight bash
- cd tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/
- # alternatively, if you do not have tika's source code, you may simply wget the 'InceptionRestDockerfile' from github link
- docker build -f InceptionRestDockerfile -t inception-rest-tika .
- docker run -p 8764:8764 -it inception-rest-tika
+ The configuration shown earlier passes three parameters to this parser:
+ {{{#!highlight xml
+ <params>
+ <param name="topN" type="int">10</param>
+ <param name="minConfidence" type="double">0.015</param>
+ <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JInceptionV3Net</param>
+ </params>
}}}
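To illustrate how '''topN''' and '''minConfidence''' interact, here is a minimal sketch. It assumes that candidates scoring below '''minConfidence''' are dropped first and that at most '''topN''' of the remaining highest-scoring labels are kept. The class `TopNFilter` is hypothetical and is not the parser's actual implementation, which may differ in details such as tie handling or whether the threshold is inclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration (not Tika code) of the topN / minConfidence semantics. */
final class TopNFilter {

    /**
     * Keeps labels whose score is at least minConfidence, sorts them by
     * descending score, and returns at most topN of them.
     * Assumes topN is a positive integer.
     */
    static List<Map.Entry<String, Double>> select(
            Map<String, Double> scores, int topN, double minConfidence) {
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() >= minConfidence) {
                kept.add(entry);
            }
        }
        // Highest confidence first; the sort is stable, so ties keep input order.
        kept.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return new ArrayList<>(kept.subList(0, Math.min(topN, kept.size())));
    }
}
```

Under these assumptions, a setting of topN=10 and minConfidence=0.015 can still yield fewer than ten labels, which is consistent with the three-label sample output shown earlier.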
+ The other important parameters point to the model files:
+ {{{#!highlight xml
+ <param name="modelWeightsPath" type="string">VALUE</param>
+ <param name="modelJsonPath" type="string">VALUE</param>
- Once it is done, test the setup by visiting [[http://localhost:8764/inception/v4/classify?topk=2&url=https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg]] in your web browser.
-
- '''Sample output from API:'''
- {{{#!json
- {
- "confidence":[
- 0.7843596339225769,
- 0.06694009155035019
- ],
- "classnames":[
- "German shepherd, German shepherd dog, German police dog, alsatian",
- "military uniform"
- ],
- "classids":[
- 236,
- 653
- ],
- "time":{
- "read":7403,
- "units":"ms",
- "classification":470
- }
- }
}}}
- Note: MAC USERS:
- If you are using an older version, say, 'Docker toolbox' instead of the newer 'Docker for Mac',
- you need to add port forwarding rules in your Virtual Box default machine.
+ The VALUE string can be:
+ 1. A file path relative to the class loader (note that such files can also be inside JAR files)
+ 2. An absolute file path anywhere on the local filesystem
+ 3. Any remote URL, including HTTP or HTTPS
- 1. Open the Virtual Box Manager.
- 2. Select your Docker Machine Virtual Box image.
- 3. Open Settings -> Network -> Advanced -> Port Forwarding.
- 4. Add an appname,Host IP 127.0.0.1 and set both ports to 8764.
+ For example:
+ 1. {{{<param name="modelWeightsPath" type="string">inception-model-weights.h5</param>}}}
+ 2. {{{<param name="modelWeightsPath" type="string">/usr/share/apache-tika/models/tikainception-model-weights.h5</param>}}}
+ 3. {{{<param name="modelWeightsPath" type="string">https://myserver.com/files/apache-tika/models/tikainception-model-weights.h5</param>}}}
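The three kinds of VALUE can be told apart by their shape. The following is a minimal sketch of one way to classify them, assuming that anything starting with a URL scheme is remote, that anything else forming an absolute path is a local file, and that the rest is looked up through the class loader. The class `ModelLocation` is hypothetical; the order in which the parser actually tries these options is not stated on this page.

```java
import java.nio.file.Paths;
import java.util.regex.Pattern;

/** Hypothetical classifier (not Tika code) for the three kinds of model path values. */
final class ModelLocation {

    enum Kind { REMOTE_URL, ABSOLUTE_FILE, CLASSPATH_RESOURCE }

    // A URL scheme followed by "://", for example "http://" or "https://".
    private static final Pattern URL_SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*://.*");

    /** Classifies a VALUE string by shape only; nothing is opened or downloaded. */
    static Kind classify(String value) {
        String v = value.trim();
        if (URL_SCHEME.matcher(v).matches()) {
            return Kind.REMOTE_URL;
        }
        // Absolute-path detection follows the rules of the host operating system.
        if (Paths.get(v).isAbsolute()) {
            return Kind.ABSOLUTE_FILE;
        }
        return Kind.CLASSPATH_RESOURCE;
    }
}
```

On a Unix-like system, the three example values above would be classified as `CLASSPATH_RESOURCE`, `ABSOLUTE_FILE` and `REMOTE_URL` respectively.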
+ -----
- ===== b. Without Using docker =====
- If you chose to setup REST server without a docker container, you are free to manually install all the required tools specified in the [[ https://github.com/apache/tika/blob/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/InceptionRestDockerfile | docker file]].
-
- Note: docker file has setup instructions for Ubuntu, you will have to transform those commands for your environment.
-
- {{{#!highlight bash
- python tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py --port 8764
- }}}
-
- ==== Step 2. Create a Tika-Config XML to enable Tensorflow parser. ====
- A sample config can be found in Tika source code at [[https://github.com/apache/tika/blob/da82df5e9def9698fd32f85fe706660641d7c31f/tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml|tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml]]
-
- '''Here is an example:'''
- {{{#!xml
- <properties>
- <parsers>
- <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- <mime>image/jpeg</mime>
- <params>
- <param name="topN" type="int">2</param>
- <param name="minConfidence" type="double">0.015</param>
- <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
- </params>
- </parser>
- </parsers>
- </properties>
- }}}
-
- '''Description of parameters :'''
- {{{#!csv
- Param Name, Type, Meaning, Range, Example
- topN, int, Number of object names to output, a non-zero positive integer, 1 to receive top 1 object name
- minConfidence, double, Minimum confidence required to output the name of detected objects, [0.0 to 1.0] inclusive, 0.9 for outputting object names iff at least 90% confident
- class, string, Name of class that Implements Object recognition Contract, constant string, org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
- healthUri, URI, HTTP URL to check availability of API service, any HTTP URL that gets 200 status code when available, http://localhost:8764/inception/v4/ping
- apiUri, URI, HTTP URL to POST image data, any HTTP URL that returns data in the JSON format as shown in the sample API output, http://localhost:8764/inception/v4/classify?topk=10
- }}}
-
-
- ==== Step 3. Demo ====
- This demo is same as the Commandline Invocation approach, but this is faster and efficient
-
- {{{#!bash
- $ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar \
- --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml \
- https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg
- }}}
-
- The input image is:
-
- {{https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg|Germal Shepherd with Military}}
-
- And, the top 2 detections are:
- {{{#!highlight xml
- ...
- <meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.78435)"/>
- <meta name="OBJECT" content="military uniform (0.06694)"/>
- ...
- }}}
-
- ==== Changing the default topN, API port or URL ====
- To change the defaults, update the parameters in config XML file accordingly
-
- '''Here is an example scenario:'''
-
- Run REST API on port 3030, and get top 4 object names if the confidence is above 10%. You may also change host to something else than 'localhost' if required.
-
- '''Example Config File'''
- {{{#!xml
- <properties>
- <parsers>
- <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- <mime>image/jpeg</mime>
- <params>
- <param name="topN" type="int">4</param>
- <param name="minConfidence" type="double">0.1</param>
- <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
- <param name="healthUri" type="uri">http://localhost:3030/inception/v4/ping</param>
- <param name="apiUri" type="uri">http://localhost:3030/inception/v4/classify?topk=4</param>
- </params>
- </parser>
- </parsers>
- </properties>
- }}}
-
- '''To Start the service on port 3030:'''
-
- Using Docker:
-
- {{{docker run -it -p 3030:8764 inception-rest-tika}}}
-
-
- Without Using Docker:
-
- {{{python tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py --port 3030}}}
-
-
- ----
-
- === Questions / Suggestions / Improvements / Feedback ? ===
+ = Questions / Suggestions / Improvements / Feedback ? =
1. If this was useful, let us know on Twitter by mentioning [[https://twitter.com/ApacheTika|@ApacheTika]]
2. If you have questions, ask on the [[https://tika.apache.org/mail-lists.html|mailing lists]]