Posted to dev@tika.apache.org by KranthiGV <gi...@git.apache.org> on 2017/03/29 05:14:09 UTC

[GitHub] tika pull request #163: Tika-2306: Update Inception v3 to Inception v4 in Ob...

GitHub user KranthiGV opened a pull request:

    https://github.com/apache/tika/pull/163

    Tika-2306: Update Inception v3 to Inception v4 in Object recognition parser 

    ## Summary:
    The Object Recognition Parser currently uses the Inception V3 model for object classification. Google has released a newer Inception V4 model [1][2].
    It has an improved Top-1 accuracy of 80.2% and Top-5 accuracy of 95.2% [3].
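
For readers unfamiliar with the metric: Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below illustrates the definition only; the function name and the scores are made up for illustration and are not model output or code from this PR.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(true_labels)


# Made-up scores for 3 samples over 4 classes (illustrative only).
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1: hit at k=1
    [0.50, 0.30, 0.15, 0.05],  # true class 1: miss at k=1, hit at k=2
    [0.40, 0.30, 0.20, 0.10],  # true class 3: miss at k=1 and k=2
]
labels = [1, 1, 3]

print("top-1:", top_k_accuracy(scores, labels, 1))
print("top-2:", top_k_accuracy(scores, labels, 2))
```

Top-5 accuracy is always at least as high as Top-1, because every Top-1 hit is also a Top-5 hit.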
    
    ## Quick setup and Test:
    - Install TensorFlow using pip: [https://www.tensorflow.org/install/](https://www.tensorflow.org/install/)
    - Install TF-slim:
    
    ```
    git clone https://github.com/tensorflow/models/   
    export PYTHONPATH="$PYTHONPATH:/models/slim"   # replace /models/slim with your installation directory
    
    sudo apt-get install libtcmalloc-minimal4   
    export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"   
    ```
    - NOTE: The last two lines work around a TensorFlow issue: [https://github.com/tensorflow/tensorflow/issues/6968](https://github.com/tensorflow/tensorflow/issues/6968). They can be removed once it is fixed.
    - This could be avoided by integrating parts of the tensorflow/models code into our repository, which its Apache license permits.
     
    - Check out the test configuration at `tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml`
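
Before running the parser, it may help to confirm that Python can locate both dependencies from the setup steps above. This is a minimal sketch, assuming that the pip package is importable as `tensorflow` and that TF-slim exposes a top-level `nets` package under `models/slim`; the second name is an assumption, so adjust it to your checkout.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module on sys.path."""
    return importlib.util.find_spec(module_name) is not None


# "nets" is assumed to be the TF-slim package made visible via PYTHONPATH.
for name in ("tensorflow", "nets"):
    print(name, "found" if is_importable(name) else "MISSING")
```

`find_spec` only locates the module and does not import it, so the check stays fast even when TensorFlow is installed.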
    
    ## Demos:
    `java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml tika-parsers/src/test/resources/test-documents/testJPEG.jpg`
    - The output would include:
    ```
    <meta name="OBJECT" content="Egyptian cat (0.31143)"/>  
    <meta name="OBJECT" content="tabby, tabby cat (0.07072)"/>  
    ```
    - NOTE: Only the JPEG format is supported. I will work on support for other formats during GSoC ([https://issues.apache.org/jira/browse/TIKA-2262](https://issues.apache.org/jira/browse/TIKA-2262)).
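
Downstream consumers will likely want the label and the confidence as separate values. The sketch below splits an `OBJECT` value of the form shown above, `label (score)`; the regular expression and the function name are illustrative assumptions, not part of Tika.

```python
import re

# Matches e.g. "tabby, tabby cat (0.07072)": label text, a space, then a score in parentheses.
OBJECT_RE = re.compile(r"^(?P<label>.+) \((?P<score>\d+(?:\.\d+)?)\)$")


def parse_object(content):
    """Split an OBJECT metadata value into (label, confidence)."""
    match = OBJECT_RE.match(content.strip())
    if match is None:
        raise ValueError("unexpected OBJECT value: %r" % content)
    return match.group("label"), float(match.group("score"))


for value in ("Egyptian cat (0.31143)", "tabby, tabby cat (0.07072)"):
    label, score = parse_object(value)
    print(label, score)
```

Anchoring the score at the end of the string keeps labels that contain commas intact.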
    
    # REST API
    ## Start the Inception service on port 8764:
    The API service code is added at `tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py`
    
    A Dockerfile is also added to set up the environment quickly:
    
    ```
    cd tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/
    docker build -f InceptionRestDockerfile -t inception-rest-tika .
    docker run -p 8764:8764 -it inception-rest-tika
    ```
    - Use the config at `tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml`
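
After `docker run`, a quick way to confirm that something is listening on the published port is a plain TCP connection attempt. This sketch checks reachability only and makes no assumption about the HTTP routes in `inceptionapi.py`; the helper name is hypothetical, and `localhost` assumes the container runs on the same machine.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # 8764 is the port published by the docker run command above.
    print("inception service reachable:", port_open("localhost", 8764))
```

A `True` result means only that the port accepts connections; the `API Status` line that Tika logs at startup remains the real health check.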
    
    # Tests and build
    ## Build status
    ```
    [INFO] Reactor Summary:
    [INFO] 
    [INFO] Apache Tika parent ................................ SUCCESS [0.693s]
    [INFO] Apache Tika core .................................. SUCCESS [19.393s]
    [INFO] Apache Tika parsers ............................... SUCCESS [1:02.685s]
    [INFO] Apache Tika XMP ................................... SUCCESS [0.851s]
    [INFO] Apache Tika serialization ......................... SUCCESS [0.924s]
    [INFO] Apache Tika batch ................................. SUCCESS [1:53.792s]
    [INFO] Apache Tika language detection .................... SUCCESS [2.210s]
    [INFO] Apache Tika application ........................... SUCCESS [23.620s]
    [INFO] Apache Tika OSGi bundle ........................... SUCCESS [11.271s]
    [INFO] Apache Tika translate ............................. SUCCESS [1.161s]
    [INFO] Apache Tika server ................................ SUCCESS [26.655s]
    [INFO] Apache Tika examples .............................. SUCCESS [3.562s]
    [INFO] Apache Tika Java-7 Components ..................... SUCCESS [1.040s]
    [INFO] Apache Tika eval .................................. SUCCESS [13.477s]
    [INFO] Apache Tika ....................................... SUCCESS [0.037s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 4:41.751s
    [INFO] Finished at: Wed Mar 29 10:38:00 IST 2017
    [INFO] Final Memory: 158M/1535M
    [INFO] ------------------------------------------------------------------------
    ```
    ## Script based implementation tests
    ```
    timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml tika-parsers/src/test/resources/test-documents/testJPEG.jpg
    WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
    INFO  minConfidence = 0.015, topN=2
    INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
    INFO  Recogniser Available = true
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="Egyptian cat" content="0.31143"/>
    <meta name="tabby, tabby cat" content="0.07072"/>
    <meta name="tiger cat" content="0.04990"/>
    <meta name="Siamese cat, Siamese" content="0.02097"/>
    <meta name="Border collie" content="0.01930"/>
    <title/>
    </head>
    <body><p/>
    </body></html><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
    <meta name="resourceName" content="testJPEG.jpg"/>
    <meta name="Content-Length" content="7686"/>
    <meta name="OBJECT" content="Egyptian cat (0.31143)"/>
    <meta name="OBJECT" content="tabby, tabby cat (0.07072)"/>
    <meta name="Content-Type" content="image/jpeg"/>
    <title/>
    </head>
    <body><ol id="objects">	<li id="Egyptian cat"> Egyptian cat [eng](confidence = 0.311430 )</li>
    	<li id="tabby, tabby cat"> tabby, tabby cat [eng](confidence = 0.070720 )</li>
    </ol>
    
    timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/US_Navy_100714-N-4965F-174_Chief_Mass_Communication_Specialist_Paula_Ludwick%2C_assigned_to_Fleet_Combat_Camera_Group_Pacific%2C_shoots_at_a_target_during_a_Navy_Rifle_Qualification_Course.jpg/220px-thumbnail.jpg
    WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
    INFO  minConfidence = 0.015, topN=2
    INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
    INFO  Recogniser Available = true
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="military uniform" content="0.00355"/>
    <meta name="bulletproof vest" content="0.00397"/>
    <meta name="revolver, six-gun, six-shooter" content="0.00176"/>
    <meta name="assault rifle, assault gun" content="0.84119"/>
    <meta name="rifle" content="0.08642"/>
    <title/>
    </head>
    <body><p/>
    </body></html><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
    <meta name="resourceName" content="220px-thumbnail.jpg"/>
    <meta name="Content-Length" content="9535"/>
    <meta name="OBJECT" content="assault rifle, assault gun (0.84119)"/>
    <meta name="OBJECT" content="rifle (0.08642)"/>
    <meta name="Content-Type" content="image/jpeg"/>
    <title/>
    </head>
    <body><ol id="objects">	<li id="assault rifle, assault gun"> assault rifle, assault gun [eng](confidence = 0.841190 )</li>
    	<li id="rifle"> rifle [eng](confidence = 0.086420 )</li>
    </ol>
    </body></html>
    
    timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml https://upload.wikimedia.org/wikipedia/commons/8/8d/Glock17.jpg
    WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
    INFO  minConfidence = 0.015, topN=2
    INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
    INFO  Recogniser Available = true
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="hatchet" content="0.00087"/>
    <meta name="revolver, six-gun, six-shooter" content="0.89842"/>
    <meta name="holster" content="0.02361"/>
    <meta name="assault rifle, assault gun" content="0.01820"/>
    <meta name="rifle" content="0.00943"/>
    <title/>
    </head>
    <body><p/>
    </body></html><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
    <meta name="resourceName" content="Glock17.jpg"/>
    <meta name="Content-Length" content="368025"/>
    <meta name="OBJECT" content="revolver, six-gun, six-shooter (0.89842)"/>
    <meta name="OBJECT" content="holster (0.02361)"/>
    <meta name="Content-Type" content="image/jpeg"/>
    <title/>
    </head>
    <body><ol id="objects">	<li id="revolver, six-gun, six-shooter"> revolver, six-gun, six-shooter [eng](confidence = 0.898420 )</li>
    	<li id="holster"> holster [eng](confidence = 0.023610 )</li>
    </ol>
    </body></html>
    
    timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
    WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
    INFO  minConfidence = 0.015, topN=2
    INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
    INFO  Recogniser Available = true
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="car wheel" content="0.07900"/>
    <meta name="convertible" content="0.04067"/>
    <meta name="sports car, sport car" content="0.75443"/>
    <meta name="beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon" content="0.00955"/>
    <meta name="grille, radiator grille" content="0.01366"/>
    <title/>
    </head>
    <body><p/>
    </body></html><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
    <meta name="resourceName" content="650x366"/>
    <meta name="Content-Length" content="42377"/>
    <meta name="OBJECT" content="sports car, sport car (0.75443)"/>
    <meta name="OBJECT" content="car wheel (0.07900)"/>
    <meta name="Content-Type" content="image/jpeg"/>
    <title/>
    </head>
    <body><ol id="objects">	<li id="sports car, sport car"> sports car, sport car [eng](confidence = 0.754430 )</li>
    	<li id="car wheel"> car wheel [eng](confidence = 0.079000 )</li>
    </ol>
    </body></html>
    ```
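
In each run above, the script reports five raw scores but only two `OBJECT` entries, which is consistent with the logged settings `minConfidence = 0.015, topN=2`. The sketch below reproduces that selection for the last example. The function is illustrative, and the order of operations (threshold first, then sort, then truncate) is an assumption rather than the actual `ObjectRecognitionParser` logic.

```python
def select_objects(raw_scores, min_confidence=0.015, top_n=2):
    """Keep labels at or above min_confidence, highest score first, at most top_n of them."""
    kept = [(label, score) for label, score in raw_scores.items() if score >= min_confidence]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


# Raw scores copied from the Tesla Model 3 run in the log above.
raw = {
    "car wheel": 0.07900,
    "convertible": 0.04067,
    "sports car, sport car": 0.75443,
    "beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon": 0.00955,
    "grille, radiator grille": 0.01366,
}

for label, score in select_objects(raw):
    print("%s (%.5f)" % (label, score))
```

This prints `sports car, sport car (0.75443)` and `car wheel (0.07900)`, matching the two `OBJECT` entries in the log.
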
    ## REST API based implementation tests
    ```
    timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
    WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
    INFO  Available = true, API Status = HTTP/1.0 200 OK
    INFO  minConfidence = 0.015, topN=2
    INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
    INFO  Recogniser Available = true
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
    <meta name="resourceName" content="650x366"/>
    <meta name="Content-Length" content="42377"/>
    <meta name="OBJECT" content="sports car, sport car (0.75443)"/>
    <meta name="OBJECT" content="car wheel (0.07900)"/>
    <meta name="Content-Type" content="image/jpeg"/>
    <title/>
    </head>
    <body><ol id="objects">	<li id="sports car, sport car"> sports car, sport car [en](confidence = 0.754434 )</li>
    	<li id="car wheel"> car wheel [en](confidence = 0.079000 )</li>
    </ol>
    </body></html>
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KranthiGV/tika TIKA-2306

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #163
    
----
commit 236db96393d94756dbc2e3f40b318f8f93b95dff
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T20:22:02Z

    fix for TIKA-2306 contributed by kranthigv

commit 0c0bd4bec2312355d2bc48426f8ec94306d0e4a0
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T20:28:52Z

    fix for TIKA-2306 contributed by kranthigv

commit cb8f8f5e7ea2b4e13853e6dfc2165127521d9c64
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T20:51:36Z

    fix the image

commit c7f27b561ac1a44a35d3f7fd7881daf5dae8b835
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T22:17:59Z

    inceptionapi.py file added for REST API feature

commit 1fc82e84cc27f60cc64c7844e36bdab2d3c85e7c
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T22:43:41Z

    fix the destination directory

commit 900e4cfff9c5036bceba6a2f6cda1a9c942d3fa7
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T23:04:06Z

    fix no variables to save

commit 0341a5d25dececf799746d6906963496a5256f11
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T23:17:02Z

    unexpected argument

commit b9f496c68b27e64f1eddca212db88e3444051cc5
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T23:19:59Z

    undefined variable

commit f8c51bab139f0b7c8d9ea070ae40c87bbaf87689
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T23:25:10Z

    undefined variable

commit d199692b650edaaf743ca6cfc5c34954baf8831d
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-26T23:29:28Z

    undefined variable

commit 0eedec8c62cf5e6ddee4f14ca4b4fa59d2930be5
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-27T02:07:19Z

    Working inceptionapi.py without comments

commit 09cb2df973f20e3a877ca1309b67384264650be0
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-29T03:51:31Z

    fix for TIKA-2306 contributed by kranthigv

commit f92809ac19d5bef903ef1ac393092e6a13884fc0
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-29T03:55:01Z

    fix for TIKA-2306 contributed by kranthigv

commit be773cacaf3c344c11fff9b85ebaf1d0dc8b5174
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-29T04:11:48Z

    fix for TIKA-2306 contributed by kranthigv

commit 75a2ae12d170fc99b4bf9ab266c6169859c23dda
Author: Kranthi Kiran GV <kr...@gmail.com>
Date:   2017-03-29T05:09:22Z

    Changed models repo to a forked repo for future compatibility

----

