Posted to dev@tika.apache.org by ThejanW <gi...@git.apache.org> on 2017/03/18 09:05:57 UTC

[GitHub] tika pull request #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version...

GitHub user ThejanW opened a pull request:

    https://github.com/apache/tika/pull/158

    TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser

    Right now, TesseractOCRParser invokes tesseract and ImageMagick from the command line. The new parser, "Tess4jOCRParser", uses the Tess4J API instead, so tesseract no longer has to be executed out of process via Runtime.exec. Please see TIKA-2293 for more information.
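
For readers unfamiliar with the out-of-process pattern the description refers to, here is a minimal sketch of what it roughly involves: build a tesseract command line, spawn the process, wait for it to exit, and read the text file it writes. This is not Tika's actual code; the class name `CommandLineOcrSketch` and its methods are hypothetical, and the sketch assumes a `tesseract` binary on the PATH that accepts `tesseract <image> <outputbase> -l <lang>` and writes `<outputbase>.txt`.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Tika: illustrates the out-of-process pattern only. */
final class CommandLineOcrSketch {

    /** Builds the argument list for: tesseract <image> <outputBase> -l <lang> */
    static List<String> buildCommand(File image, File outputBase, String lang) {
        return Arrays.asList("tesseract", image.getPath(), outputBase.getPath(), "-l", lang);
    }

    /** Spawns tesseract, waits for it to finish, and reads the text file it produced. */
    static String ocr(File image, String lang) throws IOException, InterruptedException {
        // Assumption: tesseract appends ".txt" to the output base name.
        File outputBase = File.createTempFile("ocr-sketch", "");
        File textFile = new File(outputBase.getPath() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                    .inheritIO() // let the child write to our stdout/stderr so it cannot block
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile.toPath()), StandardCharsets.UTF_8);
        } finally {
            // Best-effort cleanup of the temporary files.
            outputBase.delete();
            textFile.delete();
        }
    }
}
```

With Tess4J, the process spawning and temporary output file would presumably be replaced by an in-process call such as `new Tesseract().doOCR(imageFile)`, assuming the Tess4J native libraries and tessdata are available at runtime.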


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ThejanW/tika master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #158
    
----
commit 6d6128f02099f4453f1876328c933ede17f7b559
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-11T05:27:38Z

    Tess4JOCRParser class implemented successfully. I can extract content through Handler now.

commit 5a44b86807a594318d06d47e8bb890c3cfd7654b
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-11T05:31:44Z

    Tess4JOCRParser class implemented successfully. I can extract content through Handler now.

commit def106014347330a8500cf3f615eb49bcd23ca22
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-11T05:58:07Z

    TODO: Test time evaluations

commit 825447f39b39fa83180611091067ed1a6373b9d7
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-11T09:31:48Z

    Wrote the test case to compare the two parsers.

commit ecbe7a8773ffed723bf9a2a420a64b62ac0860e9
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-11T11:17:13Z

    Added test images, Reformatted the ocr parser test case

commit 31c4fb0f0cda5e6df102d09762f2e93aae0e5c4d
Author: Thamme Gowda <th...@apache.org>
Date:   2017-03-11T14:42:15Z

    Merge branch 'master' of https://github.com/ThejanW/tika into thejan-tess4j

commit 4c87364003a7f0dec86932b4e1b28291432e5fcb
Author: Thamme Gowda <th...@apache.org>
Date:   2017-03-11T15:54:55Z

    performance improvements + code clean

commit 9e672e9da7ff35400c24b21644924b99563999c2
Author: Thejan Wijesinghe <th...@cse.mrt.ac.lk>
Date:   2017-03-11T17:43:38Z

    Merge pull request #1 from thammegowda/thejan-tess4j
    
    Performance improvements and Fixes

commit f5f07429e96f32c3e718ccfec8a3163916b29448
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-12T05:51:40Z

    Excluded tess4J from bringing log4j-over-slf4j.jar + some code reformatting

commit 25bd1c2eb47db7ccc3a30d12fe199c77d2303e8a
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-12T09:00:41Z

    Deleted the use of extractHOCROutput method + Enabled Tesseract's quiet command line option + Code reformatting

commit e41250af6ca27158d209896577e6e305abcbcb52
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-12T10:19:08Z

    Performance improvements

commit 94a2a70add233f53779011325a7ff0c94e4e91d7
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-15T11:57:23Z

    Set org.apache.tika.parser.Parser to default.

commit 75e185a10884d6afe08555050f676f8ea95d66be
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-15T13:03:57Z

    Merge branch 'master' of https://github.com/apache/tika
    
    Syncing with the upstream.

commit 260e9cec23f0bde2de975ac7142132b7ffa1cf17
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-17T17:29:49Z

    TIKA - 2293
    
    # Relocate test images
    # Add deskewing functionality for skewed images
    # Add new unit tests

commit 69504c54ffda2c93fb8205e88dd82b3a119455f4
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-17T17:33:24Z

    TIKA - 2293
    
    # Add test images

commit 4058e49f29d081d45bbe84b3ac75267e2a8d7cf0
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-17T17:46:20Z

    TIKA - 2293
    
    # Fix minor error in test document path in runBenchmark unit test

commit 1a188aa2dcd8eb30086fd297cbeb31cfe47f0863
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-18T07:39:15Z

    TIKA - 2293
    
    # change the tesseract model to volatile
    # add informative comments

commit dd3f3a299b2d7bb742a4fc12133ef500c68a2439
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-18T07:49:42Z

    Merge remote-tracking branch 'upstream/master'
    
    Sync with upstream

commit ac6677dab44cef7e1de20181201f4f14103c3d71
Author: ThejanW <th...@cse.mrt.ac.lk>
Date:   2017-03-18T08:51:03Z

    TIKA - 2293
    
    # remove unnecessary test cases and test images from master

----

