Posted to dev@tika.apache.org by ThejanW <gi...@git.apache.org> on 2017/03/18 09:05:57 UTC
[GitHub] tika pull request #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version...
GitHub user ThejanW opened a pull request:
https://github.com/apache/tika/pull/158
TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
Currently, TesseractOCRParser invokes tesseract and ImageMagick from the command line. The new parser, "Tess4jOCRParser", instead uses the Tess4J API, replacing the Runtime.exec approach of running tesseract out of process. Please see TIKA-2293 for more information.
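For reviewers unfamiliar with the two styles, here is a minimal sketch of the difference. It is not the code in this pull request: the class and method names (OcrInvocationSketch, buildTesseractCommand, runOutOfProcess) are hypothetical, and the Tess4J calls appear only in a comment because the library is not on the classpath of this standalone example. The out-of-process path assumes a tesseract binary on the PATH.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names here are not taken from the pull request.
class OcrInvocationSketch {

    // Out-of-process style: build the argument vector for
    // "tesseract <image> <outputbase> -l <lang>".
    static List<String> buildTesseractCommand(Path image, Path outputBase, String language) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs the external tesseract binary and reads the text file it writes.
    // Assumes tesseract is installed and on the PATH.
    static String runOutOfProcess(Path image, Path outputBase, String language) throws Exception {
        ProcessBuilder builder = new ProcessBuilder(buildTesseractCommand(image, outputBase, language));
        builder.redirectErrorStream(true);
        // Discard console output so a chatty child process cannot block on a full pipe.
        builder.redirectOutput(ProcessBuilder.Redirect.to(Files.createTempFile("tess", ".log").toFile()));
        Process process = builder.start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("tesseract exited with code " + exitCode);
        }
        // tesseract appends ".txt" to the output base name.
        Path result = Paths.get(outputBase.toString() + ".txt");
        return new String(Files.readAllBytes(result), StandardCharsets.UTF_8);
    }

    // In-process style with Tess4J (requires the tess4j dependency, so shown as a comment):
    //
    //   ITesseract tesseract = new net.sourceforge.tess4j.Tesseract();
    //   tesseract.setLanguage("eng");
    //   String text = tesseract.doOCR(image.toFile());

    public static void main(String[] args) {
        // Only prints the command; nothing is executed here.
        System.out.println(buildTesseractCommand(Paths.get("scan.png"), Paths.get("out"), "eng"));
    }
}
```

The in-process variant avoids spawning a process and the temporary output file, at the cost of loading the native Tesseract library into the JVM.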
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ThejanW/tika master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tika/pull/158.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #158
----
commit 6d6128f02099f4453f1876328c933ede17f7b559
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-11T05:27:38Z
Tess4JOCRParser class implemented successfully. I can extract content through Handler now.
commit 5a44b86807a594318d06d47e8bb890c3cfd7654b
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-11T05:31:44Z
Tess4JOCRParser class implemented successfully. I can extract content through Handler now.
commit def106014347330a8500cf3f615eb49bcd23ca22
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-11T05:58:07Z
TODO: Test time evaluations
commit 825447f39b39fa83180611091067ed1a6373b9d7
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-11T09:31:48Z
Wrote the test case to compare the two parsers.
commit ecbe7a8773ffed723bf9a2a420a64b62ac0860e9
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-11T11:17:13Z
Added test images, Reformatted the ocr parser test case
commit 31c4fb0f0cda5e6df102d09762f2e93aae0e5c4d
Author: Thamme Gowda <th...@apache.org>
Date: 2017-03-11T14:42:15Z
Merge branch 'master' of https://github.com/ThejanW/tika into thejan-tess4j
commit 4c87364003a7f0dec86932b4e1b28291432e5fcb
Author: Thamme Gowda <th...@apache.org>
Date: 2017-03-11T15:54:55Z
performance improvements + code clean
commit 9e672e9da7ff35400c24b21644924b99563999c2
Author: Thejan Wijesinghe <th...@cse.mrt.ac.lk>
Date: 2017-03-11T17:43:38Z
Merge pull request #1 from thammegowda/thejan-tess4j
Performance improvements and Fixes
commit f5f07429e96f32c3e718ccfec8a3163916b29448
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-12T05:51:40Z
Excluded tess4J from bringing log4j-over-slf4j.jar + some code reformatting
commit 25bd1c2eb47db7ccc3a30d12fe199c77d2303e8a
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-12T09:00:41Z
Deleted the use of extractHOCROutput method + Enabled Tesseract's quiet command line option + Code reformatting
commit e41250af6ca27158d209896577e6e305abcbcb52
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-12T10:19:08Z
Performance improvements
commit 94a2a70add233f53779011325a7ff0c94e4e91d7
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-15T11:57:23Z
Set org.apache.tika.parser.Parser to default.
commit 75e185a10884d6afe08555050f676f8ea95d66be
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-15T13:03:57Z
Merge branch 'master' of https://github.com/apache/tika
Syncing with the upstream.
commit 260e9cec23f0bde2de975ac7142132b7ffa1cf17
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-17T17:29:49Z
TIKA - 2293
# Relocate test images
# Add deskewing functionality for skewed images
# Add new unit tests
commit 69504c54ffda2c93fb8205e88dd82b3a119455f4
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-17T17:33:24Z
TIKA - 2293
# Add test images
commit 4058e49f29d081d45bbe84b3ac75267e2a8d7cf0
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-17T17:46:20Z
TIKA - 2293
# Fix minor error in test document path in runBenchmark unit test
commit 1a188aa2dcd8eb30086fd297cbeb31cfe47f0863
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-18T07:39:15Z
TIKA - 2293
# change the tesseract model to volatile
# add informative comments
commit dd3f3a299b2d7bb742a4fc12133ef500c68a2439
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-18T07:49:42Z
Merge remote-tracking branch 'upstream/master'
Sync with upstream
commit ac6677dab44cef7e1de20181201f4f14103c3d71
Author: ThejanW <th...@cse.mrt.ac.lk>
Date: 2017-03-18T08:51:03Z
TIKA - 2293
# remove unnecessary test cases and test images from master
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---