Posted to dev@tika.apache.org by "Peter Kronenberg (Jira)" <ji...@apache.org> on 2021/01/14 14:00:00 UTC

[jira] [Updated] (TIKA-3272) Improve Rotation handling

     [ https://issues.apache.org/jira/browse/TIKA-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Kronenberg updated TIKA-3272:
-----------------------------------
    Description: 
 * Discussed with Tim on the mailing list: avoid the call to rotate if the angle of rotation is 0.
 * Also, allow rotation on its own, without the other pre-processing, which has a lot more overhead.
 * Replace rotation.py (and the Python dependency) with calls to Tess4j classes (2 classes are extracted from the Tess4j package to avoid importing the entire package).

 

*ApplyRotation* and *EnableImageProcessing* have been separated so that *ApplyRotation* does not depend on *EnableImageProcessing*. Doing all the pre-processing with _ImageMagick_ adds a lot of overhead. Doing the rotation by itself is much quicker. It’s not clear whether rotation is the fastest of the operations, or whether any of them would be faster when run on its own. But with rotation, there is an easy way to determine whether the document needs it, which is not the case for the other operations.

If *ApplyRotation*=True and *EnableImageProcessing*=False, then _ImageMagick_ is called *just* to fix the rotation, and only if the rotation angle is nonzero. If the angle is 0, then we don’t call _ImageMagick_ at all.

If both *ApplyRotation* and *EnableImageProcessing* are True, then we call _ImageMagick_ to do all the pre-processing, but we only include rotation if the angle is nonzero.

When determining whether the current angle of rotation is 0, any angle where -1.0 < angle < 1.0 is treated as 0. The code that determines the angle appears to return 0 anyway for anything in this range. Treating these angles as 0 does not affect the accuracy of the OCR result.
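
The rules above can be summarized in a small decision function. This is only a sketch to illustrate the intended behavior, not the code in the patch: the class, enum, and method names are made up for this example, and it assumes that a negative angle also counts as "needs rotation" and that *EnableImageProcessing*=True with *ApplyRotation*=False means pre-processing without rotation.

```java
/** Illustrative only: names here are hypothetical, not Tika's actual API. */
final class RotationDecision {

    /** What the OCR pre-step would do with ImageMagick. */
    enum Action { SKIP_IMAGEMAGICK, ROTATE_ONLY, PREPROCESS_ONLY, PREPROCESS_AND_ROTATE }

    // Angles strictly between -1.0 and 1.0 are treated as 0.
    static final double ZERO_THRESHOLD = 1.0;

    static boolean isEffectivelyZero(double angle) {
        return angle > -ZERO_THRESHOLD && angle < ZERO_THRESHOLD;
    }

    static Action decide(boolean applyRotation, boolean enableImageProcessing, double angle) {
        // Rotation is only worth doing when requested and the angle is not (effectively) 0.
        boolean rotate = applyRotation && !isEffectivelyZero(angle);
        if (enableImageProcessing) {
            // Full pre-processing always runs; rotation is included only when needed.
            return rotate ? Action.PREPROCESS_AND_ROTATE : Action.PREPROCESS_ONLY;
        }
        // No pre-processing: call ImageMagick just for rotation, or not at all.
        return rotate ? Action.ROTATE_ONLY : Action.SKIP_IMAGEMAGICK;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, false, 0.4));  // SKIP_IMAGEMAGICK
        System.out.println(decide(true, false, 12.5)); // ROTATE_ONLY
        System.out.println(decide(true, true, 0.0));   // PREPROCESS_ONLY
        System.out.println(decide(true, true, -3.0));  // PREPROCESS_AND_ROTATE
    }
}
```

The point of keeping the zero check in one place is that both configurations (rotation only, and rotation plus pre-processing) share the same threshold.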

The dependency on Python has been removed.  This includes:
 * pythonPath in TesseractOCRConfig
 * The test code that checks whether Python is on the system, can be run, and has all the prerequisites.

  was:
 * Discussed with Tim on the mailing list about avoiding the call to rotate if the angle of rotation is 0.
 * Also, allowing just rotation, without doing the other pre-processing, which has a lot more overhead.
 * Replace rotation.py (and Python dependency) with calls to Tess4j classes (2 classes are extracted from the Tess4j package to avoid importing the entire package)


> Improve Rotation handling
> -------------------------
>
>                 Key: TIKA-3272
>                 URL: https://issues.apache.org/jira/browse/TIKA-3272
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major


