Posted to dev@spamassassin.apache.org by bu...@spamassassin.apache.org on 2022/08/14 04:08:49 UTC

[Bug 8026] New: t/extracttext.t tesseract test fails on some installations

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8026

            Bug ID: 8026
           Summary: t/extracttext.t tesseract test fails on some
                    installations
           Product: Spamassassin
           Version: 4.0.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Regression Tests
          Assignee: dev@spamassassin.apache.org
          Reporter: sidney@sidney.com
  Target Milestone: Undefined

On my copy of FreeBSD 13.1-RELEASE installed on a VirtualBox VM with tesseract
5.1.0 installed from FreeBSD's pkg repository, test t/extracttext.t
consistently fails because tesseract reads the "XJ" characters in the test jpg
file as "X]J".

Recreating the test file using a font that is more tesseract-friendly seems to
help. Since the test is not intended to test the limits of tesseract's OCR
capabilities, this seems like a proper fix. I've redone the test data using the
TeX Gyre Bonum font, as per the results in https://superuser.com/a/1543382
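
To illustrate why a single misread character is enough to fail the test, here
is a minimal sketch in Python. It assumes the test looks for an exact expected
string in the OCR output; the function name and the sample string are chosen
for illustration and are not taken from t/extracttext.t.

```python
def contains_expected(ocr_text: str, expected: str) -> bool:
    """Return True only if the expected string appears verbatim in the OCR text."""
    return expected in ocr_text


# Sample string for illustration only (assumption, not copied from the test).
expected = "XJS*C4JDBQADN1"

# What a clean OCR pass would produce.
clean_ocr = "XJS*C4JDBQADN1"

# Simulate the misread described above: "XJ" recognised as "X]J".
misread_ocr = clean_ocr.replace("XJ", "X]J", 1)

print(contains_expected(clean_ocr, expected))    # True
print(contains_expected(misread_ocr, expected))  # False
```

One spurious "]" breaks the match, which is why making the image easier to
read is a simpler fix than loosening the comparison.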

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 8026] t/extracttext.t tesseract test fails on some installations

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8026

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #1 from Sidney Markowitz <si...@sidney.com> ---
As pointed out in another comment on the superuser answer linked in the
previous comment, the font used seems to be less important than the font size.
After initial experiments worked on FreeBSD but failed in different ways on
macOS, I found settings that succeed with the available versions of
tesseract on all platforms I tried.

These tests revealed a bug when tesseract is installed in a directory that has
a space in the pathname, but that is a more minor issue. See bug 8027.
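
The details of bug 8027 are not given here. As a general illustration of how a
space in an executable's path can break an external command, here is a minimal
Python sketch. The install path and function names are hypothetical, and this
is not how SpamAssassin builds its command line.

```python
def naive_argv(tool_path: str, image_path: str) -> list[str]:
    """Build one command string, then split on whitespace (fragile)."""
    command = f"{tool_path} {image_path} stdout"
    return command.split()


def safe_argv(tool_path: str, image_path: str) -> list[str]:
    """Keep every argument as its own list element, so spaces survive."""
    return [tool_path, image_path, "stdout"]


# Hypothetical install location containing a space.
tool = "/opt/ocr tools/bin/tesseract"

print(naive_argv(tool, "sample.png"))
# ['/opt/ocr', 'tools/bin/tesseract', 'sample.png', 'stdout']
print(safe_argv(tool, "sample.png"))
# ['/opt/ocr tools/bin/tesseract', 'sample.png', 'stdout']
```

The list form can be passed directly to subprocess.run without going through a
shell, so the path is never split at the space.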

trunk % svn ci -m "bug 8026 - Update extracttest.t with test data that works
with more versions of tesseract"
Sending        MANIFEST
Deleting       t/data/spam/extracttext/gtube_jpg.eml
Adding         t/data/spam/extracttext/gtube_png.eml
Sending        t/extracttext.t
Transmitting file data ...done
Committing transaction...
Committed revision 1903411.


[Bug 8026] t/extracttext.t tesseract test fails on some installations

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8026

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sidney@sidney.com
   Target Milestone|Undefined                   |4.0.0
