You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@spamassassin.apache.org on 2019/06/25 07:43:19 UTC

[Bug 7727] New Plugin TesseractOcr

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |apache@hege.li

--- Comment #1 from Henrik Krohns <ap...@hege.li> ---

Hi there, some comments. I'm always blunt, so keep that in mind. :-)

- Quite hesitant about $p->set_rendered() use and the whole concept of adding
"random crap" to rendered body, which could have many side effects. Nothing
uses set_rendered() in current codebase, there needs to be some investigation
if it actually works properly, how module priorities should be set, whether
stuff end up in Bayes and other plugins that read rendered body etc. Of course
this is also bad performance wise if all tests have to wait for OCR to
complete. An async model would be much more generally usable, but it would have
to have own rules (FuzzyOCR style?). Or perhaps body rules could be tflagged
with "ocr" so they can be run independently against that data.

- Absolutely no reason to use a 9 year old Image::OCR::Tesseract module, from
an author who seems to pretty much left CPAN business long time ago. The
modules only purpose seems to be running tesseract/convert shell commands, for
which a proper SA way is using $pms->enter_helper_run_mode() / Timeout /
helper_app_pipe_open - see Pyzor.pm for example.

- It is _horrible_ security wise to use mime provided $fname for a filesystem
name. Just use the static counter, and $ext if needed to construct a fixed safe
name. This also removes need for the quite iffy chr(65+$unique++) loop..

- Unneeded dependency File::Which, should use
Mail::SpamAssassin::Util::find_executable_in_env_path

- Proper way to combine path and filename string: File::Spec->catfile($tmpdir,
$fname)

- "print PICT" should also check for errors (disk could be full?)

- Should not use parse_config to accept any random tocr_* line, use the
set_config() stanza that defines exact supported options, and sets their
defaults and types, validated

- tesseract_type() has unused $w,$h (remove?)

- should use more common dbg method (also use info() only for serious errors,
not for "found attachment" things)

sub dbg {
  my $msg = shift;
  Mail::SpamAssassin::Plugin::dbg("TesseractOct: $msg", @_);
}

- A bit shame to lock into tesseract, as people had good success with gocr
previously. Would have been nice to be able to select engine(s). Did you try
gocr?


I would not delay 3.4.3 any further, there's really no need to include this
there in this state. Just improve it in trunk, if voted so? I'm still +-0
because of all the architectural issues. Of course having an experimental and
disabled by default plugin there is no problem.

-- 
You are receiving this mail because:
You are the assignee for the bug.