Posted to user@tika.apache.org by Tim Allison <ta...@apache.org> on 2021/01/12 16:58:57 UTC

PDFs and detectAngles

This is a follow-up to an earlier discussion.  I compared running our
PDFParser with and without "detectAngles" on the 10k set of PDFs that I've
been using recently.

DetectAngles is not related to image processing or OCR; rather, when this
parameter is set to "true", the PDFParser relies on information about the
orientation of the text runs to stitch the runs back into more accurate
words/lines/sentences.

The results are here:
https://corpora.tika.apache.org/base/reports/detect_angles.tgz

It takes roughly 3x as long to run with "detectAngles" on the test set of
10k PDFs (10 threads, wallclock: 41 seconds without vs 141 seconds with).
There was a 0.6% increase in common tokens.  For a few files, the
improvement was _dramatic_.  I also suspect that our unigram/bag-of-words
approach does not measure improvements in multi-word text runs/sentences.
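
To illustrate why a bag-of-words count can miss ordering improvements, here
is a toy sketch. It is not the code behind the report; the class name, the
helper methods and the sample strings are all made up for illustration. Two
strings with the same words in a different order share every unigram but no
bigrams:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class TokenOverlap {

    // Count how often each item occurs (a "bag").
    static Map<String, Integer> bag(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    // Size of the multiset intersection of two bags.
    static int common(Map<String, Integer> a, Map<String, Integer> b) {
        int total = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            total += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return total;
    }

    // Whitespace-separated words.
    static List<String> unigrams(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Adjacent word pairs, which are sensitive to word order.
    static List<String> bigrams(String text) {
        List<String> words = unigrams(text);
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            pairs.add(words.get(i) + " " + words.get(i + 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String reference = "the quick brown fox";
        String scrambled = "brown the fox quick"; // same words, wrong order

        // All four words are shared, so the unigram overlap is 4.
        System.out.println(common(bag(unigrams(reference)), bag(unigrams(scrambled))));
        // No adjacent pair survives the reordering, so the bigram overlap is 0.
        System.out.println(common(bag(bigrams(reference)), bag(bigrams(scrambled))));
    }
}
```

A unigram metric scores the scrambled text as a perfect match, so any gain
from stitching runs back into the right order would only show up in an
order-sensitive measure such as the bigram count.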

Given the cost in processing time, I'm slightly inclined not to change our
default "false" for Tika 2.0.0.  If anyone disagrees, please open an issue.

Cheers,

              Tim

Re: PDFs and detectAngles

Posted by Tim Allison <ta...@apache.org>.

In follow-up runs, it looks closer to a 2x increase in time (41 seconds vs 85).

For kicks, I also experimented with:
a) sort by position and detect angles (91 seconds)
b) sort by position, detect angles and suppress duplicate text (105 seconds)
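
For reference, the follow-up timings can be turned into slowdown factors
against the 41-second baseline. This is only a sketch of the arithmetic; the
class and method names are made up, and the numbers are the wallclock
seconds quoted in this message:

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
public class Slowdown {

    // How many times longer a run took than the baseline.
    static double factor(double baselineSeconds, double seconds) {
        return seconds / baselineSeconds;
    }

    public static void main(String[] args) {
        double baseline = 41.0; // default settings, from the message above
        String[] labels = {
            "detect angles",
            "sort by position + detect angles",
            "sort by position + detect angles + suppress duplicates"
        };
        double[] seconds = {85.0, 91.0, 105.0};

        // Prints roughly 2.07x, 2.22x and 2.56x.
        for (int i = 0; i < labels.length; i++) {
            System.out.printf(Locale.ROOT, "%-55s %.2fx%n",
                    labels[i], factor(baseline, seconds[i]));
        }
    }
}
```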

There were regressions in text extraction with sort by position, even with
suppress duplicate text.  So, on some files there's an improvement; on
others, the text is worse. :(

So, unless there's a strong feeling or a way to improve the speed of detect
angles, let's leave default settings as they are for now.

Cheers,

       Tim
