Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2018/11/26 20:49:55 UTC

Comparing extracted text with pdftotext

All,

  I just finished drafting a high-level "lab report" comparing
pdftotext and Tika/PDFBox on the PDFs in our refreshed regression
corpus: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811.
The more interesting bits are in the actual reports from tika-eval
and/or the comparison database available here:
http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/

  Let me know what you think.

          Cheers,

                   Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: Comparing extracted text with pdftotext

Posted by Tilman Hausherr <TH...@t-online.de>.
commoncrawl3/IK/IKIMQ2USV4HEF2NF4K7UNMZZFADCKVWP
The part missing from the PDFBox output is diagonal text.

commoncrawl3/2L/2LBSIRE27J5TTKH53KR6PEZH6QKJ3BZ7
The missing words are the ones split with a "-" at the end of the line.
Interesting feature.
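
For illustration, joining words that are split with a "-" at a line end
could look roughly like the following. This is a minimal sketch in Python,
not taken from pdftotext or PDFBox; the name dehyphenate is hypothetical,
and it assumes the hyphen is a plain "-" directly before the line break.

```python
import re

# Hypothetical helper for illustration only; not part of PDFBox or pdftotext.
# Matches: a word character, "-", a newline, optional indentation, a word character.
_LINE_END_HYPHEN = re.compile(r"(\w)-\n[ \t]*(\w)")


def dehyphenate(text: str) -> str:
    """Join a word split with "-" at the end of a line.

    Line breaks that are not preceded by a hyphenated word are left alone.
    """
    return _LINE_END_HYPHEN.sub(r"\1\2", text)
```

A naive rule like this also merges genuine compounds that happen to break
at the hyphen, so a real implementation would need a dictionary check or a
similar heuristic.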

Tilman

On 26.11.2018 at 21:49, Tim Allison wrote:
> All,
>
>    I just finished drafting a high level "lab report" comparing
> pdftotext and Tika/PDFBox on the PDFs in our refreshed regression
> corpus: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811.
> The more interesting bits are in the actual reports from tika-eval
> and/or the comparison database available here:
> http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/
>
>    Let me know what you think.
>
>            Cheers,
>
>                     Tim
