Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2018/11/26 20:49:55 UTC
Comparing extracted text with pdftotext
All,
I just finished drafting a high-level "lab report" comparing
pdftotext and Tika/PDFBox on the PDFs in our refreshed regression
corpus: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811.
The more interesting bits are in the actual reports from tika-eval
and/or the comparison database available here:
http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/
Let me know what you think.
Cheers,
Tim
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
Re: Comparing extracted text with pdftotext
Posted by Tilman Hausherr <TH...@t-online.de>.
commoncrawl3/IK/IKIMQ2USV4HEF2NF4K7UNMZZFADCKVWP
The part missing from the PDFBox output is diagonal text.
commoncrawl3/2L/2LBSIRE27J5TTKH53KR6PEZH6QKJ3BZ7
The missing words are the ones split with a "-" at the end of the line.
Interesting feature.
Tilman
On 26.11.2018 at 21:49, Tim Allison wrote:
> All,
>
> I just finished drafting a high level "lab report" comparing
> pdftotext and Tika/PDFBox on the PDFs in our refreshed regression
> corpus: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811.
> The more interesting bits are in the actual reports from tika-eval
> and/or the comparison database available here:
> http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/
>
> Let me know what you think.
>
> Cheers,
>
> Tim