Posted to users@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2017/07/15 11:22:02 UTC

fwd: A Benchmark and Evaluation for Text Extraction from PDF

http://ad-publications.informatik.uni-freiburg.de/benchmark.pdf

A Benchmark and Evaluation for Text Extraction from PDF

PDFBox is the best in 4 categories, the worst in one (missing newlines), 
and near the top in one (lack of errors). I have asked the authors to 
name me some of the files re: missing newlines, and the two error files.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: fwd: A Benchmark and Evaluation for Text Extraction from PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
I have received the two ERR files and one NL- file from Claudius 
Korzen. I have uploaded them at
http://www.filedropper.com/pdfboxerror1
http://www.filedropper.com/pdfboxerror2
http://www.filedropper.com/examplenlminus

About the ERR files: these are indeed bad. The content streams are 
malformed, probably because of a bug in the creator software.

About the NL- file: he wrote that per their test design decision, the 
two formulas in "Lemma 3.3" on page 13 should be different paragraphs 
because they have different semantic roles than the body text.

I'm neutral about this... IMHO extracting formulas from a PDF is useless 
because one will never get an exact copy, due to their two-dimensionality 
(subscripts and superscripts).

Tilman




Re: fwd: A Benchmark and Evaluation for Text Extraction from PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
> - they said iText is similar to PDFBox (page 7 upper right) ;-)

I noticed that one too and mentioned it in my mail to them.

Another thing I just saw today:

pdf2xml, which was said to be based on Apache Tika, is also based on 
PDFBox 1.1.0:
https://bitbucket.org/tiedemann/pdf2xml/src/65b534eb6f10d2251185065bedc3ee7416bc5831/share/lib/pdfxtk/?at=master
and tika-app 1.3
https://bitbucket.org/tiedemann/pdf2xml/src/65b534eb6f10d2251185065bedc3ee7416bc5831/share/lib/?at=master

Tilman





Re: fwd: A Benchmark and Evaluation for Text Extraction from PDF

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 15.07.2017 at 13:22, Tilman Hausherr wrote:
> http://ad-publications.informatik.uni-freiburg.de/benchmark.pdf
> 
> A Benchmark and Evaluation for Text Extraction from PDF
Interesting, some details I've already found:

- they used 2.0.3
- they said iText is similar to PDFBox (page 7 upper right) ;-)

Andreas


