Posted to users@pdfbox.apache.org by Christopher Mason <cm...@thetus.com> on 2010/04/21 02:13:50 UTC

Rendering Glitches

I'm investigating libraries for rendering and extracting text from PDF.
Across the half dozen I've looked at, both commercial and open source,
I think pdfbox is the cleanest.

However, I've run across a number of pdfs that pdfbox does not render 
properly.  One I'm particularly concerned about is:

http://www.cmason.com/tmp/Sowa.pdf

It appears to have encoding or char -> glyph issues in pdfbox, but looks
okay in every other reader/library I've tried.  I've tried with both 
pdfbox-1.1.0 and with the trunk.  Here's how it looks in pdfbox trunk 
versus Preview:

http://www.cmason.com/tmp/Sowa.png

Any help or suggestions would be most appreciated.

-c



java -cp ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:pdfbox-1.1.0.jar:fontbox-1.1.0.jar org.apache.pdfbox.PDFToImage -color rgba -startPage 1 -endPage 1 -resolution 100 -imageType png -outputPrefix Sowa ~/Sites/docs/Sowa.pdf



Re: Rendering Glitches

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

Christopher Mason wrote:
> 
> I'm investigating libraries for rendering and extracting text from PDF. 
>  Across the half dozen I've looked at, both commercial and open source, 
> I think pdfbox is the cleanest.
Oh, interesting. :-)

> However, I've run across a number of pdfs that pdfbox does not render 
> properly.  One I'm particularly concerned about is:
> 
> http://www.cmason.com/tmp/Sowa.pdf
> 
> It looks to have encoding or char -> glyph issues in pdfbox, but look 
> okay in every other reader/library I've tried.  I've tried with both 
> pdfbox-1.1.0 and with the trunk.  Here's how it looks in pdfbox trunk 
> versus Preview:
> 
> http://www.cmason.com/tmp/Sowa.png
> 
> Any help or suggestions would be most appreciated.
I've had a quick look at the PDF. It uses embedded subsets of TrueType fonts,
which is a known problem; see PDFBOX-490 [1] for further details.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-490

Re: Rendering Glitches

Posted by Thomas Fischer <fi...@aon.at>.
Hi Christopher,

On 21.04.2010 at 02:13, Christopher Mason wrote:

> 
> I'm investigating libraries for rendering and extracting text from PDF.  Across the half dozen I've looked at, both commercial and open source, I think pdfbox is the cleanest.

I agree. I've just extracted text from around 40,000 mathematical PDF files, and in my experience pdfbox is the best tool.
There are also a few exceptions…

> However, I've run across a number of pdfs that pdfbox does not render properly.  One I'm particularly concerned about is:
> 
> http://www.cmason.com/tmp/Sowa.pdf

I only extract text and am not concerned with rendering.
As far as I can see, the text I get is as good as it can be, so I don't think there are problems with fonts and/or glyphs; see the excerpt below.
But I have to agree that org.apache.pdfbox.PDFToImage doesn't give me anything useful either (actually one very long image consisting of all the pages of the document, with errors like those in the image mentioned).

Cheers
Thomas


> 
> It looks to have encoding or char -> glyph issues in pdfbox, but look okay in every other reader/library I've tried.  I've tried with both pdfbox-1.1.0 and with the trunk.  Here's how it looks in pdfbox trunk versus Preview:
> 
> http://www.cmason.com/tmp/Sowa.png
> 
> Any help or suggestions would be most appreciated.
> 
> -c
> 
> 
> 
> java -cp ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:pdfbox-1.1.0.jar:fontbox-1.1.0.jar org.apache.pdfbox.PDFToImage -color rgba -startPage 1 -endPage 1 -resolution 100 -imageType png -outputPrefix Sowa ~/Sites/docs/Sowa.pdf
> 
> 

Beginning of text:

The Challenge 
Of Knowledge Soup
John F. Sowa 
26 August 2004 
PerMIS 2004 Workshop at NIST 
Gaithersburg, Maryland 
Outline of This Talk
1. Thesis: 
Support interoperability among heterogeneous systems 
by defining all concepts precisely and unambiguously. 
2. Antithesis: 
"There are more things in heaven and earth, Horatio, 
Than are dreamt of in your philosophy." 
William Shakespeare 
3. Synthesis: 
Develop more flexible methods of knowledge acquisition 
by simulating the human cognitive cycle. 
Aristotle's Syllogisms
System of logic based on four sentence patterns: 
1. Universal affirmative.  Every employee is human. 
2. Particular affirmative.  Some employees are customers. 
3. Universal negative.  No employee is a competitor. 
4. Particular negative.  Some customers are not employees. 
Affirmative patterns for stating inheritance. 
Negative patterns for stating constraints. 
Description logics are based on Aristotle's syllogisms. 
Tree of Porphyry