
Posted to users@pdfbox.apache.org by Robert Main <RM...@gmx.de> on 2010/03/05 12:07:57 UTC

Wrong symbols printed

Hello. I want to extract text from PDF pages and write it into a new PDF, but some characters come out mixed up.
I extract the text via the TextPosition objects, which hold the actual text strings, font, position, etc.

This is the relevant code that I use to write the text into the page
(contentStream is a PDPageContentStream, te is a TextPosition,
page is a PDPage):

    contentStream.setFont(te.getFont(), te.getFontSizeInPt());
    contentStream.setTextMatrix(1, 0, 0, 1, te.getXDirAdj(), page.getArtBox().getHeight() - te.getYDirAdj());
    contentStream.drawString(te.getCharacter());

It works for normal text, but there are problems with mathematical terms; please see the attachments.
out.png shows the page converted using pdftoimage: everything went fine except that the sigma sign is missing. myresult.pdf, on the other hand, has lots of font problems: nearly every special character comes out as the root sign, and where it isn't the root sign, it's some other wrong character.
If you want to take a look at the original PDF, it's
http://www.xs4all.nl/~johanw/math.pdf (page 16).
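
For reference, the y-coordinate arithmetic in the snippet above (box height minus te.getYDirAdj()) can be isolated in a small plain-Java sketch. It assumes that getYDirAdj() reports y measured downward from the top edge of the page, and that the box used starts at the given lower-left y; the class name TopLeftToPdfCoordinates, the method toPdfY and its parameters are hypothetical and not part of PDFBox.

```java
class TopLeftToPdfCoordinates {

    // Convert a y value measured downward from the top edge of a page box
    // into PDF user space, where y grows upward from the bottom.
    // lowerLeftY and height describe the box (hypothetical parameters).
    static float toPdfY(float lowerLeftY, float height, float yFromTop) {
        return lowerLeftY + height - yFromTop;
    }

    public static void main(String[] args) {
        // A box 842 pt high with its lower edge at y = 0:
        // text placed 100 pt below the top edge.
        System.out.println(toPdfY(0f, 842f, 100f)); // prints 742.0
    }
}
```

If the box's lower edge is not at 0, using only its height (as in the snippet above) would shift the text vertically, but it would not change which glyphs are drawn.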

Re: Wrong symbols printed

Posted by Villu Ruusmann <vi...@gmail.com>.

Hello there,

I guess it would be better if you opened a new issue in PDFBox's JIRA
and summarized your findings there.
https://issues.apache.org/jira

Please note that e-mail attachments don't survive in this mailing list.

> Hello. I want to extract text from pages and when I try to write it into a new PDF, some characters are mixed up.
> I extract the text using the TextPosition objects that contain the actual text strings, font, position etc.
>

You're dealing with a PDF document which contains Type1C fonts and has
been generated with pdfTeX-1.10b. This is a rather tricky combination.

> This is the important code that I use to write the text into the page:
> contentStream is a PDPageContentStream, te is a TextPosition,
> page is a PDPage
>
>
> contentStream.setFont(te.getFont(), te.getFontSizeInPt());
> contentStream.setTextMatrix(1, 0, 0, 1, te.getXDirAdj(), page.getArtBox().getHeight()-te.getYDirAdj());
> contentStream.drawString(te.getCharacter());
>

The current Type1C font support has been tested with PDF text
extraction and rendering, but to my knowledge not with PDF generation.
The conversion from Java characters to raw bytes could be misbehaving.
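
To make that suspected failure mode concrete, here is a minimal, self-contained sketch in plain Java (no PDFBox involved). It is not PDFBox code and not a diagnosis of math.pdf: the encoding table, the class name EncodingMismatchDemo and its three methods are made up for illustration. It only shows how a character that was extracted correctly can come back as a different glyph when it is turned into a byte without consulting the font's own encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingMismatchDemo {

    // Hypothetical font encoding (character code -> Unicode); not taken from math.pdf.
    static final Map<Integer, Character> CODE_TO_CHAR = new LinkedHashMap<>();
    static {
        CODE_TO_CHAR.put(0x01, '\u03C3'); // sigma
        CODE_TO_CHAR.put(0x02, '\u03B4'); // delta
        CODE_TO_CHAR.put(0x3F, '\u221A'); // square root, placed at the '?' code on purpose
        CODE_TO_CHAR.put(0x41, 'A');
    }

    // Writing with the font's own encoding: reverse lookup in the table.
    static int encodeWithFont(char c) {
        for (Map.Entry<Integer, Character> e : CODE_TO_CHAR.entrySet()) {
            if (e.getValue() == c) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("no code for U+" + Integer.toHexString(c));
    }

    // Naive writing: treat the Java char as Latin-1.
    // Characters outside Latin-1 are replaced by '?' (0x3F).
    static int encodeNaively(char c) {
        byte[] bytes = String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1);
        return bytes[0] & 0xFF;
    }

    // What the hypothetical font would draw for a given code.
    static char render(int code) {
        Character c = CODE_TO_CHAR.get(code);
        return c == null ? '\uFFFD' : c;
    }

    public static void main(String[] args) {
        for (char c : "A\u03C3\u03B4".toCharArray()) {
            System.out.printf("U+%04X: font code 0x%02X, naive code 0x%02X%n",
                    (int) c, encodeWithFont(c), encodeNaively(c));
        }
    }
}
```

With this made-up table, every character outside Latin-1 collapses to code 0x3F and so renders as the same glyph, while plain ASCII survives. Whether something similar happens inside drawString for Type1C fonts is exactly what would need checking.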

Are you experiencing the same behaviour with PDFBox 0.8.0 (and earlier
versions)?

> It works for normal text, however there are problems with mathematical terms, see the attachment please.
> The out.png has the converted page using pdftoimage; everything went fine except that the sigma sign is missing. myresult.pdf on the other hand has lots of font problems: nearly every special character is the root sign and if it isn't the root sign, it's some other mixed character.

I rendered a couple of pages myself with PDFBox 1.0.1-SNAPSHOT and all
the Greek letters (deltas and sigmas) appeared to be correct. However,
all the parentheses were missing from mathematical expressions.


VR

Re: Wrong symbols printed

Posted by Robert Main <RM...@gmx.de>.

Apparently my webmail didn't send the attachments, so I uploaded them elsewhere.
The png: http://img21.imageshack.us/img21/8867/outu.png
The pdf: http://tiny.cc/yKZWu

