Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/12/10 00:09:14 UTC

[jira] [Comment Edited] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031695#comment-14031695 ] 

John Hewson edited comment on PDFBOX-922 at 12/9/14 11:09 PM:
--------------------------------------------------------------

{quote}
drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly.
{quote}

Not quite: every CID can be up to 16 bits wide, but many (or, for fonts with fewer than 256 glyphs, all) will fit inside 8 bits. The byte width of a string is controlled by -whether or not it starts with a BOM, not which font it uses- the current font's CMap, but is always 16 bits with TTF.
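
To make the byte-width point concrete, here is a minimal sketch in plain Java with no PDFBox dependency. The class and method names (CodeWidthSketch, encodeTwoByte, toHex) are hypothetical, and the layout assumes an Identity-H style CMap in which every code is a big-endian 16-bit value, so even a CID that would fit in 8 bits is written as two bytes:

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only; none of this is PDFBox API. */
final class CodeWidthSketch {

    /**
     * Writes each CID as a big-endian two-byte code.
     * Assumption: the font uses an Identity-H style CMap with fixed 16-bit codes.
     */
    static byte[] encodeTwoByte(int[] cids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cid : cids) {
            if (cid < 0 || cid > 0xFFFF) {
                throw new IllegalArgumentException("CID out of 16-bit range: " + cid);
            }
            out.write(cid >> 8);   // high byte, zero for CIDs below 256
            out.write(cid & 0xFF); // low byte
        }
        return out.toByteArray();
    }

    /** Formats bytes the way a PDF hex string is written, e.g. <0041>. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }
}
```

Under these assumptions, CIDs 0x41 and 0x0394 both occupy two bytes in the output, although the first would fit in one.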

{quote}
Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for "public byte[] encode(String)".
{quote}

drawString() is only valid after setFont() has been called, so the font doesn't need adding to the drawString() API; we can just use the current font. PDFont#encode is a good idea, yes.
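
A rough sketch of that arrangement, written as standalone Java rather than against the real PDFBox classes: SketchFont and SketchContentStream are hypothetical stand-ins for PDFont and PDPageContentStream, and the hex-string output format is only an assumption made to keep the example short.

```java
/** Hypothetical stand-in for a font that knows its own byte encoding. */
interface SketchFont {
    byte[] encode(String text);
}

/** Hypothetical stand-in for a content stream that tracks the current font. */
final class SketchContentStream {
    private final StringBuilder out = new StringBuilder();
    private SketchFont currentFont;

    void setFont(SketchFont font) {
        currentFont = font;
    }

    /** Delegates encoding to the current font, so no font parameter is needed here. */
    void drawString(String text) {
        if (currentFont == null) {
            throw new IllegalStateException("setFont() must be called before drawString()");
        }
        out.append('<');
        for (byte b : currentFont.encode(text)) {
            out.append(String.format("%02X", b & 0xFF));
        }
        out.append("> Tj\n");
    }

    String contents() {
        return out.toString();
    }
}
```

With a font that encodes as ISO-8859-1, drawing "Hi" would append the line `<4869> Tj`. Calling drawString() before setFont() fails fast instead of guessing an encoding.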

{quote}
PDFont needs a clearly specified API which performs java String to font-specific encoding transformation.
{quote}

Yes, as above.

{quote}
Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually does, because everything seems to be called "encode" or "lookup". It seems that encode(byte[], int, int) performs decoding, so it should be renamed as such.
{quote}

Yes, I don't know if anybody knows what those methods are actually doing, including the original author.

{quote}
In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte sequences required by that font and java Strings, and they handle full runs of text, not just single characters. They will then use single- or multibyte encodings as the font requires without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine.
{quote}

processEncodedText() is indeed crazy and needs fixing, but what you propose won't work because the 16-bit string encoding is not set by the font; it's set on a per-string basis by having that string start with a BOM.

{quote}
There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These are to be solved on a case-by-case basis in the PDFont hierarchy. The encode and decode methods of the top-level PDFont class should be declared abstract to reflect the fact that encoding depends on the particular subtype of the font.
{quote}

Yes, though as far as decoding the correct text is concerned all you have to do is make sure that the ToUnicode map is built correctly - you can put any old garbage in the actual strings (and many PDFs do).
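
As a rough illustration of why the ToUnicode map is what matters for extraction, the sketch below decodes arbitrary two-byte codes through a lookup table. ToUnicodeSketch and its methods are hypothetical names, and fixed big-endian two-byte codes are an assumption; a real ToUnicode CMap can use other code lengths.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of ToUnicode-style decoding; not PDFBox API. */
final class ToUnicodeSketch {
    private final Map<Integer, String> toUnicode = new HashMap<>();

    /** Registers the Unicode text that a character code stands for. */
    void map(int code, String unicode) {
        toUnicode.put(code, unicode);
    }

    /**
     * Decodes big-endian two-byte codes (assumed fixed width).
     * Codes without a mapping become U+FFFD.
     */
    String decode(byte[] codes) {
        if (codes.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codes.length; i += 2) {
            int code = ((codes[i] & 0xFF) << 8) | (codes[i + 1] & 0xFF);
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

The codes themselves carry no meaning here: if 0x0001 maps to "H" and 0x0002 to "i", the bytes 00 01 00 02 decode to "Hi" even though they look like garbage as text.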

{quote}
It may be that for some of these fonts the implementation is the same because the actual mechanics can be handled by varying the Encoding instance, though.
{quote}

Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. Type1C) only.


> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making it impossible to create PDFs in any language apart from English and those supported by WinAnsiEncoding. This happens because the method PDTrueTypeFont.loadTTF hardcodes WinAnsiEncoding, and no Identity-H or Identity-V Encoding classes are provided (to set afterwards via PDFont.setFont() ).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> {code}
> PDDocument doc = null;
> try {
>     doc = new PDDocument();
>     PDPage page = new PDPage();
>     doc.addPage(page);
>     // extract fonts for fields
>     byte[] arialNorm = extractFont("arial.ttf");
>     //byte[] arialBold = extractFont("arialbd.ttf");
>     //PDFont font = PDType1Font.HELVETICA;
>     PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>     PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>     contentStream.beginText();
>     contentStream.setFont(font, 12);
>     contentStream.moveTextPositionByAmount(100, 700);
>     // text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
>     contentStream.drawString("Hello world from PDFBox ελληνικά");
>     contentStream.endText();
>     contentStream.close();
>     doc.save("pdfbox.pdf");
>     System.out.println(" created!");
> } catch (Exception ioe) {
>     ioe.printStackTrace();
> } finally {
>     if (doc != null) {
>         // a failure while closing is deliberately ignored
>         try { doc.close(); } catch (Exception e) {}
>     }
> }
> {code}
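
The limitation described in the report can be approximated without PDFBox. The sketch below uses the JDK's windows-1252 charset as a stand-in for WinAnsiEncoding. That is an assumption: the two are closely related but not guaranteed identical, and the charset's availability depends on the JRE. WinAnsiCoverageSketch and unencodable are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical check of which characters a WinAnsi-like encoding cannot represent. */
final class WinAnsiCoverageSketch {

    /**
     * Returns the characters of the input that windows-1252 cannot encode.
     * Assumption: windows-1252 approximates PDF's WinAnsiEncoding closely enough for this check.
     */
    static String unencodable(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        StringBuilder missing = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String single = new String(Character.toChars(cp));
            if (!encoder.canEncode(single)) {
                missing.appendCodePoint(cp);
            }
        });
        return missing.toString();
    }
}
```

For the string used in the test case above, this check reports the Greek letters as unencodable while the ASCII part passes, which is consistent with the garbled output described.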


