Posted to dev@pdfbox.apache.org by "Antti Lankila (JIRA)" <ji...@apache.org> on 2014/06/03 16:04:05 UTC

[jira] [Updated] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

     [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antti Lankila updated PDFBOX-922:
---------------------------------

    Attachment: pdfbox-unicode.diff

OK. I've got something that probably qualifies as the worst possible implementation of Unicode text writing in a PDF generation library in the entire history of mankind. Consider this an early preview.

All that matters is that I did see this pile of garbage spit out Unicode text when used with a TTF font that has a Windows-platform Unicode cmap table.

To use it:

- PDType0Font.loadTTF() is a new method that generates a Type0 font with a CIDFontType2 hanging from it. The old PDTrueTypeFont.loadTTF() is still usable, but you won't get Unicode text capabilities.

- PDPageContentStream has a new method, drawUnicodeString(), which must be used when drawing text using a CID font. This writes the required 16-bit strings into the document.

It turns out that whenever a CID font is used, all text strings meant to be printed will be read as 16-bit big-endian values. So there's no point in messing with PDFDocEncoding or UTF-16BE COSString or any of that stuff -- drawing strings on a page is a fundamentally special operation which depends entirely on the font being used.
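
To make the 16-bit point concrete, here is a minimal standalone sketch (plain Java, no PDFBox calls) of what such a string looks like in a content stream. The class name, the toHexString() helper and the toCode mapping are all made up for illustration; they are not part of the attached patch. With Identity-H each 2-byte code is a CID rather than a Unicode value, so a real implementation would look the code up through the font. The identity mapping in main() is only a placeholder for that lookup.

```java
import java.util.function.IntUnaryOperator;

public class CidHexStringSketch {

    /**
     * Encodes text as a PDF hex string of 16-bit big-endian codes.
     * toCode is a hypothetical hook that maps a Unicode code point to the
     * 16-bit code the font expects (for Identity-H that would be the CID).
     */
    public static String toHexString(String text, IntUnaryOperator toCode) {
        StringBuilder sb = new StringBuilder("<");
        text.codePoints().forEach(cp -> {
            int code = toCode.applyAsInt(cp);
            if (code < 0 || code > 0xFFFF) {
                throw new IllegalArgumentException("code does not fit in 16 bits: " + code);
            }
            // Four hex digits, high byte first, i.e. big-endian.
            sb.append(String.format("%04X", code));
        });
        return sb.append(">").toString();
    }

    public static void main(String[] args) {
        // Placeholder mapping: uses the code point itself as the code.
        // A real font lookup (cmap table -> glyph/CID) would go here.
        IntUnaryOperator placeholder = cp -> cp;
        // "\u03b5\u03bb" is Greek epsilon followed by lambda.
        System.out.println(toHexString("\u03b5\u03bb", placeholder) + " Tj");
        // prints: <03B503BB> Tj
    }
}
```

Every character costs exactly two bytes, including plain ASCII: "A" comes out as <0041>, not <41>.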

IMHO, the PDPageContentStream drawString should always be provided with the font that is currently being used for drawing so it could ask that font for instructions on how to correctly express the various glyphs.
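
A rough sketch of that idea, again in plain Java with no PDFBox classes: the drawing code asks whichever font is current for the string bytes instead of assuming one byte per character. TextEncoder, SingleByteFont, TwoByteFont and showText() are hypothetical names, not existing PDFBox API and not what the patch does. ISO-8859-1 stands in for WinAnsiEncoding, and the raw char value stands in for a real CID lookup.

```java
import java.nio.charset.StandardCharsets;

public class FontDrivenEncodingSketch {

    /** Hypothetical interface: the font decides how text becomes string bytes. */
    public interface TextEncoder {
        byte[] encode(String text);
    }

    /** Stand-in for a simple font: one byte per character. */
    public static final class SingleByteFont implements TextEncoder {
        @Override
        public byte[] encode(String text) {
            // ISO-8859-1 is only a placeholder for WinAnsiEncoding here.
            return text.getBytes(StandardCharsets.ISO_8859_1);
        }
    }

    /** Stand-in for a CID font: two bytes per code, big-endian. */
    public static final class TwoByteFont implements TextEncoder {
        @Override
        public byte[] encode(String text) {
            byte[] out = new byte[text.length() * 2];
            for (int i = 0; i < text.length(); i++) {
                char code = text.charAt(i); // placeholder: a real font would map to a CID
                out[2 * i] = (byte) (code >> 8); // high byte first
                out[2 * i + 1] = (byte) code;    // then low byte
            }
            return out;
        }
    }

    /** Hypothetical drawString: asks the current font for bytes, emits a hex string plus Tj. */
    public static String showText(TextEncoder currentFont, String text) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : currentFont.encode(text)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append("> Tj").toString();
    }

    public static void main(String[] args) {
        System.out.println(showText(new SingleByteFont(), "Hi")); // prints: <4869> Tj
        System.out.println(showText(new TwoByteFont(), "Hi"));    // prints: <00480069> Tj
    }
}
```

The caller never needs to know which kind of font is active, so a separate drawUnicodeString() would not be needed under this design.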

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making it impossible to create PDFs in any language other than English and those covered by WinAnsiEncoding. This happens because PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded, and no Identity-H or Identity-V Encoding classes are provided (to set afterwards via PDFont.setFont()).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> PDDocument doc = null;
> try {
>     doc = new PDDocument();
>     PDPage page = new PDPage();
>     doc.addPage(page);
>     // extractFont() is the reporter's own helper (not shown); it returns the font file's bytes
>     byte[] arialNorm = extractFont("arial.ttf");
>     //byte[] arialBold = extractFont("arialbd.ttf");
>     //PDFont font = PDType1Font.HELVETICA;
>     PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>     PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>     contentStream.beginText();
>     contentStream.setFont(font, 12);
>     contentStream.moveTextPositionByAmount(100, 700);
>     // text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
>     contentStream.drawString("Hello world from PDFBox ελληνικά");
>     contentStream.endText();
>     contentStream.close();
>     doc.save("pdfbox.pdf");
>     System.out.println(" created!");
> } catch (Exception ioe) {
>     ioe.printStackTrace();
> } finally {
>     if (doc != null) {
>         // ignore any failure while closing
>         try { doc.close(); } catch (Exception e) {}
>     }
> }



--
This message was sent by Atlassian JIRA
(v6.2#6252)