Posted to users@pdfbox.apache.org by Gunnar Brand <Gu...@interface-projects.de> on 2021/04/07 21:51:14 UTC
Re: Empty cmap in TTF Files.

Hi Constantine

I worked on it a bit.
In the end I don't render the HOCR directly but simply transfer the Tesseract PDF data over to the PDF.
For this I clone the glyphless font (which is embedded as a resource in the jar) and add it to the target page; then, while transferring the PDF data, I change the font commands to the new font name on the fly. I also add a lot of marked content, since I have the info from the HOCR file (something Tesseract should have done).

I still tried to get the HOCR rendering working, which is a big PITA once you enter the realm of sloped or rotated text. Basically all HOCR does is draw a rectangle around words and lines and give you a single pointer to the baseline of the first word. So you later need to calculate intersections of the baseline with word rectangles, etc. I probably spent more time on it than I should have, but I wanted to get it working.

When doing the PDF transfer I found a way to mix PDPageContentStream with raw commands in a nicer way, and this works fine with the "broken" glyphless font, too.
All you do is set the font size, set the text matrix or use newLineAtOffset, set the horizontal scaling (the text zoom), and then write a raw COSString and a raw Tj operator.
There are issues with RTL text I haven't looked at yet (you probably need to reverse the order as Tesseract does), and vertical text is another thing that probably doesn't work.
I will dump the code right here. Please note that you don't really need to create a form XObject (PDFormXObject); you can use the page directly and append the text data.
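
To make the "raw string plus Tj" step a bit more concrete, here is a minimal sketch that uses only the JDK (no PDFBox). The class and method names are made up for illustration, and the half-em glyph advance is an assumption about the glyphless font, not something taken from its data. It builds the two pieces that matter: the string operand as UTF-16BE without a byte order mark, and the horizontal scaling (Tz) that stretches a run of fixed-width glyphs to the width of the word box.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class RawShowTextSketch {

    /** Encodes text as UTF-16BE (no byte order mark) and returns it as a PDF hex string. */
    static String hexOperand(String text) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    /**
     * Tz value (in percent) that stretches the text to targetWidth.
     * advanceEm is the assumed fixed glyph advance as a fraction of the font size.
     */
    static float horizontalScaling(float targetWidth, float fontSize, String text, float advanceEm) {
        int glyphs = text.codePointCount(0, text.length());
        return 100f * targetWidth / (glyphs * fontSize * advanceEm);
    }

    /** Raw operators for one word, assuming a glyph advance of half the font size. */
    static String showWord(String text, float targetWidth, float fontSize) {
        float tz = horizontalScaling(targetWidth, fontSize, text, 0.5f);
        return String.format(Locale.ROOT, "%.2f Tz %s Tj", tz, hexOperand(text));
    }

    public static void main(String[] args) {
        // prints "200.00 Tz <004800690020> Tj"
        System.out.println(showWord("Hi ", 30f, 10f));
    }
}
```

Note that StandardCharsets.UTF_16 (without the BE suffix) would prepend the bytes FE FF when encoding, and with a two-byte identity encoding that would be shown as an extra character code.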

The HOCR classes I am using are a hastily written HOCR parser for the Tesseract HOCR files. All it does is parse block, paragraph, line and word XML into Java classes and transfer coordinates into the PDF coordinate system (bottom to top). It does some more advanced math for the baseline and rotation handling, especially to get the start coordinates for lines and words (slope only makes sense for unrotated text, though). If all you have is text at 90 degree angles, it is fairly easy to calculate these.
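
The parser itself is not included, but the coordinate flip and the baseline start can be sketched roughly like this. It is JDK only, all names are hypothetical, and it assumes that the HOCR "baseline slope offset" pair is relative to the bottom-left corner of the line bbox and given in image coordinates (y pointing down), which is how I read the Tesseract output. Units stay in image pixels; scaling to the page happens elsewhere.

```java
import java.awt.geom.Point2D;

final class HocrGeometrySketch {

    /** Flips an image y coordinate (origin top-left, y down) to a bottom-up y coordinate. */
    static double flipY(double imageHeight, double imageY) {
        return imageHeight - imageY;
    }

    /**
     * Start point of a line's baseline in bottom-up coordinates.
     * left and bottom are the line bbox edges in image coordinates,
     * offset is the constant term of the HOCR baseline (assumed y-down).
     */
    static Point2D.Double baselineStart(double imageHeight, double left, double bottom, double offset) {
        return new Point2D.Double(left, flipY(imageHeight, bottom + offset));
    }

    /** Rotation in degrees for the text matrix; the sign flips together with the y axis. */
    static double rotationDegrees(double slope) {
        return -Math.toDegrees(Math.atan(slope));
    }
}
```

For example, a line with bbox bottom 100 and baseline offset -18 on a 1000 pixel high image starts at y = 1000 - 82 = 918 in bottom-up coordinates.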

I added some marked content code (excessively, with confidence properties on word level) that might make text extractors happy later on. It also requires the raw method, since the other stream does not do inline dictionaries at all. (There is an inline dictionary property "ActualText" that can override the whole content of the glyphs inside; maybe worth trying out, too.)
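
For the ActualText idea, the raw operators would look roughly like what the following sketch produces. I have not tried this; it is JDK only, and the class and method names are invented for illustration. The point it shows is that ActualText is a PDF text string, so in contrast to the Tj operand it does carry the FE FF byte order mark when written as UTF-16BE.

```java
import java.nio.charset.StandardCharsets;

final class MarkedContentSketch {

    /** Upper-case hex digits for the given bytes. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** PDF text string as a hex string: UTF-16BE with a leading byte order mark. */
    static String textString(String s) {
        return "<FEFF" + hex(s.getBytes(StandardCharsets.UTF_16BE)) + ">";
    }

    /** Wraps already written text operators in a Span with an inline ActualText property. */
    static String spanWithActualText(String actualText, String innerOperators) {
        return "/Span <</ActualText " + textString(actualText) + " >> BDC\n"
                + innerOperators + "\nEMC";
    }

    public static void main(String[] args) {
        System.out.println(spanWithActualText("Hi", "<00480069> Tj"));
    }
}
```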

http://kba.cloud/hocr-spec/1.2/

Gunnar

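// Imports for PDFBox 2.0.x. The Hocr* classes (HocrPage, HocrBlock, HocrParagraph,
// HocrLine, HocrWord) are my own parser classes and are not shown here;
// bbox is assumed to be a java.awt.Rectangle.
import java.awt.Rectangle;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.multipdf.PDFCloneUtility;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import org.apache.pdfbox.util.Matrix;

class PageMerger
{
	private final static Charset UTF16BE = StandardCharsets.UTF_16BE;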
	
	private final static Operator BDC = Operator.getOperator(OperatorName.BEGIN_MARKED_CONTENT_SEQ);
	private final static COSName DIV = COSName.getPDFName("Div");
	private final static COSName P = COSName.getPDFName("P");
	private final static COSName SPAN = COSName.getPDFName("Span");
	// might be a good idea to think about this, see PDF spec
	private final static COSName REVERSED_CHARS = COSName.getPDFName("ReversedChars");

	private final static COSName   LANG = COSName.getPDFName("Lang");
	private final static COSName   WRITING_MODE = COSName.getPDFName("WritingMode");
	private final static COSString LTR = new COSString("LrTb");
	private final static COSString RTL = new COSString("RlTb");
	private final static COSName   CONFIDENCE = COSName.getPDFName("X_Confidence");

	private final PDDocument _doc;

	private PDFont _glyphless;

	public PageMerger(PDDocument doc) throws IOException
	{
		_doc = doc;
	}


	public void addHocrText(PDPage page, HocrPage hocr) throws IOException
	{
		final PDRectangle pageBox = page.getCropBox(), formBox = rectangle(hocr.bbox);

		PDFormXObject form = new PDFormXObject(_doc);
		form.setBBox(formBox);
		form.setResources(new PDResources());
		final COSName fname = addGlyphless(form.getResources());
		final float glyphScale = 100f * _glyphless.getBoundingBox().getHeight() / _glyphless.getBoundingBox().getWidth();

		PDPageContentStream cs = new PDPageContentStream(_doc, form, form.getContentStream().createOutputStream(COSName.FLATE_DECODE));
		ContentStreamWriter csw = new ContentStreamWriter(new PDPageContentStreamAdapter(cs));
		
		for ( HocrBlock block : hocr.blocks ) {
			cs.beginMarkedContent(DIV);
			cs.beginText();
			cs.setRenderingMode(RenderingMode.NEITHER);
			float size = -1f;
			for ( HocrParagraph para : block.paragraphs ) {
				if ( para.lang==null && para.dir==null ) {
					cs.beginMarkedContent(P);
				} else {
					COSDictionary dict = new COSDictionary();
					dict.setDirect(true);
					if ( para.lang!=null ) dict.setString(LANG, para.lang);
					// compare string contents, not references
					if ( para.dir!=null ) dict.setItem(WRITING_MODE, "ltr".equals(para.dir) ? LTR : RTL);
					csw.writeTokens(P, dict, BDC);
				}

				for ( HocrLine line : para.lines ) {
					cs.beginMarkedContent(SPAN);
					Point2D p = line.getStart();
					cs.setTextMatrix(Matrix.getRotateInstance(Math.toRadians(line.getRotation()), (float)p.getX(), (float)p.getY()));
					if ( line.size!=size ) cs.setFont(_glyphless, size = line.size);
					HocrWord last = null;
					for ( Iterator<HocrWord> lit = line.words.iterator(); lit.hasNext(); ) {
						HocrWord word = lit.next();
						
						COSDictionary dict = new COSDictionary();
						dict.setDirect(true);
						dict.setFloat(CONFIDENCE, word.confidence);
						if ( word.lang!=null ) dict.setString(LANG, word.lang);
						// compare string contents, not references
						if ( word.dir!=null ) dict.setItem(WRITING_MODE, "ltr".equals(word.dir) ? LTR : RTL);
						csw.writeTokens(SPAN, dict, BDC);
						
						// preferable but less reliable (produces the same error as tesseract does) :( 
						if ( last!=null ) cs.newLineAtOffset(word.getDistance() - last.getDistance(), 0);
						last = word;
						
						// overkill but places words at the right position
//						p = word.getStart();
//						cs.setTextMatrix(Matrix.getRotateInstance(Math.toRadians(line.getRotation()), (float)p.getX(), (float)p.getY()));
						
						String text = word.text;
						if ( lit.hasNext() && !text.isBlank() ) text += " "; 
						float zoom = glyphScale * word.getWidth() / (line.size * text.codePointCount(0, text.length()));
						cs.setHorizontalScaling(zoom);
						// COSString constructor with java string argument adds a BOM in the beginning, that ain't good. 
						csw.writeToken(new COSString(text.getBytes(UTF16BE)));
						csw.writeToken(Operator.getOperator(OperatorName.SHOW_TEXT));
						cs.endMarkedContent();
					}
					cs.endMarkedContent();
				}
				cs.endMarkedContent();
			}
			cs.endText();
			cs.endMarkedContent();
		}
		cs.close();

		// Rotation matrix a b c d e f: cos sin -sin cos 0 0
		// x = a*x + c*y + e 
		// y = b*x + d*y + f
		int rotation = page.getRotation();
		final float x1 = pageBox.getLowerLeftX(),  y1 = pageBox.getLowerLeftY();
		final float x2 = pageBox.getUpperRightX(), y2 = pageBox.getUpperRightY();
		final float s = pageBox.getWidth() / ((rotation % 180 == 0)  ? formBox.getWidth() : formBox.getHeight());
		Matrix m = null;
		switch(rotation) {
			case 0:   m = new Matrix( s,  0,  0,  s, x1, y1); break;
			case 180: m = new Matrix(-s,  0,  0, -s, x2, y2); break;
			case 90:  m = new Matrix(0,   s, -s,  0, x2, y1); break;
			case 270: m = new Matrix(0,  -s,  s,  0, x1, y2); break;
		}
		cs = new PDPageContentStream(_doc, page, AppendMode.APPEND, true, true);
		cs.transform(m);
		cs.drawForm(form);
		cs.close();
	}


	private COSName addGlyphless(PDResources target) throws IOException
	{
		if ( _glyphless!=null ) return target.add(_glyphless);
		try (
			InputStream in = PageMerger.class.getResourceAsStream("glyphless.pdf");
			PDDocument template = PDDocument.load(in)
		) {
			PDResources source = template.getPage(0).getResources();
			PDFont font = cloneFont(source, source.getFontNames().iterator().next());
			COSName name = target.add(font);
			_glyphless = target.getFont(name);
			return name;
		}
	}

	private PDFont cloneFont(PDResources source, COSName name) throws IOException
	{
		PDFont f1 = source.getFont(name);
		PDFCloneUtility c = new PDFCloneUtility(_doc);
		return new PDType0Font((COSDictionary)c.cloneForNewDocument(f1.getCOSObject()));
	}

	private static PDRectangle rectangle(Rectangle bbox)
	{
		return new PDRectangle(bbox.x, bbox.y, bbox.width, bbox.height);
	}

	@SuppressWarnings("deprecation")
	private static class PDPageContentStreamAdapter extends OutputStream
	{
		private final PDPageContentStream stream;

		PDPageContentStreamAdapter(PDPageContentStream stream) {
			this.stream = stream;
		}
		
		@Override public void write(int b) throws IOException {
			stream.appendRawCommands(b);
		}
		
		@Override public void write(byte[] b) throws IOException {
			stream.appendRawCommands(b);
		}
		
		@Override public void write(byte[] b, int off, int len) throws IOException {
			stream.appendRawCommands(off==0 && len==b.length ? b : Arrays.copyOfRange(b, off, off + len));
		}
	}
}
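
The placement matrices at the end of addHocrText can be sanity-checked without PDFBox. The sketch below is not part of the class above; it uses java.awt.geom.AffineTransform, whose six-argument constructor takes the values in the same a b c d e f order as the PDF matrix, and all names are made up for illustration. It uses (x2, y2) as the translation for the 180 degree case, since that is what maps the far corner of the form onto the lower left corner of the page box.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class PlacementMatrixSketch {

    /**
     * Placement transform for a page rotation of 0, 90, 180 or 270 degrees.
     * s is the scale factor, (x1, y1) the lower left and (x2, y2) the upper
     * right corner of the page box.
     */
    static AffineTransform placement(int rotation, double s,
                                     double x1, double y1, double x2, double y2) {
        switch (rotation) {
            case 0:   return new AffineTransform( s,  0,  0,  s, x1, y1);
            case 90:  return new AffineTransform( 0,  s, -s,  0, x2, y1);
            case 180: return new AffineTransform(-s,  0,  0, -s, x2, y2);
            case 270: return new AffineTransform( 0, -s,  s,  0, x1, y2);
            default:  throw new IllegalArgumentException("unsupported rotation: " + rotation);
        }
    }

    /** Maps a point in form coordinates to page coordinates. */
    static Point2D map(AffineTransform t, double x, double y) {
        return t.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Page box (10,20)-(210,320), form 400 x 600, so s = 200 / 400 = 0.5.
        AffineTransform t = placement(180, 0.5, 10, 20, 210, 320);
        System.out.println(map(t, 0, 0));     // form origin lands on the upper right corner
        System.out.println(map(t, 400, 600)); // far corner lands on the lower left corner
    }
}
```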


-----Original Message-----
From: Constantine Dokolas <cd...@gmail.com>
Sent: Friday, 26 March 2021 11:38
To: users@pdfbox.apache.org
Subject: Re: Empty cmap in TTF Files.

Hi, Gunnar,

Do you think this SO question
<https://stackoverflow.com/questions/49363954/using-arialmt-for-arabic-text-without-embedding-font-with-pdfbox>
is related? I'm the OP and the (admittedly somewhat niche) case for no-glyph (i.e. non-renderable) chars on a PDF is a "capability" that's been missing for me.

To give some context, at work I'm responsible for a library that, among other things, overlays OCRed text (from diverse sources) on images placed in PDF pages. There have been issues I've overcome (especially concerning Unicode), but "glyphless font" embedding is something that would really make a noticeable impact on PDF size. Most OCR software that produces PDFs from images does this in some way, Tesseract included.

I think PDFBox is a great library for reading and generating PDFs, and I'm seriously considering contributing as soon as possible. A big thanks to everyone working to make this project successful.

C.D.
--
There is a computer disease that anybody who works with computers knows about. It's a very serious disease and it interferes completely with the work. The trouble with computers is that you 'play' with them!
- Richard P. Feynman


On Thu, Mar 25, 2021 at 2:30 PM Gunnar Brand < Gunnar.Brand@interface-projects.de> wrote:

> Hi.
>
> The process is as follows:
> 1) For images: use the image
>     For PDFs: render each page to 300 dpi (since optimized PDFs don't 
> necessarily have a single big image), maybe even with text if text 
> extraction returned gibberish (missing unicode mapping).
> 2) Use tesseract to OCR image/page with PDF and HOCR output. (for pages:
> create an imageless PDF). The HOCR is used for additional page layout 
> information and word confidence values.
> 3) For images, use the HOCR to filter the PDF text stream and add 
> layout information
>     For PDFs, insert the tesseract PDF text stream into the original 
> PDF's page (+add that glyphless font), use the HOCR to filter and add 
> layout information.
>
> For step 3, I would like to use a normal PDPageContentStream to add 
> the content instead of working with a raw stream. But that step fails 
> since I cannot use the showText() method with a Font that has an empty cmap.
>
> I attached an empty tesseract PDF with the glyphless font. Appending 
> text using the font to the single page in there will fail immediately 
> with the exception due to the empty cmap. Adding the font to any other 
> PDF and trying to show text using it will fail as well.
>
> I can probably get away with just creating/transferring the Tj commands 
> raw, but I was wondering if the empty cmap behaviour is ok or would it 
> be better to ignore empty cmaps (i.e. look for a non empty one first 
> and return null if none can be found in TrueTypeFont.getUnicodeCmapImpl).
>
> Gunnar
>
>
>
> -----Original Message-----
> From: Tilman Hausherr <TH...@t-online.de>
> Sent: Thursday, 25 March 2021 04:37
> To: users@pdfbox.apache.org
> Subject: Re: Empty cmap in TTF Files.
>
> On 24.03.2021 at 14:40, Gunnar Brand wrote:
> > Hi.
> >
> > I am working on merging original PDFs and the PDF/HOCR output of
> > Tesseract, as to create a searchable PDF. Transplanting the glyphless
> > font used by tesseract was no problem, it doesn’t matter if I simply
> > use the font in the original PDF or use cloneutil, when saving the
> > file the font is embedded properly.
> >
> > The problem is when I show text using a content stream, I get a “No
> > Glyph for …” exception. I traced this down to the glyphless font
> > containing empty cmap tables. There is a CIDToGIDMap. Coincidentally
> > PDFBOX-5103 just addressed this issue with a reverse mapping if the
> > cmap is null. But the cmap is just empty and will return 0 for any
> > character code, so this new feature will never work in this case.
> >
> > For testing I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so
> > that it ignores empty cmap subtables (even the fallback at the end of
> > the method now being a loop). With this PDFBox will happily use the
> > tesseract glyphless font. Now I lack the knowledge if empty cmaps make
> > any sense at all and if they do I will simply write raw show text
> > commands, but maybe it is something to consider?
> >
> > Gunnar
>
> I tried tesseract some time ago and it generates searchable PDFs out 
> of the box, why not use that?
>
> Can you upload one of your files to a sharehoster so that I understand 
> what this is about?
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org