Posted to users@pdfbox.apache.org by Peter Murray-Rust <pm...@cam.ac.uk> on 2015/03/24 14:21:07 UTC

Interpreting vector and pixel glyphs for characters

On Tue, Mar 24, 2015 at 9:26 AM, Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

> ... As you would like to remove certain vectors which are matching a
> certain character/glyph you first need to find out which are the ones
> drawing e.g. the letter 'T'. I don't think that this is doable in a
> reasonable amount of time for arbitrary text.
>
> Maruan

This is true! And it's unfortunately a common problem with PDFs which use
* outline fonts/glyphs
* pixel glyphs
* scanned text

I think it is possible in limited subdomains and we are starting to try to
do this in science/maths. Our approach (
https://bitbucket.org/petermr/diagramanalyzer,
https://bitbucket.org/petermr/imageanalysis,
https://bitbucket.org/petermr/javaocr) is to create tools that recognize
text in common fonts. Unfortunately there is no obvious choice of OCR
library for Java (we looked at all of them - Tesseract is non-native, i.e.
not pure Java - and have ended up extending javaocr).

Scanned typescript can be a nightmare (missing pixels, bleeding across
glyph boundaries, etc.) but sometimes works.
In our approach we try to analyze born-digital glyphs by heuristics rather
than machine-learning (which needs retraining for every new font/size). The
vector glyphs have a constant SVG signature for each character and this can
sometimes be worked out (or mapped by the crowd). The pixel glyphs are
harder: we shrink them to a common skeleton and classify from that. Once
one character is done it's usually possible to recognize it in later
occurrences.
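
To make the vector-glyph idea concrete, here is a minimal sketch in Java. It
is not the code in the repositories listed above; the class name
GlyphSignature, the signature() method and the sample 'T' outline are all
hypothetical, chosen only for illustration. It assumes each glyph arrives as
an SVG path string that uses only absolute M, L, C and Z commands, and it
makes the path position-independent by expressing every coordinate relative
to the first point. A signature computed once for a known character can then
be looked up when the same outline appears elsewhere on the page.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not taken from diagramanalyzer or javaocr.
public class GlyphSignature {

    // One absolute SVG path command (M, L, C, Z) or one plain decimal number.
    private static final Pattern TOKEN =
            Pattern.compile("[MLCZ]|-?\\d+(?:\\.\\d+)?");

    /**
     * Reduce an absolute-coordinate SVG path to a translation-independent
     * signature: coordinates are taken relative to the first point and
     * rounded to one decimal place. Scale (font size) is NOT normalised.
     */
    static String signature(String svgPath) {
        Matcher m = TOKEN.matcher(svgPath);
        StringJoiner sig = new StringJoiner(" ");
        double x0 = 0, y0 = 0;
        int n = 0; // numbers seen so far; even index = x, odd index = y
        while (m.find()) {
            String t = m.group();
            if (Character.isLetter(t.charAt(0))) {
                sig.add(t); // keep the command letter as-is
                continue;
            }
            double v = Double.parseDouble(t);
            if (n == 0) x0 = v; // first x becomes the origin
            if (n == 1) y0 = v; // first y becomes the origin
            double rel = (n % 2 == 0) ? v - x0 : v - y0;
            double rounded = Math.round(rel * 10) / 10.0;
            sig.add(String.format(Locale.ROOT, "%.1f", rounded));
            n++;
        }
        return sig.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> known = new HashMap<>();

        // A made-up 'T'-shaped outline, identified once by hand.
        String t1 = "M 10 20 L 30 20 L 30 25 L 22 25 L 22 50 "
                  + "L 18 50 L 18 25 L 10 25 Z";
        known.put(signature(t1), 'T');

        // The same outline drawn elsewhere on the page (shifted by 100, 5).
        String t2 = "M 110 25 L 130 25 L 130 30 L 122 30 L 122 55 "
                  + "L 118 55 L 118 30 L 110 30 Z";
        System.out.println(known.get(signature(t2))); // prints T
    }
}
```

This only handles translation. A glyph drawn at a different size, or with
relative path commands, would need further normalisation (for example
dividing by the bounding-box height) before the lookup could match.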

It's early days, but if people are interested in collaborating or have
better solutions we'd be interested to hear from them (we aren't able to
help with casual problems).

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069