Posted to users@pdfbox.apache.org by Peter Murray-Rust <pm...@cam.ac.uk> on 2015/01/27 09:37:15 UTC

Non-unicode characters

thomasjjg@gmail.com wrote...

>>>I have a requirement to read tamil pdf document and store the content in
db. When I read the document using Pdfbox, the characters are junked and
not readable. I suppose this problem to be with fonts used. Can you help me
to resolve this?

It sounds as if you have a font which is not Unicode compliant and probably
undocumented. We encounter a similar problem in scientific documents where
characters with high Unicode points are represented by a variety of
non-standard fonts. In PDF2SVG (http://bitbucket.org/petermr/pdf2svg; which
is built on top of PDFBox) we try to provide debug output and a variety of
kludges to translate to Unicode.

There are several messy ways in which characters are transmitted in
non-Unicode fashion:
* outline glyphs (i.e. the vectors representing the fonts)
* bitmapped glyphs with names or code points

The glyphs are usually supplied with the fonts. They are referenced either
by non-Unicode points or by names.

There is no algorithmic way of solving the problem. If the font is in
common use you *may* be able to find a translation table to Unicode by
searching the web or asking, but in our experience this is uncommon. You
may, if you are lucky, find a table of glyph images mapped onto codepoints
or names.

If you have a large document or many documents it will be necessary to
create a translation table. We do this in PDF2SVG on a heuristic basis -
sometimes the characters have a sequence that maps onto the alphabet or
Unicode, but sometimes it's completely arbitrary. There could be other
horrors - such as different codepoints for a character with different sizes
(Microsoft does this for maths).
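
A minimal sketch of what such a translation table could look like, written
in Python rather than taken from PDF2SVG. The names `LEGACY_TO_UNICODE`,
`translate` and `sequential_table` are hypothetical, and the legacy codes
on the left-hand side are invented for illustration; a real table has to be
built from the actual font.

```python
# Hypothetical table: legacy font code -> Unicode string.
# The legacy codes (0x41, ...) are invented; only the Unicode values are real.
LEGACY_TO_UNICODE = {
    0x41: "\u0B85",  # TAMIL LETTER A
    0x42: "\u0B86",  # TAMIL LETTER AA
    0x43: "\u0B87",  # TAMIL LETTER I
}


def translate(codes, table, missing="\uFFFD"):
    """Translate legacy codes to a string; unmapped codes become U+FFFD."""
    return "".join(table.get(code, missing) for code in codes)


def sequential_table(legacy_start, unicode_start, count):
    """Build a table for the lucky case where a run of legacy codes
    follows the same order as a run of Unicode code points."""
    return {legacy_start + i: chr(unicode_start + i) for i in range(count)}
```

The sequential helper only covers the case where the font's ordering happens
to follow Unicode; arbitrary fonts need every entry written out by hand.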

I know nothing of Tamil - I have read
http://en.wikipedia.org/wiki/Tamil_script - but assuming your legacy font
is systematic, you will need to map these Unicode tables onto your font.
Assuming you don't have a translation table you will have to do this
manually, character by character. Assuming you can reliably recognize Tamil
characters you can visually map the glyphs in the rendered PDF onto
Unicode.
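
One way to organise the manual work is to list the codes that still lack a
Unicode mapping, most frequent first, so the commonest glyphs get mapped
first. This is only a sketch under assumptions: `unmapped_codes` is a
hypothetical helper, and `codes` stands for whatever sequence of legacy
codes you have extracted from the PDF.

```python
from collections import Counter


def unmapped_codes(codes, table):
    """Return (code, count) pairs for codes missing from the table,
    ordered from most to least frequent."""
    counts = Counter(code for code in codes if code not in table)
    return counts.most_common()
```

Each entry in the result is a glyph to look up visually in the rendered PDF
and then add to the table.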

Alternatively you could print the characters to screen and use an Optical
Character Recognition program. We are (slowly) developing this for
mathematics and other symbols, but not for Tamil. You might find that a
good OCR program is the best way forward.

None of this will be huge fun, I am afraid - but the task is finite if
there is only one font.







-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Non-unicode characters

Posted by Jack Bush <ne...@yahoo.com.au>.
Unsubscribe please 
