Posted to users@pdfbox.apache.org by Anton Stoychev <an...@gmail.com> on 2014/03/27 20:00:38 UTC

Pdf with Times New Roman and Cyrillic glyphs = weird characters extracted

So the problematic pdf is this:
http://www.parliament.bg/pub/StenD/iv260712.pdf

The first time I opened it in Adobe Reader the entries in the first column
showed as garbled glyphs like ȺɅȿɄɋȺɇȾɔɊɊɍɆȿɇɈȼɇȿɇɄɈ.

Then I installed the Times New Roman font family on my Fedora machine and
restarted Adobe Reader. This fixed it, and I was able to see correct names like
"АЛЕКСАНДЪР РУМЕНОВ НЕНКОВ".

These are persons' names in Cyrillic.

I'm using PDFBox along with tabula-extractor
(https://github.com/jazzido/tabula-extractor) to extract table data, but it
seems that even with Times New Roman installed on my machine, the names are
still garbled:

ȺɅȿɄɋȺɇȾɔɊȾɂɆɂɌɊɈȼɉȺɍɇɈȼ,740,ɄȻ,-,+,+,0,0,+,+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-
ȺɅȿɄɋȺɇȾɔɊɏɊɂɋɌɈȼɆȿɌɈȾɂȿȼ,917,Ⱦɉɋ,0,0,0,=,=,+,+,-,+,+,-,-,-,-,+,+,+,+,+,+,-,+,+,-,-,-
ȺɅȿɄɋɂȼȺɋɂɅȿȼȺɅȿɄɋɂȿȼ,919,ɄȻ,-,+,+,-,-,+,+,0,0,+,-,-,-,-,+,+,-,+,0,+,-,+,+,-,-,-
ȺɅɂɈɋɆȺɇɂȻɊȺɂɆɂɆȺɆɈȼ,336,Ⱦɉɋ,0,+,+,-,-,+,+,-,+,+,-,-,-,-,+,0,0,0,0,0,0,0,0,0,0,-
ȺɇȾɈɇɉȿɌɊɈȼȺɇȾɈɇɈȼ,856,ȽȿɊȻ,+,=,-,+,+,=,-,+,=,0,0,0,0,0,0,-,+,-,-,-,0,-,-,+,+,0
ȺɇɌɈɇɄɈɇɋɌȺɇɌɂɇɈȼɄɍɌȿȼ,343,ɄȻ,0,0,0,-,-,+,+,-,+,+,0,-,-,-,+,+,-,+,+,+,-,+,0,0,-,-
ȺɇɌɈɇɂɃɃɈɊȾȺɇɈȼɃɈɊȾȺɇɈȼ,604,ȽȿɊȻ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ȺɌȺɇȺɋɁȺɎɂɊɈȼɁȺɎɂɊɈȼ,744,ɄȻ,-,+,+,-,-,+,+,0,+,+,-,-,-,-,+,+,-,+,0,+,-,+,+,-,-,-
ȺɌȺɇȺɋɂȼȺɇɈȼɌȺɒɄɈȼ,857,ȽȿɊȻ,+,=,=,+,+,-,=,0,0,0,0,0,0,0,0,-,+,-,-,0,0,0,0,0,0,0
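
An observation on the output above: every garbled character appears to sit
exactly 0x1D6 code points below the Cyrillic letter it should be (for example
U+023A where U+0410 is expected, going by the name quoted earlier in this
message). Assuming that offset holds for the whole uppercase range, a
post-processing workaround could look like the sketch below. The class name
CyrillicOffsetFix and its fix method are hypothetical and not part of PDFBox
or tabula-extractor, and this does not address the underlying font/encoding
issue in the PDF.

```java
class CyrillicOffsetFix {
    // Offset inferred from the samples in this thread (an assumption, not a
    // documented PDFBox behaviour): U+023A should be U+0410, U+0245 should
    // be U+041B, and so on.
    private static final int OFFSET = 0x0410 - 0x023A; // 0x1D6

    // Garbled range that would map onto uppercase Cyrillic U+0410..U+042F.
    private static final int GARBLED_FIRST = 0x023A;
    private static final int GARBLED_LAST = 0x042F - OFFSET; // U+0259

    // Shift code points in the garbled range; leave everything else
    // (digits, commas, '+', '-', '=') untouched.
    static String fix(String garbled) {
        StringBuilder out = new StringBuilder(garbled.length());
        int i = 0;
        while (i < garbled.length()) {
            int cp = garbled.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= GARBLED_FIRST && cp <= GARBLED_LAST) {
                cp += OFFSET;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First ten characters of the first garbled row, written as escapes
        // so the source file encoding does not matter.
        String sample =
            "\u023A\u0245\u023F\u0244\u024B\u023A\u0247\u023E\u0254\u024A,740";
        System.out.println(fix(sample));
    }
}
```

Only the uppercase range is handled because only uppercase letters show up in
the samples; whether lowercase letters follow the same offset is untested.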

Is this something to do with the glyphlist_ext described at
http://pdfbox.apache.org/cookbook/textextraction.html#external-glyph-list ?

I tried:

PDFont font = PDTrueTypeFont.loadTTF(document, "Times New Roman.ttf");

It didn't do anything.

Am I doing something wrong? How can I fix this?

Best Regards,

Anton

Re: Pdf with Times New Roman and Cyrillic glyphs = weird characters extracted

Posted by Tres Finocchiaro <tr...@gmail.com>.
Not sure if this is relevant but I made a font map which helped with
unknown fonts on *nix systems:

https://github.com/qzindustries/qz-print/blob/master/pdfbox_1.8.4_qz/src/org/apache/pdfbox/resources/FontMapping.properties

Feel free to borrow some for Fedora if it helps!
