Posted to users@pdfbox.apache.org by Natraj Kadur <na...@gmail.com> on 2009/01/21 10:56:01 UTC
Need work around for a problem with processpage
Hi,
I am using PDFBox in one of our applications. I extract the text from
a PDF and generate TOC entries from it. I have run into one problem:
if the PDF contains these two characters, "✠" (&#10016;) and "Ⓔ" (&#9402;),
then processpage(PDPage, COSStream) throws an IOException:
"Unknown encoding for 'UniJIS-UCS2-H'". Is there any way to work
around this problem?
Regards
Natraj
RE: Need work around for a problem with processpage
Posted by Pe...@ibi.com.
This is probably not much help, and it is not a solution either;
it is more me thinking out loud while looking for one.
I should mention that this is not an area I am familiar with, but I thought I would give it a shot; right or wrong, I have learned a little. It appears to me that the characters you mentioned are actually defined in the CMap file org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H
The CMap file is read as a classpath resource. If your class path resolved the CMap file from a different directory, perhaps from an earlier installation
that did not define these characters, that could cause the problem.
I also wondered whether the character map is getting corrupted somehow, but I have no proof of this.
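One way to test the class path theory is to ask the class loader where it finds the CMap resource. The following is only a sketch: the class name CMapLocator and the method findAll are made up for illustration, and the resource path is the one quoted in this thread, so it may differ in other PDFBox versions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class CMapLocator {
    // Resource path as quoted in this thread (assumption: it matches your PDFBox version).
    static final String CMAP_PATH = "org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H";

    /** Returns every location on the class path that provides the given resource. */
    static List<URL> findAll(String resourcePath) {
        try {
            ClassLoader loader = CMapLocator.class.getClassLoader();
            return Collections.list(loader.getResources(resourcePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> copies = findAll(CMAP_PATH);
        if (copies.isEmpty()) {
            System.out.println("No copy of the CMap found on the class path");
        }
        for (URL copy : copies) {
            System.out.println("CMap found at: " + copy);
        }
    }
}
```

If this prints more than one location, an older copy might be shadowing the one you expect.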
Let's start with the hex values of the decimal codes of the two characters, "✠" (10016) and "Ⓔ" (9402):
9402 = x24BA
10016 = x2720
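The conversion above can be double-checked with a few lines of Java. This is a minimal sketch; the class name CodePointHex and the method toHex are illustrative only.

```java
// Hypothetical helper for checking the decimal-to-hex conversion.
public class CodePointHex {
    /** Formats a code point as at least four upper-case hex digits. */
    static String toHex(int codePoint) {
        return String.format("%04X", codePoint);
    }

    public static void main(String[] args) {
        for (int codePoint : new int[] {9402, 10016}) {
            // Character.getName gives the Unicode name of the code point.
            System.out.println(codePoint + " = x" + toHex(codePoint)
                    + " " + Character.getName(codePoint));
        }
    }
}
```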
Below is a link to a description of CMap (character map) files:
http://www.adobe.com/devnet/font/pdfs/5099.CMapFiles.pdf
Here is a link to the ToUnicode Mapping File Tutorial:
http://www.adobe.com/devnet/acrobat/pdfs/5411.ToUnicode.pdf
Look in this file: org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H
It should start like this.
%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (UniJIS-UCS2-H)
You will find that the character 24BA is not defined as an individual character mapping.
It is covered by a character range mapping:
100 begincidrange
<24b6> <24cf> 10339
This means that the characters in this range map like this:
24b6 is 10339 (x2863)
24b7 is 10340
24b8 is 10341
24b9 is 10342
24ba is 10343 (x2867) <- This is your character
This is the character you are looking for.
%% 9402=24BA E-o 24ba CIRCLED LATIN CAPITAL LETTER E
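The range arithmetic above (first CID of the range plus the offset of the code within the range) can be sketched as follows. The class CidRange and the method cidFor are hypothetical names used only for this illustration; this is not how PDFBox itself is written.

```java
// Hypothetical illustration of how a cidrange entry maps a code to a CID.
public class CidRange {
    /**
     * Returns the CID for a code covered by the entry <rangeStart> <rangeEnd> firstCid,
     * or -1 if the code lies outside the range.
     */
    static int cidFor(int rangeStart, int rangeEnd, int firstCid, int code) {
        if (code < rangeStart || code > rangeEnd) {
            return -1;
        }
        return firstCid + (code - rangeStart);
    }

    public static void main(String[] args) {
        // Entry quoted above: <24b6> <24cf> 10339
        int cid = cidFor(0x24B6, 0x24CF, 10339, 0x24BA);
        System.out.println("24ba -> " + cid + " (x" + Integer.toHexString(cid) + ")");
    }
}
```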
However, if you look in the other Japanese character mapping files, the character 24BA is explicitly defined as a character mapping.
You will find a mapping for the CIRCLED LATIN CAPITAL LETTER E character in this file:
org/apache/pdfbox/resources/cmap/Adobe-Japan1-UCS2
1 beginbfchar
<24BA> <004F030A>
endbfchar
It is also defined, with a different mapping, in this file:
org/apache/pdfbox/resources/cmap/Adobe-CNS1-UCS2
1 beginbfchar
<24BA> <75F6>
endbfchar
AW: Need work around for a problem with processpage
Posted by An...@rwe.com.
> On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur
> <na...@gmail.com> wrote:
> > I am using the PDFBox for one of the application. What I am doing is I
> > am extracting the PDF text from the PDF and generating the TOC
> > entries. But I am facing one problem, that is, if the PDF contains
> > these two characters "✠" (&#10016;) and "Ⓔ" (&#9402;) then the
> > processpage(PDPage, COSStream) gives an IOException "Unknown encoding
> > for 'UniJIS-UCS2-H' ". Can you let us know is there any way as to
> > overcome this problem?
>
> Unfortunately not. Unless someone else has a good answer,
> you'll probably need to look at the relevant source code in
> PDFBox to figure out what to do with this. If you do that,
> we'd be happy to apply any fix you may come up with.
I don't have a better answer than Jukka, but perhaps a hint on where to look for the solution.
As far as I understand, there are several Unicode mappings defined in Resources/cmap. You have to check
whether the two characters you mentioned above are part of the mapping table "UniJIS-UCS2-H". If not, the question
will be: is the problem in the mapping file or in the software that produced the document?
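The check described above could be automated roughly like this. It is only a sketch under assumptions: it handles cidrange blocks only, expects one entry of the form <start> <end> cid per line, and the class CMapRangeScan and method findCid are invented names. The sample text in main is the fragment quoted earlier in this thread, not the full file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner; it is not the PDFBox CMap parser.
public class CMapRangeScan {
    // Matches one cidrange entry such as: <24b6> <24cf> 10339
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s+<([0-9A-Fa-f]+)>\\s+(\\d+)");

    /** Returns the CID mapped to the code by any cidrange entry, or -1 if none covers it. */
    static int findCid(String cmapText, int code) {
        boolean insideBlock = false;
        for (String line : cmapText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.endsWith("begincidrange")) { insideBlock = true; continue; }
            if (trimmed.equals("endcidrange")) { insideBlock = false; continue; }
            if (!insideBlock) continue;
            Matcher m = ENTRY.matcher(trimmed);
            if (m.matches()) {
                int start = Integer.parseInt(m.group(1), 16);
                int end = Integer.parseInt(m.group(2), 16);
                if (code >= start && code <= end) {
                    return Integer.parseInt(m.group(3)) + (code - start);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fragment quoted in this thread; a real check would read the whole CMap file.
        String sample = "100 begincidrange\n<24b6> <24cf> 10339\nendcidrange\n";
        System.out.println("CID for 24BA in sample: " + findCid(sample, 0x24BA));
    }
}
```

A result of -1 for a character against the full file would suggest that the mapping file does not cover it.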
HTH
Andreas
Re: Need work around for a problem with processpage
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur <na...@gmail.com> wrote:
> I am using the PDFBox for one of the application. What I am doing is I
> am extracting the PDF text from the PDF and generating the TOC entries. But
> I am facing one problem, that is, if the PDF contains these two
> characters "✠" (&#10016;) and "Ⓔ" (&#9402;) then the processpage(PDPage,
> COSStream) gives an IOException "Unknown encoding for 'UniJIS-UCS2-H' ". Can
> you let us know is there any way as to overcome this problem?
Unfortunately not. Unless someone else has a good answer, you'll
probably need to look at the relevant source code in PDFBox to figure
out what to do with this. If you do that, we'd be happy to apply any
fix you may come up with.
BR,
Jukka Zitting