
Posted to users@pdfbox.apache.org by Natraj Kadur <na...@gmail.com> on 2009/01/21 10:56:01 UTC

Need work around for a problem with processpage

Hi,

I am using PDFBox in one of our applications. I extract the text from
a PDF and generate TOC entries from it. However, I am facing one problem:
if the PDF contains these two characters, "&#10016;" (✠) and "&#9402;" (Ⓔ),
then processpage(PDPage, COSStream) throws an IOException,
"Unknown encoding for 'UniJIS-UCS2-H'". Is there any way to work around
this problem?

Regards
Natraj

RE: Need work around for a problem with processpage

Posted by Pe...@ibi.com.

This is probably not a solution; it is more me thinking out loud while looking for one.

I should mention that this is not an area I am familiar with, but I thought I would give it a shot. Right or wrong, I have learned a little. It appears to me that the characters you mentioned are actually defined in the CMap file org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H

The CMap file is read as a resource. If your class path resolved the CMap file from a different directory, perhaps from an earlier installation
that did not define these characters, that could cause the problem.
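
One way to test the class path idea is to ask the class loader for every copy of the resource it can see. This is only a minimal sketch: the class name FindCMapCopies and the helper locate are made up for illustration and are not part of PDFBox, and it assumes PDFBox loads the CMap through the same class loader.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class FindCMapCopies {
    // Hypothetical helper: returns every class path location that provides the resource.
    static List<URL> locate(String resourceName) throws IOException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return Collections.list(cl.getResources(resourceName));
    }

    public static void main(String[] args) throws IOException {
        // Resource path taken from the message above.
        String name = "org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H";
        List<URL> copies = locate(name);
        if (copies.isEmpty()) {
            System.out.println("Not found on the class path: " + name);
        }
        for (URL url : copies) {
            // More than one line here suggests an older copy may be shadowing the expected one.
            System.out.println(url);
        }
    }
}
```

Run it with the same class path as the failing application. If it prints more than one URL, the first one listed is normally the copy that gets loaded.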

I also wondered whether the character map is getting corrupted somehow, but I have no proof of this.

Let's start with the hex values of the two code points below, "&#10016;"(✠) and "&#9402;"(Ⓔ)

 9402 = x24BA
10016 = x2720
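
If you want to double-check the conversion, a few lines of Java will do it. This is just a sketch; the class name CodePointHex and the helper toHex are my own names, nothing from PDFBox.

```java
public class CodePointHex {
    // Formats a decimal code point in the usual U+XXXX notation.
    static String toHex(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {9402, 10016}) {
            // Character.getName returns the Unicode character name known to the JDK.
            System.out.println(cp + " = " + toHex(cp) + " (" + Character.getName(cp) + ")");
        }
    }
}
```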

Below is a link to a description of CMap (character map) files.

http://www.adobe.com/devnet/font/pdfs/5099.CMapFiles.pdf

Here is a link to ToUnicode Mapping File Tutorial 
http://www.adobe.com/devnet/acrobat/pdfs/5411.ToUnicode.pdf

Look in this file:  org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H
It should start like this.

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (UniJIS-UCS2-H)

You will find that the character 24BA is not defined as an individual character mapping.
Instead, it is covered by a character range mapping.

100 begincidrange

<24b6> <24cf> 10339

This means that the codes in the range map to consecutive CIDs, starting at 10339:

24b6 is 10339 (x2863)
24b7 is 10340
24b8 is 10341
24b9 is 10342
24ba is 10343 (x2867)   <- This is your character

This is the character you are looking for.

%% 9402=24BA E-o    24ba    CIRCLED LATIN CAPITAL LETTER E
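
The range arithmetic above can be written out as a small sketch. The class name CidRangeLookup and the method cidFor are hypothetical names for illustration; this is not how PDFBox itself implements the lookup.

```java
public class CidRangeLookup {
    // Returns the CID for a code inside [lo, hi] whose first CID is startCid,
    // or -1 if the code lies outside the range.
    static int cidFor(int code, int lo, int hi, int startCid) {
        if (code < lo || code > hi) {
            return -1;
        }
        return startCid + (code - lo);
    }

    public static void main(String[] args) {
        // Range entry from the CMap: <24b6> <24cf> 10339
        int cid = cidFor(0x24BA, 0x24B6, 0x24CF, 10339);
        System.out.println(cid + " = x" + Integer.toHexString(cid)); // prints "10343 = x2867"
    }
}
```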

However, if you look in the other Japanese character mapping file, the character 24BA is explicitly defined as a character mapping:
	org/apache/pdfbox/resources/cmap/Adobe-Japan1-UCS2

You will find a mapping for the CIRCLED LATIN CAPITAL LETTER E character.

org\apache\pdfbox\resources\cmap\Adobe-Japan1-UCS2
1 beginbfchar
<24BA> <004F030A>
endbfchar

It is also defined, with a different mapping, in this file.
org\apache\pdfbox\resources\cmap\Adobe-CNS1-UCS2
1 beginbfchar
<24BA> <75F6>
endbfchar


AW: Need work around for a problem with processpage

Posted by An...@rwe.com.

> On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur
> <na...@gmail.com> wrote:
> > I am using PDFBox in one of our applications. I extract the text from
> > a PDF and generate TOC entries from it. However, I am facing one problem:
> > if the PDF contains these two characters, "&#10016;" (✠) and "&#9402;" (Ⓔ),
> > then processpage(PDPage, COSStream) throws an IOException,
> > "Unknown encoding for 'UniJIS-UCS2-H'". Is there any way to work around
> > this problem?
>
> Unfortunately not. Unless someone else has a good answer,
> you'll probably need to look at the relevant source code in
> PDFBox to figure out what to do with this. If you do that,
> we'd be happy to apply any fix you may come up with.
I don't have a better answer than Jukka, but perhaps a hint on where to look for the solution.
As far as I understand, there are several Unicode mappings defined in Resources/cmap. You have to check
whether the two characters you mentioned above are part of the mapping table "UniJIS-UCS2-H". If not, the question
will be: is the problem in the mapping file or in the software that produced the document?
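
A rough sketch of such a check, assuming the CMap file has been extracted from the PDFBox jar to a local path of your choosing. The class name CMapCoverage and the method findCid are invented for illustration. The sketch only looks at lines of the form "<lo> <hi> cid" and "<code> cid", and it assumes two-byte codes as used by UniJIS-UCS2-H.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CMapCoverage {
    // "<lo> <hi> cid", the form of entries in a cidrange block.
    private static final Pattern RANGE =
        Pattern.compile("^\\s*<([0-9A-Fa-f]+)>\\s+<([0-9A-Fa-f]+)>\\s+(\\d+)\\s*$");
    // "<code> cid", the form of entries in a cidchar block.
    private static final Pattern SINGLE =
        Pattern.compile("^\\s*<([0-9A-Fa-f]+)>\\s+(\\d+)\\s*$");

    // Returns the CID the given CMap lines assign to code, or -1 if no entry covers it.
    static int findCid(List<String> lines, int code) {
        for (String line : lines) {
            Matcher r = RANGE.matcher(line);
            if (r.matches()) {
                int lo = Integer.parseInt(r.group(1), 16);
                int hi = Integer.parseInt(r.group(2), 16);
                if (code >= lo && code <= hi) {
                    return Integer.parseInt(r.group(3)) + (code - lo);
                }
                continue;
            }
            Matcher s = SINGLE.matcher(line);
            if (s.matches() && Integer.parseInt(s.group(1), 16) == code) {
                return Integer.parseInt(s.group(2));
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the extracted CMap file (an assumption, use your own path).
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        for (int code : new int[] {0x24BA, 0x2720}) {
            System.out.printf("%04X -> %d%n", code, findCid(lines, code));
        }
    }
}
```

A result of -1 for either code would point at the mapping file; if both codes are covered, the document-producing software or the way the CMap is loaded is the more likely suspect.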

HTH
Andreas
----------------------------------------------------------------
Chairman of the Supervisory Board: Alwin Fitting
Management: Chittur Ramakrishnan (Chairman),
Stefan Niehusmann

Registered office: Dortmund
Registered at the District Court of Dortmund
Commercial register no. HR B 21222
VAT ID no. DE 2588 96 719

Re: Need work around for a problem with processpage

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur <na...@gmail.com> wrote:
> I am using PDFBox in one of our applications. I extract the text from
> a PDF and generate TOC entries from it. However, I am facing one problem:
> if the PDF contains these two characters, "&#10016;" (✠) and "&#9402;" (Ⓔ),
> then processpage(PDPage, COSStream) throws an IOException,
> "Unknown encoding for 'UniJIS-UCS2-H'". Is there any way to work around
> this problem?

Unfortunately not. Unless someone else has a good answer, you'll
probably need to look at the relevant source code in PDFBox to figure
out what to do with this. If you do that, we'd be happy to apply any
fix you may come up with.

BR,

Jukka Zitting