Posted to dev@pdfbox.apache.org by Ryan Jackson <ry...@workiva.com> on 2022/03/11 20:49:52 UTC

Suspected bug in and proposed fix for ToUnicodeWriter.writeTo

Dear Apache Devs:

I believe that I have identified a bug in the creation of the
(begin/end)bfrange operator used when embedding fonts with the
PDCIDFontType2Embedder class.

The bug exists (as best I can tell) in both the main trunk and in the 2.0
branch. The code in question may be found here
<https://github.com/ryanjackson-wf/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136>.
The portion of the PDF specification (version 1.7) that bears upon this
code is Section 5.9, Example 5.16.

The existing code attempts to limit the range logic to changes less than or
equal to 255 code points, but it fails to account for at least the
following situation by allowing this (for example):

[srcCode1 srcCode2 dstString]
03FF 0400 0036

The overflow between srcCode1 and srcCode2 is not allowed by the
specification and any text extraction will fail. The glyphs themselves
render fine so it is not immediately obvious there is a problem until one
tries to examine the text by using the Content Panel or by copy/pasting
from Acrobat (Pro) to some other document. By contrast the following
bfrange operator does allow the text extraction to work as intended:

[srcCode1 srcCode2 dstString]
03FE 03FF 0035

Notice that no overflow exists, and as such the requirements of the
specification are met.
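
For illustration, here is a minimal sketch of the kind of check described
above. It is not the code from ToUnicodeWriter or from the proposed fix. The
class name BfRangeSplitter and its methods are hypothetical, and it assumes
two-byte source codes and single 16-bit destination values (surrogates are
ignored). It reads the rule as: a range may only continue while the high byte
of the source code stays the same and the low byte of the destination does
not wrap around.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class BfRangeSplitter {

    /** Returns true if the next mapping may join the range that ends at the previous mapping. */
    public static boolean canExtend(int prevCode, int nextCode, int prevDst, int nextDst) {
        return nextCode == prevCode + 1              // source codes are consecutive
            && nextDst == prevDst + 1                // destinations are consecutive
            && (nextCode >> 8) == (prevCode >> 8)    // high byte of the source code is unchanged
            && (nextDst & 0xFF) != 0;                // low byte of the destination did not wrap past FF
    }

    /** Groups sorted code-to-destination mappings into bfrange lines. */
    public static List<String> toBfRanges(int[] codes, int[] dsts) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= codes.length; i++) {
            // Close the current range at the end of input or when the next mapping may not join it.
            if (i == codes.length || !canExtend(codes[i - 1], codes[i], dsts[i - 1], dsts[i])) {
                lines.add(String.format("<%04X> <%04X> <%04X>",
                        codes[start], codes[i - 1], dsts[start]));
                start = i;
            }
        }
        return lines;
    }
}
```

With the codes 03FE, 03FF, 0400 mapped to 0035, 0036, 0037, this sketch emits
one range for 03FE through 03FF and a separate single-code range for 0400,
rather than a range that crosses from 03FF to 0400.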

I've looked briefly at the PDFBOX project in Jira and have found the
following tickets that may be caused by this same problem:

PDFBOX-4785 <https://issues.apache.org/jira/browse/PDFBOX-4785>
PDFBOX-5350 <https://issues.apache.org/jira/browse/PDFBOX-5350>

I have put together a proposed solution here
<https://github.com/ryanjackson-wf/pdfbox/pull/1> in my fork of the PDFBox
GH mirror. With your permission I'd like to open a new Jira ticket for this
and collaborate with whomever would like to help drive this work to get it
reviewed and merged. I do have some open questions about how surrogates are
to be handled. I'm also open to changes in the proposed code.

Thank you for your time.

Sincerely,

Ryan Jackson
Senior Software Engineer
Workiva Inc.

Re: Suspected bug in and proposed fix for ToUnicodeWriter.writeTo

Posted by Ryan Jackson <ry...@workiva.com>.
Dear Andreas,

Thank you. I'll write up a ticket soon. I may not be able to get to it
until Monday MST (US) but will create it and add some sample PDF files to
the ticket. I also have a working Adobe Acrobat Pro example (they are using
the bfchar operator instead).

Many thanks!

Ryan.


On Sat, Mar 12, 2022 at 3:28 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Hi,
>
> On 11.03.22 at 21:49, Ryan Jackson wrote:
> > Dear Apache Devs:
> >
> > I believe that I have identified a bug in the creation of the
> > (begin/end)bfrange operator used when embedding fonts with the
> > PDCIDFontType2Embedder class.
> >
> > The bug exists (as best I can tell) in both the main trunk and in the 2.0
> > branch. The code in question may be found here
> > <https://github.com/ryanjackson-wf/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136>.
> > The portion of the PDF specification (version 1.7) that bears upon this
> > code is Section 5.9, Example 5.16.
> >
> > The existing code attempts to limit the range logic to changes less than or
> > equal to 255 code points, but it fails to account for at least the
> > following situation by allowing this (for example):
> >
> > [srcCode1 srcCode2 dstString]
> > 03FF 0400 0036
> >
> > The overflow between srcCode1 and srcCode2 is not allowed by the
> > specification and any text extraction will fail. The glyphs themselves
> > render fine so it is not immediately obvious there is a problem until one
> > tries to examine the text by using the Content Panel or by copy/pasting
> > from Acrobat (Pro) to some other document. By contrast the following
> > bfrange operator does allow the text extraction to work as intended:
> >
> > [srcCode1 srcCode2 dstString]
> > 03FE 03FF 0035
> >
> > Notice that no overflow exists, and as such the requirements of the
> > specification are met.
> I'm afraid you are right, good catch.
>
> > I've looked briefly at the PDFBOX project in Jira and have found the
> > following tickets that may be caused by this same problem:
> >
> > PDFBOX-4785 <https://issues.apache.org/jira/browse/PDFBOX-4785>
> > PDFBOX-5350 <https://issues.apache.org/jira/browse/PDFBOX-5350>
> Yes, in a way. Those are about reading malformed PDFs containing the very
> same issue you have described above.
> Fun fact: we complain about other PDF writers not following the spec and
> do the very same ourselves: I never came up with the idea to check our own
> code :-(
>
> > I have put together a proposed solution here
> > <https://github.com/ryanjackson-wf/pdfbox/pull/1> in my fork of the PDFBox
> > GH mirror. With your permission I'd like to open a new Jira ticket for this
> > and collaborate with whomever would like to help drive this work to get it
> > reviewed and merged. I do have some open questions about how surrogates are
> > to be handled. I'm also open to changes in the proposed code.
> You don't have to wait for permission. Please create a JIRA ticket including
> a link to your PR.
>
>
> > Thank you for your time.
> Thanks for your time and the proposed solution.
>
> Andreas
>
> >
> > Sincerely,
> >
> > Ryan Jackson
> > Senior Software Engineer
> > Workiva Inc.
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>
>

Re: Suspected bug in and proposed fix for ToUnicodeWriter.writeTo

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 11.03.22 at 21:49, Ryan Jackson wrote:
> Dear Apache Devs:
> 
> I believe that I have identified a bug in the creation of the
> (begin/end)bfrange operator used when embedding fonts with the
> PDCIDFontType2Embedder class.
> 
> The bug exists (as best I can tell) in both the main trunk and in the 2.0
> branch. The code in question may be found here
> <https://github.com/ryanjackson-wf/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136>.
> The portion of the PDF specification (version 1.7) that bears upon this
> code is Section 5.9, Example 5.16.
> 
> The existing code attempts to limit the range logic to changes less than or
> equal to 255 code points, but it fails to account for at least the
> following situation by allowing this (for example):
> 
> [srcCode1 srcCode2 dstString]
> 03FF 0400 0036
> 
> The overflow between srcCode1 and srcCode2 is not allowed by the
> specification and any text extraction will fail. The glyphs themselves
> render fine so it is not immediately obvious there is a problem until one
> tries to examine the text by using the Content Panel or by copy/pasting
> from Acrobat (Pro) to some other document. By contrast the following
> bfrange operator does allow the text extraction to work as intended:
> 
> [srcCode1 srcCode2 dstString]
> 03FE 03FF 0035
> 
> Notice that no overflow exists, and as such the requirements of the
> specification are met.
I'm afraid you are right, good catch.

> I've looked briefly at the PDFBOX project in Jira and have found the
> following tickets that may be caused by this same problem:
> 
> PDFBOX-4785 <https://issues.apache.org/jira/browse/PDFBOX-4785>
> PDFBOX-5350 <https://issues.apache.org/jira/browse/PDFBOX-5350>
Yes, in a way. Those are about reading malformed PDFs containing the very same
issue you have described above.
Fun fact: we complain about other PDF writers not following the spec and
do the very same ourselves: I never came up with the idea to check our own code :-(

> I have put together a proposed solution here
> <https://github.com/ryanjackson-wf/pdfbox/pull/1> in my fork of the PDFBox
> GH mirror. With your permission I'd like to open a new Jira ticket for this
> and collaborate with whomever would like to help drive this work to get it
> reviewed and merged. I do have some open questions about how surrogates are
> to be handled. I'm also open to changes in the proposed code.
You don't have to wait for permission. Please create a JIRA ticket including a
link to your PR.


> Thank you for your time.
Thanks for your time and the proposed solution.

Andreas

> 
> Sincerely,
> 
> Ryan Jackson
> Senior Software Engineer
> Workiva Inc.
> 

