You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2022/03/19 17:04:00 UTC

[jira] [Comment Edited] (PDFBOX-5387) ToUnicodeWriter.writeTo allows byte overflow in bfrange operator

    [ https://issues.apache.org/jira/browse/PDFBOX-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509309#comment-17509309 ] 

Andreas Lehmkühler edited comment on PDFBOX-5387 at 3/19/22, 5:03 PM:
----------------------------------------------------------------------

i try to answer your questions AFAIU the topic
{quote}
Technically we might consider using the {{bfchar}} operator instead of {{bfrange}} when the logic allows only one element (in what would otherwise be a range). That would be a small optimization. In my investigation of this issue, I found that (for the small sample document I used - just using a few numbers - so not representative of a "real" document), Adobe Acrobat (Pro) uses {{bfchar}}. Apparently it must favor {{bfchar}} for small subsets
{quote}
We could do that, but that would require a little refactoring as all of the bfchar entries have to be added in front of the bfrange entries. I'm not sure if the benefit is that much. However, we should open a new ticket if someone want to follow on this

{quote}
With regard to UTF-16 supplementary characters, I am not certain that the existing code (copied below; see allowDestinationRange) actually does what we want:
{quote}
We don't have to care about the conversion of out of range values into surrogates as this is done automatically when creating the string as the internal representation of a string in UTF-16BE based. We have to care about the length of the string as it is allowed to map a cid to more than one codepoint, e.g. to ligatures such as "ff". Saying that, the check of the length does make sense.

{quote}
The last question I have is whether or not I may have violated an assumption of my own in allowCodeRange (which assumes 16-bit values) by calling it from allowDestinationRange with values potentially greater than 16-bit (the code point should reflect UTF-32).
{quote}
{{allowCodeRange}} uses the CID values as input and those are definitely 16bit values.

{quote}
I apologize that I did not mention these questions earlier, but I am glad that we can discuss them and make any changes if necessary
{quote}
There is absolutely no need to apologize, on the contrary. You have found a not that obvious issue, have filed a ticket with a detailed analysis and have provided a working solution as well. IMHO that's an outstanding example for a code contribution, we have to thank you.



was (Author: lehmi):
i try to answer your questions AFAIU the topic
{quote}
Technically we might consider using the {{bfchar}} operator instead of {{bfrange}} when the logic allows only one element (in what would otherwise be a range). That would be a small optimization. In my investigation of this issue, I found that (for the small sample document I used - just using a few numbers - so not representative of a "real" document), Adobe Acrobat (Pro) uses {{{}bfchar{}}}. Apparently it must favor {{bfchar}} for small subsets
{quote}
We could do that, but that would require a little refactoring as all of the bfchar entries have to be added in front of the bfrange entries. I'm not sure if the benefit is that much. However, we should open a new ticket if someone want to follow on this
{quote}With regard to UTF-16 supplementary characters, I am not certain that the existing code (copied below; see allowDestinationRange) actually does what we want:{quote}
We don't have to care about the conversion of out of range values into surrogates as this is done automatically when creating the string as the internal representation of a string in UTF-16BE based. We have to care about the length of the string as it is allowed to map a cid to more than one codepoint, e.g. to ligatures such as "ff". Saying that, the check of the length does make sense.
{quote}
The last question I have is whether or not I may have violated an assumption of my own in allowCodeRange (which assumes 16-bit values) by calling it from allowDestinationRange with values potentially greater than 16-bit (the code point should reflect UTF-32).
{quote}
{{allowCodeRange}} uses the CID values as input and those are definitely 16bit values.

{quote}I apologize that I did not mention these questions earlier, but I am glad that we can discuss them and make any changes if necessary{quote}
There is absolutely no need to apologize, on the contrary. You have found a not that obvious issue, have filed a ticket with a detailed analysis and have provided a working solution as well. IMHO that's an outstanding example for a code contribution, we have to thank you.


> ToUnicodeWriter.writeTo allows byte overflow in bfrange operator
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-5387
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5387
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.25
>            Reporter: Ryan Jackson
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.26, 3.0.0 PDFBox
>
>
> The {{writeTo}} method of {{ToUnicodeWriter}} allows overflow in the low-order byte when writing the {{(begin/end)bfrange}} operator.
> As far as I can tell it is used only with the {{PDCIDFontType2Embedder}} class. I believe the bug exists in both the main trunk and in the 2.x branch. The code in question may be found [here|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/ToUnicodeWriter.java#L133-L136] .
> The portion of the PDF specification (version 1.7) that bears upon this code is Section 5.9, Example 5.16.
> The existing code attempts to limit the range logic to changes less than or equal to 255 code points, but it fails to account for at least the following situation by allowing this (for example):
> [srcCode1 srcCode2 dstString]
> 03FF 0400 0036
> The overflow between srcCode1 and srcCode2 is not allowed by the specification and any text extraction will fail. The glyphs themselves render fine so it is not immediately obvious there is a problem until one tries to examine the text by using the Content Panel or by copy/pasting from Acrobat (Pro) to some other document. By contrast the following bfrange operator does allow the text extraction to work as intended:
> [srcCode1 srcCode2 dstString]
> 03FE 03FF 0035
> Notice that no overflow exists, and as such the requirements of the specification are met.
> I have put together a proposed solution [here|https://github.com/ryanjackson-wf/pdfbox/pull/1] in my fork of the PDFBox GH mirror.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org