Posted to dev@pdfbox.apache.org by Petr Slabý <sl...@kadel.cz> on 2017/09/15 09:28:02 UTC

End of line heuristic

Hi,
I have a PDF which cannot be read by PDFBox 1.8.x, but works fine in PDFBox 2.x. Before I create an issue I need to internally clarify whether I can share the PDF. And I would like to clarify with you whether it makes sense to create an issue at all.

The problem in the PDF is the following.
The PDF consistently uses the line feed (0x0A) character for end of line. Its streams are encrypted. There is a stream having 

<</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>

so the length attribute is stored in a separate object and the sequential parser cannot use it. Hence it searches for the “endstream” keyword. The misfortune is that the stream itself ends with the byte 0x0D, which is interpreted by the EndstreamOutputStream as carriage return and stripped off from the stream. The stream content is then one byte short and the decryption fails...
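
To make the failure concrete, here is a minimal sketch of the effect described above. It is not the actual EndstreamOutputStream code; the class and method names are hypothetical, and the heuristic is reduced to "drop one trailing CR LF, LF, or CR before endstream".

```java
import java.util.Arrays;

public class TrailingEolDemo {
    // Simplified stand-in for the heuristic (assumption, not PDFBox code):
    // treat a trailing CR LF, lone LF, or lone CR as the end-of-line marker
    // that precedes "endstream" and drop it from the stream data.
    public static byte[] stripTrailingEol(byte[] data) {
        int end = data.length;
        if (end > 0 && data[end - 1] == 0x0A) {
            end--;
            if (end > 0 && data[end - 1] == 0x0D) {
                end--; // the 0x0D is assumed to be part of a CR LF pair
            }
        } else if (end > 0 && data[end - 1] == 0x0D) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    public static void main(String[] args) {
        // 16 bytes of encrypted data whose last byte happens to be 0x0D,
        // followed by the LF that the writer put before "endstream".
        byte[] raw = new byte[17];
        raw[15] = 0x0D;
        raw[16] = 0x0A;
        System.out.println(stripTrailingEol(raw).length); // prints 15
    }
}
```

The data byte 0x0D and the real line ending 0x0A together look like a CR LF pair, so 15 bytes are left instead of 16.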

Do you see a chance to improve this? Could the EndstreamOutputStream learn the line ending to search for from the PDF content? For example, my PDF starts with %PDF-1.7<0x0A>; could the EndstreamOutputStream search for just this character in such a case? Or are there PDFs which use a mixture of both line endings?
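
A rough sketch of that idea, under the assumption that the line ending after the header is representative of the whole file (which is exactly the open question). All names here are hypothetical and none of this is existing PDFBox API.

```java
import java.util.Arrays;

public class HeaderEolSketch {
    public enum Eol { CR, LF, CRLF, UNKNOWN }

    // Find the first line ending in the bytes at the start of the file,
    // i.e. the one that terminates the "%PDF-x.y" header line.
    // If a CR is the last byte of the buffer, it is reported as a lone CR.
    public static Eol detectEol(byte[] fileStart) {
        for (int i = 0; i < fileStart.length; i++) {
            if (fileStart[i] == 0x0A) {
                return Eol.LF;
            }
            if (fileStart[i] == 0x0D) {
                boolean lfNext = i + 1 < fileStart.length && fileStart[i + 1] == 0x0A;
                return lfNext ? Eol.CRLF : Eol.CR;
            }
        }
        return Eol.UNKNOWN;
    }

    // Strip only the detected line ending from the end of the stream data.
    public static byte[] stripTrailingEol(byte[] data, Eol eol) {
        int end = data.length;
        switch (eol) {
            case LF:
                if (end >= 1 && data[end - 1] == 0x0A) end -= 1;
                break;
            case CR:
                if (end >= 1 && data[end - 1] == 0x0D) end -= 1;
                break;
            case CRLF:
                if (end >= 2 && data[end - 2] == 0x0D && data[end - 1] == 0x0A) end -= 2;
                break;
            default:
                break; // UNKNOWN: leave the data untouched in this sketch
        }
        return Arrays.copyOf(data, end);
    }
}
```

With a header line ending in LF, stream data ending in 0x0D 0x0A would lose only the 0x0A and keep the 0x0D. A real change would probably have to fall back to the current behaviour when the detection is UNKNOWN or the file mixes line endings.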

The only other solution I found is to add the missing byte in the method encryptData() of SecurityHandler. There I know that the data length has to be divisible by 16, so I add 0x0D if one byte is missing. But it is rather a hack and I am not sure whether the missing byte might not be 0x0A in some cases. And this only helps for AES encrypted streams anyway.
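
For illustration, the workaround amounts to something like the following stand-alone helper. The class and method names are hypothetical; this is not the SecurityHandler code itself, and guessing 0x0D for the missing byte is the assumption I am unsure about.

```java
import java.util.Arrays;

public class AesPaddingHack {
    // AES data must have a length divisible by 16. If exactly one byte is
    // missing, assume (a guess) that a trailing 0x0D was stripped and put it back.
    public static byte[] restoreStrippedByte(byte[] encrypted) {
        if (encrypted.length % 16 == 15) {
            byte[] fixed = Arrays.copyOf(encrypted, encrypted.length + 1);
            fixed[fixed.length - 1] = 0x0D;
            return fixed;
        }
        return encrypted; // length already fine, or off by more than one byte
    }
}
```

If the stripped byte was really 0x0A, the last AES block would be decrypted from wrong input, so this only hides the problem for the 0x0D case.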

Please do not suggest moving to the non-sequential parser of PDFBox 2.x (I guess that is the reason why it works there). I would love to move on to the new version, but we are not that far along in our software yet. And our customers will move to it one or two years after we are ready...

Best regards,
Petr.

Re: End of line heuristic

Posted by Tilman Hausherr <TH...@t-online.de>.

On 15.09.2017 at 11:28, Petr Slabý wrote:
> Hi,
> I have a PDF which cannot be read by PDFBox 1.8.x, but works fine in PDFBox 2.x. Before I create an issue I need to internally clarify whether I can share the PDF. And I would like to clarify with you whether it makes sense to create an issue at all.
>
> The problem in the PDF is the following.
> The PDF consistently uses the line feed (0x0A) character for end of line. Its streams are encrypted. There is a stream having
>
> <</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>
>
> so the length attribute is stored in a separate object and the sequential parser cannot use it. Hence it searches for the “endstream” keyword. The misfortune is that the stream itself ends with the byte 0x0D, which is interpreted by the EndstreamOutputStream as carriage return and stripped off from the stream. The stream content is then one byte short and the decryption fails...
>
> Do you see a chance to improve this? Could the EndstreamOutputStream learn the line ending to search for from the PDF content? For example, my PDF starts with %PDF-1.7<0x0A>; could the EndstreamOutputStream search for just this character in such a case? Or are there PDFs which use a mixture of both line endings?

I remember that we had some weird endings, which is why there's this 
weird code... so if you want, send a change and I'll let it run on all 
my test files.

Tilman



