Posted to users@pdfbox.apache.org by Ankit Sethi <9....@gmail.com> on 2020/12/10 00:53:26 UTC

Mega-low-memory PDF encryption

Hi all,

I have an HTTP endpoint returning unencrypted PDFs, with no master or user
password set. My task is to perform two simple customizations:

1) set pre-defined access permissions

2) encrypt with pre-defined master and user password

and then deliver the file to a variety of SMTP, FTP, or HTTP destinations.

I want to minimize JVM heap size and am considering the tempFile option for
the scratch file.

But on second thought, I feel that it should be possible to do this
"in-flight". Having read the PDF spec a bit, the inputs to the encryption
algorithm seem to be the passwords (known in advance) and the file ID (I'm
OK with discarding whatever comes in the trailer and generating one in
advance from known business identifiers).

It should be possible to wrap the http response body InputStream with a
custom CipherInputStream that begins encrypting the doc immediately as the
bytes start coming into a buffer. In addition, the CipherInputStream would
need to do two things towards the end of the response stream: replace the
file ID bytes with my own, and replace the access permission integer with
my desired value.

This CipherInputStream can then be provided as input to a JavaMail, FTP,
or HTTPClient API and voila, I've performed the customization without ever
loading an entire PDF into memory, not counting a small buffer. (I imagine
the buffer length is constrained by the block size of the encryption.)
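
For reference, the generic "encrypt while reading" mechanism described
above can be sketched with the JDK's javax.crypto.CipherInputStream. This
is only an illustration of the streaming idea: it treats the whole response
as opaque bytes, which is not the same thing as producing a PDF encrypted
according to the PDF spec. AES in CTR mode, the all-zero key and IV, and the
class name StreamingEncryptSketch are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class StreamingEncryptSketch {

    // Wraps a source stream so that bytes are encrypted lazily as they are read.
    public static InputStream encrypting(InputStream source, byte[] key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE,
                    new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return new CipherInputStream(source, cipher);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Copies through a fixed 8 KiB buffer, so heap use does not grow with input size.
    public static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // all-zero demo key (assumption), never use in practice
        byte[] iv = new byte[16];  // all-zero demo IV (assumption), never reuse with a real key
        byte[] body = "pretend HTTP response body".getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream destination = new ByteArrayOutputStream();
        long written = copy(encrypting(new ByteArrayInputStream(body), key, iv), destination);
        System.out.println("bytes in: " + body.length + ", bytes out: " + written);
    }
}
```

In this sketch the ByteArrayInputStream stands in for the HTTP response
body and the ByteArrayOutputStream for the SMTP, FTP, or HTTP destination.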

My question is -- does this sound feasible? Or is there some non-linearity
in the PDF structure that will force me to load the whole file despite the
modest and well-defined updates I need to make?

Would appreciate any suggestions or advice,

Best,

Ankit

Re: Mega-low-memory PDF encryption

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 2020-12-10 at 01:53, Ankit Sethi wrote:
> Hi all,
> 
> I have an HTTP endpoint returning unencrypted PDFs, with no master or user
> password set. My task is to perform two simple customizations:
> 
> 1) set pre-defined access permissions
> 
> 2) encrypt with pre-defined master and user password
> 
>   and then deliver the file to a variety of locations, SMTP, FTP, or HTTP
> destinations.
> 
> I want to minimize JVM heap size and am considering the tempFile option for
> the scratch file.
> 
> But on second thought, I feel that it should be possible to do this
> "in-flight". Having read the PDF spec a bit, the inputs to the encryption
> algorithm seem to be the passwords (known in advance) and the file-id (I'm
> ok with discarding what's going to come in the trailer and pre-produce
> something out of known business identifiers)
> 
> It should be possible to wrap the http response body InputStream with a
> custom CipherInputStream that begins encrypting the doc immediately as the
> bytes start coming into a buffer. In addition, the CipherInputStream would
> need to detect and perform two things towards the end of the response
> stream - replace the file id bytes with my own and replace the access
> permission integer to my desired value.
> 
> This CipherInputStream can then be provided as input to a JavaMail or FTP
> or HTTPClient api and voila, I've performed customization without ever
> loading any PDF into memory entirely, not counting a small buffer. (I
> imagine the buffer length is constrained by the block size of the
> encryption)
> 
> My question is -- does this sound feasible? Or is there some non-linearity
> in the PDF structure that will force me to load the whole file despite the
> modest and well-defined updates I need to make?
I'm afraid PDF reality isn't that simple. There isn't a single stream to
encrypt: many individual parts of a document are encrypted separately, and
others are left unencrypted.

Chapter 7.6.1 "Encryption" of the PDF spec starts with the following summary:

A PDF document can be encrypted (PDF 1.1) to protect its contents from 
unauthorized access. Encryption applies to all strings and streams in the 
document's PDF file, with the following exceptions:

• The values for the ID entry in the trailer
• Any strings in an Encrypt dictionary
• Any strings that are inside streams such as content streams and compressed 
object streams, which themselves are encrypted

Encryption is not applied to other object types such as integers and boolean 
values, which are used primarily to convey information about the document's 
structure rather than its contents. Leaving these values unencrypted allows 
random access to the objects within a document, whereas encrypting the strings 
and streams protects the document's contents.
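
To illustrate why a single cipher pass over the whole file cannot work:
with the RC4-based security handlers, each string and stream is encrypted
under its own key, derived from the file encryption key plus the number
and generation of the object that contains it. The sketch below follows
that derivation as I read it in Algorithm 1 of section 7.6.2 of the spec.
The class name ObjectKeySketch and the demo file key are hypothetical; a
real file key is computed from the passwords, the permissions, and the file
ID. AES-based handlers differ in detail and are not covered here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ObjectKeySketch {

    // Per-object key for RC4-based PDF encryption:
    // MD5(fileKey || objNum as 3 bytes, low-order first || genNum as 2 bytes, low-order first),
    // truncated to min(fileKey.length + 5, 16) bytes.
    public static byte[] objectKey(byte[] fileKey, int objNum, int genNum) {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        md5.update(fileKey);
        md5.update((byte) objNum);
        md5.update((byte) (objNum >>> 8));
        md5.update((byte) (objNum >>> 16));
        md5.update((byte) genNum);
        md5.update((byte) (genNum >>> 8));
        byte[] digest = md5.digest();
        return Arrays.copyOf(digest, Math.min(fileKey.length + 5, 16));
    }

    // Renders a byte array as lowercase hex for printing.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fileKey = {1, 2, 3, 4, 5}; // hypothetical 40-bit file key for the demo
        // Two objects in the same file end up with different keys.
        System.out.println("object 1 0: " + hex(objectKey(fileKey, 1, 0)));
        System.out.println("object 2 0: " + hex(objectKey(fileKey, 2, 0)));
    }
}
```

An encrypting writer therefore has to know which object each string or
stream belongs to, which requires parsing the document's object structure
rather than treating the file as one flat byte stream.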

Andreas


