
Posted to dev@pdfbox.apache.org by Itai <it...@gmail.com> on 2018/03/01 11:54:32 UTC

Allowing subsampled/downscaled rendering of images, or rendering of subregions of images

Hello,

Following a question asked on pdfbox-users [1], I set about trying to
allow rendering images at lower resolutions, and additionally rendering
only parts of images. The need arises from having very large images,
usually JPEG or JBIG2, which are tens of megabytes in size when compressed,
but may take up 8 or even more gigabytes when rendered as a BufferedImage
at full resolution.
I have come up with a solution that seems to work (passes all of the
built-in PDFBox tests, and a few manual ones I tried), but since it
includes some deep changes in the logic I understand if it won't find its
way into PDFBox.

While working on it, I also came across PDFBOX-3340 [2], and since my hack
relies on making changes to the way filters work, it includes a (partial)
fix for that bug too.

Finally, since I'm not well versed in git/github, I'm not sure of the best
way to share my work. I attach here a unified diff, but let me know if
there is another preferred method (pull request? clone the repository?)

Following is an explanation/description of my changes, for those
interested. I would love to hear any feedback, especially for things which
may increase the likelihood of such a feature being included in future
versions of PDFBox.

Thanks,
Itai.

As stated, the issue pertains mainly to very large images (lots of pixels)
which are highly compressed. Since DCTFilter, JBIG2Filter etc. render the
entire image, I had to augment the way Filter works, to allow it to accept
options.
This is where the class DecodeOptions comes in. It has sub-region and
subsampling options (mirroring those of ImageReadParam), as well as a
"metadata only" param. When decoding, you may pass DecodeOptions, such that
image-related filters can downscale or only render a part of the image.
The "metadata-only" option is used for the `repair` method of
PDImageXObject, as it only really needs the DecodeResult - where applicable
and possible, a filter encountering this option will not decode the stream,
only set the DecodeResult parameters (this is not always possible, e.g. for
JPXFilter, which must decode the image to get the parameters).
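
For reference, the ImageReadParam options that DecodeOptions is said to mirror behave as follows in plain ImageIO. This is a minimal sketch that uses only javax.imageio; the class and method names (SubsampledRead, read, samplePng) are made up for illustration and are not taken from the patch.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;

class SubsampledRead
{
    /** Decodes only `region` (null = whole image), keeping every `step`-th pixel in x and y. */
    static BufferedImage read(byte[] encoded, Rectangle region, int step) throws IOException
    {
        try (ImageInputStream in =
                 ImageIO.createImageInputStream(new ByteArrayInputStream(encoded)))
        {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext())
            {
                throw new IOException("No ImageIO reader for this data");
            }
            ImageReader reader = readers.next();
            try
            {
                reader.setInput(in);
                ImageReadParam param = reader.getDefaultReadParam();
                if (region != null)
                {
                    param.setSourceRegion(region);
                }
                // Same period in both directions, no offset into the region.
                param.setSourceSubsampling(step, step, 0, 0);
                return reader.read(0, param);
            }
            finally
            {
                reader.dispose();
            }
        }
    }

    /** Encodes a blank image as PNG, standing in for an embedded image stream. */
    static byte[] samplePng(int width, int height) throws IOException
    {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        return out.toByteArray();
    }
}
```

Reading a 32x24 region of a 64x48 image with step 2 should yield a 16x12 BufferedImage, so only a fraction of the full raster is ever allocated.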

The DecodeOptions also has an "honored" flag, which the filter sets to true
if it honored the options - this is needed because when decoding an image
stored in a Flate or LZW stream, the filter doesn't know the image format
(or does it? I couldn't find a simple way of telling), so it can't make
sense of subsampling or partial render options. SampledImageReader checks
this flag, and if it is not set to true it does the subsampling by itself.
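
The "honored" handshake described above could look roughly like this. It is a sketch under stated assumptions: DecodeOptionsSketch, OptionAwareDecoder and HonoredFlagDemo are hypothetical names, and the real DecodeOptions in the patch may be shaped differently.

```java
import java.awt.Rectangle;

/** Hypothetical stand-in for the DecodeOptions class described in the post. */
class DecodeOptionsSketch
{
    private final Rectangle sourceRegion; // null means the whole image
    private final int subsampling;        // 1 means every pixel
    private final boolean metadataOnly;   // only the decode result is wanted
    private boolean honored;              // set by a filter that applied the options

    DecodeOptionsSketch(Rectangle sourceRegion, int subsampling, boolean metadataOnly)
    {
        if (subsampling < 1)
        {
            throw new IllegalArgumentException("subsampling must be >= 1");
        }
        this.sourceRegion = sourceRegion;
        this.subsampling = subsampling;
        this.metadataOnly = metadataOnly;
    }

    Rectangle getSourceRegion() { return sourceRegion; }
    int getSubsampling() { return subsampling; }
    boolean isMetadataOnly() { return metadataOnly; }
    boolean isHonored() { return honored; }
    void setHonored(boolean honored) { this.honored = honored; }
}

/** Hypothetical filter contract: a decoder calls setHonored(true) if it applied the options. */
interface OptionAwareDecoder
{
    void decode(DecodeOptionsSketch options);
}

class HonoredFlagDemo
{
    /** Runs the decoder and reports whether the caller must subsample by itself. */
    static boolean needsFallback(OptionAwareDecoder decoder, DecodeOptionsSketch options)
    {
        decoder.decode(options);
        return !options.isHonored();
    }
}
```

An image-aware decoder (DCT, JBIG2) would set the flag; a generic one (Flate, LZW) would leave it untouched, which tells the caller to fall back to its own subsampling.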

This allows the addition of a method in PDImage:

    BufferedImage getImage(Rectangle region, int subsample) throws IOException;

Its result is not cached, as it is not "canonical".
When drawing an image, PDPageDrawer calculates a subsampling factor based
on the desired size:

    int subsample = (int) Math.floor(pdImage.getWidth() / at.getScaleX());
    if (subsample < 1) subsample = 1;
    if (subsample > 8) subsample = 8;
    drawBufferedImage(pdImage.getImage(null, subsample), at);

So if, for example, the image is to be drawn at 0.5 times its pixel size,
it will be subsampled at 2-pixel intervals.
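
As a worked example of that calculation, here is a standalone version with a few sample values. It assumes that at.getScaleX() is the width in device pixels that the image will occupy; the class and method names are illustrative, and the guard against non-positive widths is an addition that the snippet above does not have.

```java
class SubsampleFactor
{
    /**
     * Returns floor(imageWidth / targetWidth), clamped to the range 1..8.
     * Non-positive or NaN inputs fall back to 1 (assumption: no subsampling is the safe default).
     */
    static int forTargetWidth(int imageWidth, double targetWidth)
    {
        if (imageWidth <= 0 || !(targetWidth > 0))
        {
            return 1;
        }
        int subsample = (int) Math.floor(imageWidth / targetWidth);
        return Math.max(1, Math.min(8, subsample));
    }
}
```

A 4000-pixel-wide image drawn 2000 device pixels wide gives a factor of 2; drawn 100 pixels wide, the raw ratio of 40 is clamped to 8; drawn larger than its pixel size, the factor stays at 1.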

SampledImageReader issues the corresponding DecodeOptions to
PDImage#createInputStream when rendering, and if the "honored" flag is not
set, it does its own sub-sampling and partial rendering.
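
The fallback path, where the reader subsamples and crops by itself, can be illustrated on a raw raster. This sketch assumes one byte per pixel in row-major order, which is a simplification; SampledImageReader handles many more layouts, and FallbackSubsampler is a made-up name.

```java
import java.awt.Rectangle;

class FallbackSubsampler
{
    /**
     * Copies every `step`-th pixel of `region` (null = whole image) out of a fully
     * decoded raster. `src` holds width * height bytes, one per pixel, row by row.
     */
    static byte[] subsample(byte[] src, int width, int height, Rectangle region, int step)
    {
        if (step < 1)
        {
            throw new IllegalArgumentException("step must be >= 1");
        }
        Rectangle full = new Rectangle(0, 0, width, height);
        Rectangle clip = region == null ? full : full.intersection(region);
        if (clip.isEmpty())
        {
            return new byte[0];
        }
        // Round up so a partial last interval still contributes one pixel.
        int outW = (clip.width + step - 1) / step;
        int outH = (clip.height + step - 1) / step;
        byte[] out = new byte[outW * outH];
        for (int y = 0; y < outH; y++)
        {
            int srcRow = (clip.y + y * step) * width;
            for (int x = 0; x < outW; x++)
            {
                out[y * outW + x] = src[srcRow + clip.x + x * step];
            }
        }
        return out;
    }
}
```

Note that this path still needs the full decoded data as input, so it saves memory only for the output image, not for the decode step.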

I realize most/all of those optimizations won't work for raw, Flate or LZW
encoded images, but presumably those won't be too large in the first place.
Also, this has little to no benefit for PDInlineImage, but as it already
holds all of its raw data I assume little optimization is possible.

In general, this hack allowed me to speed up rendering of some files by
significant margins (20%-80%, depending on size and desired DPI), and to
significantly lower the memory footprint when only a lower-resolution
render, or only a small region of the image, is required.

[1]:
https://lists.apache.org/thread.html/6b396e3d8bfc4ed44bcadf37881035d7447fb711253ef962f187455c@%3Cusers.pdfbox.apache.org%3E

[2]: https://issues.apache.org/jira/browse/PDFBOX-3340

Re: Allowing subsampled/downscaled rendering of images, or rendering of subregions of images

Posted by Tilman Hausherr <TH...@t-online.de>.

On 01.03.2018 at 23:10, Itai wrote:
> How would I go about testing it, other than collecting many PDFs with lots
> of images, and timing the calls to getImage?

Yes... for specific PDFs with 1-bit images, just open them in PDFDebugger
a few times, and look at the status line. Just press CTRL-R to reload.
The first value is a bit higher due to some initializations; the second
one is realistic. Then compare these times with a trunk build without the
changes.
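
If a scripted comparison is preferred over reading the status line, a small harness along these lines would apply the same idea of discarding the first, initialization-heavy run. It is only a sketch: RenderTiming and medianNanos are made-up names, and the Runnable stands for whatever rendering call is being measured.

```java
import java.util.Arrays;

class RenderTiming
{
    /** Runs `task` `warmups` times untimed, then `runs` times timed; returns the median in ns. */
    static long medianNanos(Runnable task, int warmups, int runs)
    {
        if (runs < 1)
        {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++)
        {
            task.run(); // absorbs class loading and other one-time initialization
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++)
        {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Running the same task against a patched build and an unpatched trunk build, and comparing the two medians, mirrors the manual procedure described above.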

I'll test the same too, at a later time.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: Allowing subsampled/downscaled rendering of images, or rendering of subregions of images

Posted by Tilman Hausherr <TH...@t-online.de>.

On 01.03.2018 at 23:10, Itai wrote:
> I'm not sure I understand your question about scratch file - the
> METADATA_ONLY option is currently only passed in the constructor of
> PDImageXObject, where the decoding was only done for the benefit of the
> "repair" method...
Sorry, I just looked at it again and maybe I've confused something. 
Ignore what I wrote in that line.

Tilman




Re: Allowing subsampled/downscaled rendering of images, or rendering of subregions of images

Posted by Itai <it...@gmail.com>.

Thank you for the reply.

I have opened an issue: https://issues.apache.org/jira/browse/PDFBOX-4137

I have attached a revised patch to it - I found some bugs and
inconsistencies in the (erroneous) way I was calculating the target
image size.
I also added documentation and javadoc to methods and classes I've added.

I tried to revert all of the formatting changes, but for some reason
IntelliJ keeps bringing them in as it creates the patch (even though I
asked it not to reformat the code).
After some more struggles I think I have managed to "sanitize" the patch
(see latest attachment to the issue).

Theoretically, the changes shouldn't slow down the normal path, as it is
supposed to be functionally the same.
In practice I guess there may be some minor losses due to differences
between e.g. "++x" and "x+=1".
How would I go about testing it, other than collecting many PDFs with lots
of images, and timing the calls to getImage?

I'm not sure I understand your question about scratch file - the
METADATA_ONLY option is currently only passed in the constructor of
PDImageXObject, where the decoding was only done for the benefit of the
"repair" method...

Itai.


On Thu, Mar 1, 2018 at 7:12 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Thanks, sounds interesting. There's definitely a need for that. Just
> create an issue in JIRA with your text and your patch.
> https://issues.apache.org/jira/browse/PDFBOX
>
> About your patch:
> Please remove any changes that are just reformatting. That makes more work
> for us, because it shows more changes than there really are. I try to
> understand everything, not just test that it works. Example:
>
> -                int r = clamp( (1.164f * (Y-16)) + (1.596f * (Cr - 128)) );
> -                int g = clamp( (1.164f * (Y-16)) + (-0.392f * (Cb-128)) + (-0.813f * (Cr-128)));
> -                int b = clamp( (1.164f * (Y-16)) + (2.017f * (Cb-128)));
> +                int r = clamp((1.164f * (Y - 16)) + (1.596f * (Cr - 128)));
> +                int g = clamp((1.164f * (Y - 16)) + (-0.392f * (Cb - 128)) + (-0.813f * (Cr -
> +                        128)));
> +                int b = clamp((1.164f * (Y - 16)) + (2.017f * (Cb - 128)));
>
> Be aware that if your patch changes the public API, then it won't be used
> in the 2.0 branch. (Your patch should still be against the trunk).
>
> Also make sure that your changes in SampledImageReader don't make the
> "normal" path (i.e. reading the entire stream and converting it to an
> image) slower. The current code is the result of several optimizations.
>
> Public API (e.g. DecodeOptions) should have some javadoc. I have no idea
> what "honored" does.
>
> The decode with METADATA_ONLY - does it mean nothing is decoded if there
> is a scratch file???
>
> Tilman

Re: Allowing subsampled/downscaled rendering of images, or rendering of subregions of images

Posted by Tilman Hausherr <TH...@t-online.de>.

Thanks, sounds interesting. There's definitely a need for that. Just
create an issue in JIRA with your text and your patch.
https://issues.apache.org/jira/browse/PDFBOX

About your patch:
Please remove any changes that are just reformatting. That makes more 
work for us, because it shows more changes than there really are. I try 
to understand everything, not just test that it works. Example:

-                int r = clamp( (1.164f * (Y-16)) + (1.596f * (Cr - 128)) );
-                int g = clamp( (1.164f * (Y-16)) + (-0.392f * (Cb-128)) + (-0.813f * (Cr-128)));
-                int b = clamp( (1.164f * (Y-16)) + (2.017f * (Cb-128)));
+                int r = clamp((1.164f * (Y - 16)) + (1.596f * (Cr - 128)));
+                int g = clamp((1.164f * (Y - 16)) + (-0.392f * (Cb - 128)) + (-0.813f * (Cr -
+                        128)));
+                int b = clamp((1.164f * (Y - 16)) + (2.017f * (Cb - 128)));

Be aware that if your patch changes the public API, then it won't be 
used in the 2.0 branch. (Your patch should still be against the trunk).

Also make sure that your changes in SampledImageReader don't make the 
"normal" path (i.e. reading the entire stream and converting it to an 
image) slower. The current code is the result of several optimizations.

Public API (e.g. DecodeOptions) should have some javadoc. I have no idea 
what "honored" does.

The decode with METADATA_ONLY - does it mean nothing is decoded if there 
is a scratch file???

Tilman



Re: Allowing subsampled/downscaled rendering of images, or rendering of subregions of images

Posted by Itai <it...@gmail.com>.

My apologies, it seems the last patch did not include the added file
DecodeOptions.java. Attached is a (hopefully) fixed patch.

Itai.
