Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/02/23 20:53:44 UTC

[jira] [Commented] (PDFBOX-3700) OutOfMemoryException converting PDF to TIFF Images

    [ https://issues.apache.org/jira/browse/PDFBOX-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881238#comment-15881238 ] 

Tilman Hausherr commented on PDFBOX-3700:
-----------------------------------------

By "diff" / "patch" I meant a file attachment, but it's OK as long as it is clear what you mean.

One could just as well have the whole {{if (xobject instanceof PDImageXObject)}} segment simply "return false"; the effect would be the same.

Anyway, what your change does is simply to disable the cache for images.

What I'd prefer is to verify/falsify your theory, which is
{quote}
this is caused by the images cached in DefaultResourceCache
{quote}

The cache uses soft references so that it caches when possible but releases entries when memory is needed.
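
To illustrate the idea (this is a minimal sketch, not the code of DefaultResourceCache; the class name {{SoftCacheSketch}} and its methods are hypothetical), a cache built on {{java.lang.ref.SoftReference}} could look roughly like this:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT PDFBox's DefaultResourceCache.
class SoftCacheSketch<K, V>
{
    // Values are held only through soft references, so the garbage
    // collector is allowed to reclaim them when memory runs low.
    private final Map<K, SoftReference<V>> entries = new HashMap<K, SoftReference<V>>();

    void put(K key, V value)
    {
        entries.put(key, new SoftReference<V>(value));
    }

    // Returns the cached value, or null if the key was never cached
    // or the garbage collector has already cleared the entry.
    V get(K key)
    {
        SoftReference<V> ref = entries.get(key);
        if (ref == null)
        {
            return null;
        }
        V value = ref.get();
        if (value == null)
        {
            // The referent was collected; drop the stale map entry.
            entries.remove(key);
        }
        return value;
    }
}
```

The JVM guarantees that softly reachable objects are cleared before it throws an OutOfMemoryError. An object that is still strongly referenced elsewhere, for example by a thread that is rendering it at that moment, is not softly reachable and cannot be released this way.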

So either this works, in which case your OOM has another cause (it could be that you no longer get the OOM simply because everything is slower); or it doesn't work, in which case we should fix it instead of disabling it.

The best approach would be a test that uses a fixed-size thread pool and a PDF with many images. Your theory would be supported if the test fails with the cache and succeeds without it.
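
A rough sketch of such a harness, using only java.util.concurrent (the class {{FixedPoolHarness}} and the method {{countOutOfMemory}} are hypothetical names). Each job is assumed to load the PDF and render all of its pages with PDFRenderer; that part is left out here, so the sketch does not depend on any particular PDFBox call:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical test harness; the rendering jobs themselves are supplied by the caller.
class FixedPoolHarness
{
    // Runs all jobs on a pool with a fixed number of threads and
    // returns how many of them ended with an OutOfMemoryError.
    static int countOutOfMemory(int threads, List<Callable<Void>> jobs)
    {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try
        {
            // invokeAll blocks until every job has completed or failed.
            List<Future<Void>> futures = pool.invokeAll(jobs);
            int outOfMemory = 0;
            for (Future<Void> future : futures)
            {
                try
                {
                    future.get();
                }
                catch (ExecutionException e)
                {
                    // A Throwable thrown inside a job arrives here as the cause.
                    if (e.getCause() instanceof OutOfMemoryError)
                    {
                        outOfMemory++;
                    }
                    else
                    {
                        throw new IllegalStateException("job failed for another reason", e.getCause());
                    }
                }
            }
            return outOfMemory;
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        }
        finally
        {
            pool.shutdown();
        }
    }
}
```

The same job list would be run twice with the same pool size and the same heap limit, once with the default cache and once with image caching disabled, and the two counts compared.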

> OutOfMemoryException converting PDF to TIFF Images
> --------------------------------------------------
>
>                 Key: PDFBOX-3700
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3700
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.4
>            Reporter: Viraf Bankwalla
>
> I am using PDFBox to convert PDF documents to a series of TIFF images (one for each page).  The implementation uses PDFRenderer to render each page.  Things work fine when I am processing a single document in a single thread, however when I try to process multiple documents (each in its own thread) I get an OutOfMemoryException.
> In analyzing the heap dump, I see that this is caused by the images cached in DefaultResourceCache.  Objects are added to the cache in PDResources, which includes a method private boolean isAllowedCache(PDXObject xobject) that is used to determine whether a PDXObject can be cached.  I have extended this to filter out COSName.IMAGE, and am now able to process multiple documents in parallel.
> A proposed fix would be to include images in the set of objects that are not added to the cache.  For example, the following could be added to PDResources.isAllowedCache:
> {code:title=Bar.java|borderStyle=solid}
> COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
> if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
> {
>     return false;
> }
> {code}
> A possible patch is enclosed below.  I would like to get a fix in for the next release.
> diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
> index 6e1e464..aa94122 100644
> --- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
> +++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
> @@ -31,15 +31,15 @@
>  import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
>  import org.apache.pdfbox.pdmodel.font.PDFont;
>  import org.apache.pdfbox.pdmodel.font.PDFontFactory;
> +import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
>  import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
> +import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>  import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
> -import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
> -import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
>  import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
>  import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
> -import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> -import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> +import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
>  
>  /**
>   * A set of resources available at the page/pages/stream level.
> @@ -445,6 +445,12 @@
>                      return false;
>                  }
>              }
> +            
> +            COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
> +            if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
> +            {
> +            	return false;
> +            }
>          }
>          return true;
>      }


