Posted to users@pdfbox.apache.org by Gilad Denneboom <gi...@gmail.com> on 2018/12/16 13:42:28 UTC

Re: Loading documents with a large amount of annotations

Try passing a MemoryUsageSetting as the second parameter of the load method
so that it buffers to a temp file instead of main memory. That can help when
dealing with very large or complex files, I believe.
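
A minimal sketch of that suggestion, assuming the PDFBox 2.0.x API, where
PDDocument.load(File, MemoryUsageSetting) and
MemoryUsageSetting.setupTempFileOnly() are available. The class name, helper
name, and file path are hypothetical and only there for illustration:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

public class TempFileLoad {

    // Hypothetical helper: loads a PDF while asking PDFBox to buffer its
    // scratch data in a temp file instead of main memory (assumes PDFBox 2.0.x).
    static PDDocument loadWithTempFile(File pdf) throws IOException {
        return PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly());
    }

    public static void main(String[] args) throws IOException {
        // "annotated.pdf" is a placeholder path, not a file from this thread.
        File pdf = new File(args.length > 0 ? args[0] : "annotated.pdf");
        try (PDDocument doc = loadWithTempFile(pdf)) {
            System.out.println("Pages: " + doc.getNumberOfPages());
        }
    }
}
```

As far as I know, this setting mainly changes where stream data is buffered.
It does not skip annotations during parsing, so it may reduce heap usage
without making the load itself much faster.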

On Tue, Nov 13, 2018 at 8:23 AM Nick Westerly <de...@gmail.com> wrote:

> Hi -
>
> I am trying to load a document that has a lot of annotations (50k+), e.g.
> comments, highlights, etc. However, just calling 'load' on the document is
> extremely slow and uses a lot of memory (2 GB+).
>
> I actually don't need to use or access annotations at all (I'm using PDFBox
> through a separate library that doesn't need them), but I do need access to
> the PDDocument. Is there a way to load a document but ignore all
> annotations when parsing? Similarly, can items such as fonts associated
> with those annotation objects be ignored?
>
> I was browsing through PDFParser#initialParse and COSParser, but I am a
> little out of my depth.
> Even something as simple as ignoring objects of some 'type' that I
> could check would help.
>
> Any suggestions, even partial, would be helpful.
>
> Thanks.
>
> Nick
>