You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@pdfbox.apache.org by Esteban R <er...@hotmail.com> on 2019/09/26 19:39:33 UTC

performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Hello. I'm getting a timeout in one of my tests after upgrading to v 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead of less than 2 seconds with previous release 2.0.16.

I cannot provide the sample PDF because it contains sensitive information. I have tried to simplify it but the issue (almost) dissappears even if I save the file without changing anything.

Some more facts:

  *   The issue happens if the load is done with MemoryUsageSetting.setupTempFileOnly (without that flag the issue doesn't happen)
  *   The performance is OK for v 2.0.16
  *   Please find attached the stack trace of the timeout of my original test
  *   PDF structure is quite simple and contains a single image (find attached the relevant data from pdfdebugger)
  *   I have created a sample program to demonstrate the issue: it simply loads the PDF file with the setupTempFileOnly flag and does getImage for all the images (only one in the PDF). It then does the same thing without the flag and after that it does the same thing with another PDF: it is simply the same file saved by PDFBOX with another name.

Output with pdfbox 2.0.16:
java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;." TestExtractImage sample.pdf
With    temp file
    Before getImage: Thu Sep 26 16:32:46 ART 2019
    After  getImage: Thu Sep 26 16:32:46 ART 2019
Without temp file
    Before getImage: Thu Sep 26 16:32:46 ART 2019
    After  getImage: Thu Sep 26 16:32:46 ART 2019
With    temp file after saving with another name
    Before getImage: Thu Sep 26 16:32:46 ART 2019
    After  getImage: Thu Sep 26 16:32:47 ART 2019
(i.e.: less than 2 seconds in all cases)

Output with pdfbox 2.0.17:
java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;." TestExtractImage sample.pdf
With    temp file
    Before getImage: Thu Sep 26 16:31:09 ART 2019
    After  getImage: Thu Sep 26 16:32:30 ART 2019   => more than 1'20"
Without temp file
    Before getImage: Thu Sep 26 16:32:30 ART 2019
    After  getImage: Thu Sep 26 16:32:30 ART 2019
With    temp file after saving with another name
    Before getImage: Thu Sep 26 16:32:30 ART 2019
    After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"

And here the source code for TestExtractImage.java :

import java.io.File;
import java.io.IOException;
import java.util.Date;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.io.MemoryUsageSetting;

public class TestExtractImage {
    public static void main(String[] args) throws IOException {
        System.out.println("With    temp file");
        PDDocument d = PDDocument.load(new File(args[0]), MemoryUsageSetting.setupTempFileOnly() );
        getImage(d);
        d.close();

        System.out.println("Without temp file");
        d = PDDocument.load(new File(args[0]));
        getImage(d);
        d.save("other.pdf");
        d.close();
        System.out.println("With    temp file after saving with another name");
        d = PDDocument.load(new File("other.pdf"), MemoryUsageSetting.setupTempFileOnly() );
        getImage(d);
    }

    static void getImage(PDDocument d) throws IOException {
        PDResources res = d.getPage(0).getResources();
        for (COSName n: res.getXObjectNames()) {
            PDXObject o = res.getXObject(n);
            if (o instanceof PDImageXObject){
                PDImageXObject i = (PDImageXObject) o;
                if ("png".equals(i.getSuffix())){
                    System.out.println("    Before getImage: "+new Date());
                    i.getImage();
                    System.out.println("    After  getImage: "+new Date());
                }
            }
        }
    }
}

RE: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Esteban R <er...@hotmail.com>.

>> Small adition: I have rebooted my machine and run some tests again in GNU/Linux (Ubuntu): performance is good for 2.0.17 and 2.0.18! I have not even moved the files.
>So the whole problem went away thanks to the reboot?

No, it didn't. I meant that the issue happens in Windows but it doesn't occur in GNU/Linux (performance is OK in GNU/Linux, same machine, same jars, and even same file system). I have tried with openjdk version "1.8.0_222" and java version "1.8.0_121" in GNU/Linux.

Esteban

Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 21.10.2019 um 18:53 schrieb Esteban R:
> Small adition: I have rebooted my machine and run some tests again in GNU/Linux (Ubuntu): performance is good for 2.0.17 and 2.0.18! I have not even moved the files.

So the whole problem went away thanks to the reboot?

After looking at the log in PDFBOX-4601 (search for "fileLen after1:"), 
I conclude that the part that I commented out isn't really needed.

Tilman



>
> Esteban
> ________________________________
> De: Esteban R <er...@hotmail.com>
> Enviado: lunes, 21 de octubre de 2019 16:09
> Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
> Asunto: RE: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly
>
> I have tried with v 2.0.18. It still contains a call to file.length() in the following if:
>
>                      if (fileLen != raf.length()  || fileLen != file.length())
>
>   Performance has improved compared with v.2.0.17 but is still slower than v.2.0.16, so I have also tried commenting out the call to file.length (v 2.0.18-patched):
>
>                      if (fileLen != raf.length()  /* || fileLen != file.length() */ )
>
> That improved the performance back to the values of v 2.0.16.
>
> Results are as following:
>
> time_with_tempfile time_without_tempfile version
> 389 229 2.0.16
> 77950 208 2.0.17
> 39322 208 2.0.18-SNAPSHOT
> 395 223 2.0.18-patched
>
> I have also compared performance with the next slower pdf (sample2.pdf):
> 5270 4584 2.0.16
> 20377 4528 2.0.17
> 12925 4578 2.0.18-SNAPSHOT
>   5258 4465 2.0.18-patched
>
> In case that helps, I have also counted the number of calls to enlarge():
> sample.pdf: 418
> sample2.pdf: 1960 (much more than the "really slow" case!)
>
> So, for me it is clear that the call to file.length() is slow (and exceptionaly slow in one case!?!?). I will solve my issue by reverting ScratchFile to v.2.0.16 in v.2.0.17 by now.
>
> Esteban
> ________________________________
> De: Tilman Hausherr <TH...@t-online.de>
> Enviado: viernes, 18 de octubre de 2019 18:07
> Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
> Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly
>
> Hi,
> Could you try getting the current 2.0.18 from svn? Alternatively, use
> this code:
>
>                   long fileLen = raf.length();
>                   long expectedFileLen = ((long)pageCount -
> inMemoryMaxPageCount) * PAGE_SIZE;
>
>                   if (expectedFileLen != fileLen)
>                   {
>                       throw new IOException("Expected scratch file size
> of " + expectedFileLen +
>                                             " but found " + fileLen + "
> in file " + file);
>                   }
>
>                   // enlarge if we do not int overflow
>                   if (pageCount + ENLARGE_PAGE_COUNT > pageCount)
>                   {
>                       if (LOG.isDebugEnabled())
>                       {
>                           LOG.debug("file: " + file);
>                           LOG.debug("fileLen before: " + fileLen + ", raf
> length: " + raf.length() +
>                                     ", file length: " + file.length());
>                       }
>                       fileLen += ENLARGE_PAGE_COUNT * PAGE_SIZE;
>
>                       raf.setLength(fileLen);
>                       if (LOG.isDebugEnabled())
>                       {
>                           LOG.debug("fileLen after1: " + fileLen + ", raf
> length: " + raf.length() +
>                                     ", file length: " + file.length());
>                       }
>                       if (fileLen != raf.length() || fileLen !=
> file.length())
>                       {
>                           // PDFBOX-4601 possible AWS lambda bug that
> setLength() doesn't throw
>                           // if not enough space
>                           long origFilePointer = raf.getFilePointer();
>                           raf.seek(fileLen - 1);
>                           raf.write(0);
>                           raf.seek(origFilePointer);
>                           LOG.debug("fileLen after2:  " + fileLen + ",
> raf length: " + raf.length() + ", file length: " + file.length());
>                       }
>                       freePages.set(pageCount, pageCount +
> ENLARGE_PAGE_COUNT);
>                   }
>
> or just remove the log messages from the 2.0.17 source. Now there is
> still a length() operation but that is "raf.length()" so maybe it is
> different.
>
> Sadly we can't go without that, until Amazon has fixed their systems.
>
> Re the difference in processing time between 16 and 17 - I don't know
> because you can't share your file. If you have a lot of time, try with
> different versions from the repostory, i.e. a "bisect" strategy. It
> takes hours, and is a work for patient people.
>
> Tilman
>
> Am 18.10.2019 um 18:45 schrieb Esteban R:
>> Yes, it is. I have reverted ScratchFile.java and processing now takes less than 2 secs.
>>
>> I have noticed that the problem may be related to the call to file.length(): even if I add only the LOGs (or even more: if I only add a line: file.length(); to ScratchFile.java) then performance drops.
>>
>> I have also tested with another +48000 pdf files: I have not found another pdf file with a so big processing time. These are the results for the most time-consuming pdfs. I have noticed that processing time in v2.0.17 seems to be bigger for these files even if I don't use temp files (maybe the machine was more loaded during the tests?)
>>
>> Columns:
>> v2.0.16_with_tempfile v2.0.16_without_tempfile  v2.0.17_with_tempfile v2_0_17_without_tempfile
>> 1635 1410 4304 2581
>> 1642 1528 4304 2585
>> 1652 1373 4483 2662
>> 1652 1461 4556 2727
>> 1684 1441 4626 2835
>> 1833 1339 4362 2610
>> 1863 1641 4436 2689
>> 4961 4350 18605 4296
>>
>>
>> Esteban
>>
>>
>>
>>
>> ________________________________
>> De: Tilman Hausherr <TH...@t-online.de>
>> Enviado: jueves, 26 de septiembre de 2019 19:45
>> Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
>> Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly
>>
>> Ouch... Might be the workaround implemented for this issue:
>> https://issues.apache.org/jira/browse/PDFBOX-4601
>>
>> Tilman
>>
>> Am 26.09.2019 um 21:39 schrieb Esteban R:
>>> Hello. I'm getting a timeout in one of my tests after upgrading to v
>>> 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead
>>> of less than 2 seconds with previous release 2.0.16.
>>>
>>> I cannot provide the sample PDF because it contains sensitive
>>> information. I have tried to simplify it but the issue (almost)
>>> dissappears even if I save the file without changing anything.
>>>
>>> Some more facts:
>>>
>>>     * The issue happens if the load is done with
>>>       MemoryUsageSetting.setupTempFileOnly (without that flag the issue
>>>       doesn't happen)
>>>     * The performance is OK for v 2.0.16
>>>     * Please find attached the stack trace of the timeout of my original
>>>       test
>>>     * PDF structure is quite simple and contains a single image (find
>>>       attached the relevant data from pdfdebugger)
>>>     * I have created a sample program to demonstrate the issue: it
>>>       simply loads the PDF file with the setupTempFileOnly flag and does
>>>       getImage for all the images (only one in the PDF). It then does
>>>       the same thing without the flag and after that it does the same
>>>       thing with another PDF: it is simply the same file saved by PDFBOX
>>>       with another name.
>>>
>>> Output with pdfbox 2.0.16:
>>> java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;."
>>> TestExtractImage sample.pdf
>>> With    temp file
>>>       Before getImage: Thu Sep 26 16:32:46 ART 2019
>>>       After  getImage: Thu Sep 26 16:32:46 ART 2019
>>> Without temp file
>>>       Before getImage: Thu Sep 26 16:32:46 ART 2019
>>>       After  getImage: Thu Sep 26 16:32:46 ART 2019
>>> With    temp file after saving with another name
>>>       Before getImage: Thu Sep 26 16:32:46 ART 2019
>>>       After  getImage: Thu Sep 26 16:32:47 ART 2019
>>> (i.e.: less than 2 seconds in all cases)
>>>
>>> Output with pdfbox 2.0.17:
>>> java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;."
>>> TestExtractImage sample.pdf
>>> With    temp file
>>>       Before getImage: Thu Sep 26 16:31:09 ART 2019
>>>       After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
>>> Without temp file
>>>       Before getImage: Thu Sep 26 16:32:30 ART 2019
>>>       After  getImage: Thu Sep 26 16:32:30 ART 2019
>>> With    temp file after saving with another name
>>>       Before getImage: Thu Sep 26 16:32:30 ART 2019
>>>       After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"
>>>
>>> And here the source code for TestExtractImage.java :
>>>
>>> import java.io.File;
>>> import java.io.IOException;
>>> import java.util.Date;
>>> import org.apache.pdfbox.cos.COSName;
>>> import org.apache.pdfbox.pdmodel.PDDocument;
>>> import org.apache.pdfbox.pdmodel.PDResources;
>>> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
>>> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>>> import org.apache.pdfbox.io.MemoryUsageSetting;
>>>
>>> public class TestExtractImage {
>>>       public static void main(String[] args) throws IOException {
>>>           System.out.println("With    temp file");
>>>           PDDocument d = PDDocument.load(new File(args[0]),
>>> MemoryUsageSetting.setupTempFileOnly() );
>>>           getImage(d);
>>>           d.close();
>>>
>>>           System.out.println("Without temp file");
>>>           d = PDDocument.load(new File(args[0]));
>>>           getImage(d);
>>>           d.save("other.pdf");
>>>           d.close();
>>>           System.out.println("With    temp file after saving with
>>> another name");
>>>           d = PDDocument.load(new File("other.pdf"),
>>> MemoryUsageSetting.setupTempFileOnly() );
>>>           getImage(d);
>>>       }
>>>
>>>       static void getImage(PDDocument d) throws IOException {
>>>           PDResources res = d.getPage(0).getResources();
>>>           for (COSName n: res.getXObjectNames()) {
>>>               PDXObject o = res.getXObject(n);
>>>               if (o instanceof PDImageXObject){
>>>                   PDImageXObject i = (PDImageXObject) o;
>>>                   if ("png".equals(i.getSuffix())){
>>>                       System.out.println("    Before getImage: "+new
>>> Date());
>>>                       i.getImage();
>>>                       System.out.println("    After  getImage: "+new
>>> Date());
>>>                   }
>>>               }
>>>           }
>>>       }
>>> }
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

RE: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Esteban R <er...@hotmail.com>.

Small adition: I have rebooted my machine and run some tests again in GNU/Linux (Ubuntu): performance is good for 2.0.17 and 2.0.18! I have not even moved the files.

Esteban
________________________________
De: Esteban R <er...@hotmail.com>
Enviado: lunes, 21 de octubre de 2019 16:09
Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
Asunto: RE: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

I have tried with v 2.0.18. It still contains a call to file.length() in the following if:

                    if (fileLen != raf.length()  || fileLen != file.length())

 Performance has improved compared with v.2.0.17 but is still slower than v.2.0.16, so I have also tried commenting out the call to file.length (v 2.0.18-patched):

                    if (fileLen != raf.length()  /* || fileLen != file.length() */ )

That improved the performance back to the values of v 2.0.16.

Results are as following:

time_with_tempfile time_without_tempfile version
389 229 2.0.16
77950 208 2.0.17
39322 208 2.0.18-SNAPSHOT
395 223 2.0.18-patched

I have also compared performance with the next slower pdf (sample2.pdf):
5270 4584 2.0.16
20377 4528 2.0.17
12925 4578 2.0.18-SNAPSHOT
 5258 4465 2.0.18-patched

In case that helps, I have also counted the number of calls to enlarge():
sample.pdf: 418
sample2.pdf: 1960 (much more than the "really slow" case!)

So, for me it is clear that the call to file.length() is slow (and exceptionaly slow in one case!?!?). I will solve my issue by reverting ScratchFile to v.2.0.16 in v.2.0.17 by now.

Esteban
________________________________
De: Tilman Hausherr <TH...@t-online.de>
Enviado: viernes, 18 de octubre de 2019 18:07
Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Hi,
Could you try getting the current 2.0.18 from svn? Alternatively, use
this code:

                 long fileLen = raf.length();
                 long expectedFileLen = ((long)pageCount -
inMemoryMaxPageCount) * PAGE_SIZE;

                 if (expectedFileLen != fileLen)
                 {
                     throw new IOException("Expected scratch file size
of " + expectedFileLen +
                                           " but found " + fileLen + "
in file " + file);
                 }

                 // enlarge if we do not int overflow
                 if (pageCount + ENLARGE_PAGE_COUNT > pageCount)
                 {
                     if (LOG.isDebugEnabled())
                     {
                         LOG.debug("file: " + file);
                         LOG.debug("fileLen before: " + fileLen + ", raf
length: " + raf.length() +
                                   ", file length: " + file.length());
                     }
                     fileLen += ENLARGE_PAGE_COUNT * PAGE_SIZE;

                     raf.setLength(fileLen);
                     if (LOG.isDebugEnabled())
                     {
                         LOG.debug("fileLen after1: " + fileLen + ", raf
length: " + raf.length() +
                                   ", file length: " + file.length());
                     }
                     if (fileLen != raf.length() || fileLen !=
file.length())
                     {
                         // PDFBOX-4601 possible AWS lambda bug that
setLength() doesn't throw
                         // if not enough space
                         long origFilePointer = raf.getFilePointer();
                         raf.seek(fileLen - 1);
                         raf.write(0);
                         raf.seek(origFilePointer);
                         LOG.debug("fileLen after2:  " + fileLen + ",
raf length: " + raf.length() + ", file length: " + file.length());
                     }
                     freePages.set(pageCount, pageCount +
ENLARGE_PAGE_COUNT);
                 }

or just remove the log messages from the 2.0.17 source. Now there is
still a length() operation but that is "raf.length()" so maybe it is
different.

Sadly we can't go without that, until Amazon has fixed their systems.

Re the difference in processing time between 16 and 17 - I don't know
because you can't share your file. If you have a lot of time, try with
different versions from the repostory, i.e. a "bisect" strategy. It
takes hours, and is a work for patient people.

Tilman

Am 18.10.2019 um 18:45 schrieb Esteban R:
> Yes, it is. I have reverted ScratchFile.java and processing now takes less than 2 secs.
>
> I have noticed that the problem may be related to the call to file.length(): even if I add only the LOGs (or even more: if I only add a line: file.length(); to ScratchFile.java) then performance drops.
>
> I have also tested with another +48000 pdf files: I have not found another pdf file with a so big processing time. These are the results for the most time-consuming pdfs. I have noticed that processing time in v2.0.17 seems to be bigger for these files even if I don't use temp files (maybe the machine was more loaded during the tests?)
>
> Columns:
> v2.0.16_with_tempfile v2.0.16_without_tempfile  v2.0.17_with_tempfile v2_0_17_without_tempfile
> 1635 1410 4304 2581
> 1642 1528 4304 2585
> 1652 1373 4483 2662
> 1652 1461 4556 2727
> 1684 1441 4626 2835
> 1833 1339 4362 2610
> 1863 1641 4436 2689
> 4961 4350 18605 4296
>
>
> Esteban
>
>
>
>
> ________________________________
> De: Tilman Hausherr <TH...@t-online.de>
> Enviado: jueves, 26 de septiembre de 2019 19:45
> Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
> Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly
>
> Ouch... Might be the workaround implemented for this issue:
> https://issues.apache.org/jira/browse/PDFBOX-4601
>
> Tilman
>
> Am 26.09.2019 um 21:39 schrieb Esteban R:
>> Hello. I'm getting a timeout in one of my tests after upgrading to v
>> 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead
>> of less than 2 seconds with previous release 2.0.16.
>>
>> I cannot provide the sample PDF because it contains sensitive
>> information. I have tried to simplify it but the issue (almost)
>> dissappears even if I save the file without changing anything.
>>
>> Some more facts:
>>
>>    * The issue happens if the load is done with
>>      MemoryUsageSetting.setupTempFileOnly (without that flag the issue
>>      doesn't happen)
>>    * The performance is OK for v 2.0.16
>>    * Please find attached the stack trace of the timeout of my original
>>      test
>>    * PDF structure is quite simple and contains a single image (find
>>      attached the relevant data from pdfdebugger)
>>    * I have created a sample program to demonstrate the issue: it
>>      simply loads the PDF file with the setupTempFileOnly flag and does
>>      getImage for all the images (only one in the PDF). It then does
>>      the same thing without the flag and after that it does the same
>>      thing with another PDF: it is simply the same file saved by PDFBOX
>>      with another name.
>>
>> Output with pdfbox 2.0.16:
>> java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;."
>> TestExtractImage sample.pdf
>> With    temp file
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:46 ART 2019
>> Without temp file
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:46 ART 2019
>> With    temp file after saving with another name
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:47 ART 2019
>> (i.e.: less than 2 seconds in all cases)
>>
>> Output with pdfbox 2.0.17:
>> java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;."
>> TestExtractImage sample.pdf
>> With    temp file
>>      Before getImage: Thu Sep 26 16:31:09 ART 2019
>>      After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
>> Without temp file
>>      Before getImage: Thu Sep 26 16:32:30 ART 2019
>>      After  getImage: Thu Sep 26 16:32:30 ART 2019
>> With    temp file after saving with another name
>>      Before getImage: Thu Sep 26 16:32:30 ART 2019
>>      After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"
>>
>> And here the source code for TestExtractImage.java :
>>
>> import java.io.File;
>> import java.io.IOException;
>> import java.util.Date;
>> import org.apache.pdfbox.cos.COSName;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.pdmodel.PDResources;
>> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
>> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>> import org.apache.pdfbox.io.MemoryUsageSetting;
>>
>> public class TestExtractImage {
>>      public static void main(String[] args) throws IOException {
>>          System.out.println("With    temp file");
>>          PDDocument d = PDDocument.load(new File(args[0]),
>> MemoryUsageSetting.setupTempFileOnly() );
>>          getImage(d);
>>          d.close();
>>
>>          System.out.println("Without temp file");
>>          d = PDDocument.load(new File(args[0]));
>>          getImage(d);
>>          d.save("other.pdf");
>>          d.close();
>>          System.out.println("With    temp file after saving with
>> another name");
>>          d = PDDocument.load(new File("other.pdf"),
>> MemoryUsageSetting.setupTempFileOnly() );
>>          getImage(d);
>>      }
>>
>>      static void getImage(PDDocument d) throws IOException {
>>          PDResources res = d.getPage(0).getResources();
>>          for (COSName n: res.getXObjectNames()) {
>>              PDXObject o = res.getXObject(n);
>>              if (o instanceof PDImageXObject){
>>                  PDImageXObject i = (PDImageXObject) o;
>>                  if ("png".equals(i.getSuffix())){
>>                      System.out.println("    Before getImage: "+new
>> Date());
>>                      i.getImage();
>>                      System.out.println("    After  getImage: "+new
>> Date());
>>                  }
>>              }
>>          }
>>      }
>> }
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 21.10.2019 um 18:09 schrieb Esteban R:
> I have tried with v 2.0.18. It still contains a call to file.length() in the following if:
>
>                      if (fileLen != raf.length()  || fileLen != file.length())
>
>   Performance has improved compared with v.2.0.17 but is still slower than v.2.0.16, so I have also tried commenting out the call to file.length (v 2.0.18-patched):
>
>                      if (fileLen != raf.length()  /* || fileLen != file.length() */ )
>
> That improved the performance back to the values of v 2.0.16.


Thanks for the test, I have committed the modification in 2.0 as suggested.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

RE: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Esteban R <er...@hotmail.com>.

I have tried with v 2.0.18. It still contains a call to file.length() in the following if:

                    if (fileLen != raf.length()  || fileLen != file.length())

 Performance has improved compared with v.2.0.17 but is still slower than v.2.0.16, so I have also tried commenting out the call to file.length (v 2.0.18-patched):

                    if (fileLen != raf.length()  /* || fileLen != file.length() */ )

That improved the performance back to the values of v 2.0.16.

Results are as following:

time_with_tempfile time_without_tempfile version
389 229 2.0.16
77950 208 2.0.17
39322 208 2.0.18-SNAPSHOT
395 223 2.0.18-patched

I have also compared performance with the next slower pdf (sample2.pdf):
5270 4584 2.0.16
20377 4528 2.0.17
12925 4578 2.0.18-SNAPSHOT
 5258 4465 2.0.18-patched

In case that helps, I have also counted the number of calls to enlarge():
sample.pdf: 418
sample2.pdf: 1960 (much more than the "really slow" case!)

So, for me it is clear that the call to file.length() is slow (and exceptionaly slow in one case!?!?). I will solve my issue by reverting ScratchFile to v.2.0.16 in v.2.0.17 by now.

Esteban
________________________________
De: Tilman Hausherr <TH...@t-online.de>
Enviado: viernes, 18 de octubre de 2019 18:07
Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Hi,
Could you try getting the current 2.0.18 from svn? Alternatively, use
this code:

                 long fileLen = raf.length();
                 long expectedFileLen = ((long)pageCount -
inMemoryMaxPageCount) * PAGE_SIZE;

                 if (expectedFileLen != fileLen)
                 {
                     throw new IOException("Expected scratch file size
of " + expectedFileLen +
                                           " but found " + fileLen + "
in file " + file);
                 }

                 // enlarge if we do not int overflow
                 if (pageCount + ENLARGE_PAGE_COUNT > pageCount)
                 {
                     if (LOG.isDebugEnabled())
                     {
                         LOG.debug("file: " + file);
                         LOG.debug("fileLen before: " + fileLen + ", raf
length: " + raf.length() +
                                   ", file length: " + file.length());
                     }
                     fileLen += ENLARGE_PAGE_COUNT * PAGE_SIZE;

                     raf.setLength(fileLen);
                     if (LOG.isDebugEnabled())
                     {
                         LOG.debug("fileLen after1: " + fileLen + ", raf
length: " + raf.length() +
                                   ", file length: " + file.length());
                     }
                     if (fileLen != raf.length() || fileLen !=
file.length())
                     {
                         // PDFBOX-4601 possible AWS lambda bug that
setLength() doesn't throw
                         // if not enough space
                         long origFilePointer = raf.getFilePointer();
                         raf.seek(fileLen - 1);
                         raf.write(0);
                         raf.seek(origFilePointer);
                         LOG.debug("fileLen after2:  " + fileLen + ",
raf length: " + raf.length() + ", file length: " + file.length());
                     }
                     freePages.set(pageCount, pageCount +
ENLARGE_PAGE_COUNT);
                 }

or just remove the log messages from the 2.0.17 source. Now there is
still a length() operation but that is "raf.length()" so maybe it is
different.

Sadly we can't go without that, until Amazon has fixed their systems.

Re the difference in processing time between 16 and 17 - I don't know
because you can't share your file. If you have a lot of time, try with
different versions from the repostory, i.e. a "bisect" strategy. It
takes hours, and is a work for patient people.

Tilman

Am 18.10.2019 um 18:45 schrieb Esteban R:
> Yes, it is. I have reverted ScratchFile.java and processing now takes less than 2 secs.
>
> I have noticed that the problem may be related to the call to file.length(): even if I add only the LOGs (or even more: if I only add a line: file.length(); to ScratchFile.java) then performance drops.
>
> I have also tested with another +48000 pdf files: I have not found another pdf file with a so big processing time. These are the results for the most time-consuming pdfs. I have noticed that processing time in v2.0.17 seems to be bigger for these files even if I don't use temp files (maybe the machine was more loaded during the tests?)
>
> Columns:
> v2.0.16_with_tempfile v2.0.16_without_tempfile  v2.0.17_with_tempfile v2_0_17_without_tempfile
> 1635 1410 4304 2581
> 1642 1528 4304 2585
> 1652 1373 4483 2662
> 1652 1461 4556 2727
> 1684 1441 4626 2835
> 1833 1339 4362 2610
> 1863 1641 4436 2689
> 4961 4350 18605 4296
>
>
> Esteban
>
>
>
>
> ________________________________
> De: Tilman Hausherr <TH...@t-online.de>
> Enviado: jueves, 26 de septiembre de 2019 19:45
> Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
> Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly
>
> Ouch... Might be the workaround implemented for this issue:
> https://issues.apache.org/jira/browse/PDFBOX-4601
>
> Tilman
>
> Am 26.09.2019 um 21:39 schrieb Esteban R:
>> Hello. I'm getting a timeout in one of my tests after upgrading to v
>> 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead
>> of less than 2 seconds with previous release 2.0.16.
>>
>> I cannot provide the sample PDF because it contains sensitive
>> information. I have tried to simplify it but the issue (almost)
>> dissappears even if I save the file without changing anything.
>>
>> Some more facts:
>>
>>    * The issue happens if the load is done with
>>      MemoryUsageSetting.setupTempFileOnly (without that flag the issue
>>      doesn't happen)
>>    * The performance is OK for v 2.0.16
>>    * Please find attached the stack trace of the timeout of my original
>>      test
>>    * PDF structure is quite simple and contains a single image (find
>>      attached the relevant data from pdfdebugger)
>>    * I have created a sample program to demonstrate the issue: it
>>      simply loads the PDF file with the setupTempFileOnly flag and does
>>      getImage for all the images (only one in the PDF). It then does
>>      the same thing without the flag and after that it does the same
>>      thing with another PDF: it is simply the same file saved by PDFBOX
>>      with another name.
>>
>> Output with pdfbox 2.0.16:
>> java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;."
>> TestExtractImage sample.pdf
>> With    temp file
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:46 ART 2019
>> Without temp file
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:46 ART 2019
>> With    temp file after saving with another name
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:47 ART 2019
>> (i.e.: less than 2 seconds in all cases)
>>
>> Output with pdfbox 2.0.17:
>> java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;."
>> TestExtractImage sample.pdf
>> With    temp file
>>      Before getImage: Thu Sep 26 16:31:09 ART 2019
>>      After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
>> Without temp file
>>      Before getImage: Thu Sep 26 16:32:30 ART 2019
>>      After  getImage: Thu Sep 26 16:32:30 ART 2019
>> With    temp file after saving with another name
>>      Before getImage: Thu Sep 26 16:32:30 ART 2019
>>      After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"
>>
>> And here the source code for TestExtractImage.java :
>>
>> import java.io.File;
>> import java.io.IOException;
>> import java.util.Date;
>> import org.apache.pdfbox.cos.COSName;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.pdmodel.PDResources;
>> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
>> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>> import org.apache.pdfbox.io.MemoryUsageSetting;
>>
>> public class TestExtractImage {
>>      public static void main(String[] args) throws IOException {
>>          System.out.println("With    temp file");
>>          PDDocument d = PDDocument.load(new File(args[0]),
>> MemoryUsageSetting.setupTempFileOnly() );
>>          getImage(d);
>>          d.close();
>>
>>          System.out.println("Without temp file");
>>          d = PDDocument.load(new File(args[0]));
>>          getImage(d);
>>          d.save("other.pdf");
>>          d.close();
>>          System.out.println("With    temp file after saving with
>> another name");
>>          d = PDDocument.load(new File("other.pdf"),
>> MemoryUsageSetting.setupTempFileOnly() );
>>          getImage(d);
>>      }
>>
>>      static void getImage(PDDocument d) throws IOException {
>>          PDResources res = d.getPage(0).getResources();
>>          for (COSName n: res.getXObjectNames()) {
>>              PDXObject o = res.getXObject(n);
>>              if (o instanceof PDImageXObject){
>>                  PDImageXObject i = (PDImageXObject) o;
>>                  if ("png".equals(i.getSuffix())){
>>                      System.out.println("    Before getImage: "+new
>> Date());
>>                      i.getImage();
>>                      System.out.println("    After  getImage: "+new
>> Date());
>>                  }
>>              }
>>          }
>>      }
>> }
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,
Could you try getting the current 2.0.18 from svn? Alternatively, use 
this code:

                 long fileLen = raf.length();
                 long expectedFileLen = ((long)pageCount - 
inMemoryMaxPageCount) * PAGE_SIZE;

                 if (expectedFileLen != fileLen)
                 {
                     throw new IOException("Expected scratch file size 
of " + expectedFileLen +
                                           " but found " + fileLen + " 
in file " + file);
                 }

                 // enlarge if we do not int overflow
                 if (pageCount + ENLARGE_PAGE_COUNT > pageCount)
                 {
                     if (LOG.isDebugEnabled())
                     {
                         LOG.debug("file: " + file);
                         LOG.debug("fileLen before: " + fileLen + ", raf 
length: " + raf.length() +
                                   ", file length: " + file.length());
                     }
                     fileLen += ENLARGE_PAGE_COUNT * PAGE_SIZE;

                     raf.setLength(fileLen);
                     if (LOG.isDebugEnabled())
                     {
                         LOG.debug("fileLen after1: " + fileLen + ", raf 
length: " + raf.length() +
                                   ", file length: " + file.length());
                     }
                     if (fileLen != raf.length() || fileLen != 
file.length())
                     {
                         // PDFBOX-4601 possible AWS lambda bug that 
setLength() doesn't throw
                         // if not enough space
                         long origFilePointer = raf.getFilePointer();
                         raf.seek(fileLen - 1);
                         raf.write(0);
                         raf.seek(origFilePointer);
                         LOG.debug("fileLen after2:  " + fileLen + ", 
raf length: " + raf.length() + ", file length: " + file.length());
                     }
                     freePages.set(pageCount, pageCount + 
ENLARGE_PAGE_COUNT);
                 }

or just remove the log messages from the 2.0.17 source. Now there is 
still a length() operation but that is "raf.length()" so maybe it is 
different.

Sadly we can't go without that, until Amazon has fixed their systems.

Re the difference in processing time between 16 and 17 - I don't know 
because you can't share your file. If you have a lot of time, try with 
different versions from the repostory, i.e. a "bisect" strategy. It 
takes hours, and is a work for patient people.

Tilman

Am 18.10.2019 um 18:45 schrieb Esteban R:
> Yes, it is. I have reverted ScratchFile.java and processing now takes less than 2 secs.
>
> I have noticed that the problem may be related to the call to file.length(): even if I add only the LOGs (or even more: if I only add a line: file.length(); to ScratchFile.java) then performance drops.
>
> I have also tested with another +48000 pdf files: I have not found another pdf file with a so big processing time. These are the results for the most time-consuming pdfs. I have noticed that processing time in v2.0.17 seems to be bigger for these files even if I don't use temp files (maybe the machine was more loaded during the tests?)
>
> Columns:
> v2.0.16_with_tempfile v2.0.16_without_tempfile  v2.0.17_with_tempfile v2_0_17_without_tempfile
> 1635 1410 4304 2581
> 1642 1528 4304 2585
> 1652 1373 4483 2662
> 1652 1461 4556 2727
> 1684 1441 4626 2835
> 1833 1339 4362 2610
> 1863 1641 4436 2689
> 4961 4350 18605 4296
>
>
> Esteban
>
>
>
>
> ________________________________
> De: Tilman Hausherr <TH...@t-online.de>
> Enviado: jueves, 26 de septiembre de 2019 19:45
> Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
> Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly
>
> Ouch... Might be the workaround implemented for this issue:
> https://issues.apache.org/jira/browse/PDFBOX-4601
>
> Tilman
>
> Am 26.09.2019 um 21:39 schrieb Esteban R:
>> Hello. I'm getting a timeout in one of my tests after upgrading to v
>> 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead
>> of less than 2 seconds with previous release 2.0.16.
>>
>> I cannot provide the sample PDF because it contains sensitive
>> information. I have tried to simplify it but the issue (almost)
>> dissappears even if I save the file without changing anything.
>>
>> Some more facts:
>>
>>    * The issue happens if the load is done with
>>      MemoryUsageSetting.setupTempFileOnly (without that flag the issue
>>      doesn't happen)
>>    * The performance is OK for v 2.0.16
>>    * Please find attached the stack trace of the timeout of my original
>>      test
>>    * PDF structure is quite simple and contains a single image (find
>>      attached the relevant data from pdfdebugger)
>>    * I have created a sample program to demonstrate the issue: it
>>      simply loads the PDF file with the setupTempFileOnly flag and does
>>      getImage for all the images (only one in the PDF). It then does
>>      the same thing without the flag and after that it does the same
>>      thing with another PDF: it is simply the same file saved by PDFBOX
>>      with another name.
>>
>> Output with pdfbox 2.0.16:
>> java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;."
>> TestExtractImage sample.pdf
>> With    temp file
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:46 ART 2019
>> Without temp file
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:46 ART 2019
>> With    temp file after saving with another name
>>      Before getImage: Thu Sep 26 16:32:46 ART 2019
>>      After  getImage: Thu Sep 26 16:32:47 ART 2019
>> (i.e.: less than 2 seconds in all cases)
>>
>> Output with pdfbox 2.0.17:
>> java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;."
>> TestExtractImage sample.pdf
>> With    temp file
>>      Before getImage: Thu Sep 26 16:31:09 ART 2019
>>      After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
>> Without temp file
>>      Before getImage: Thu Sep 26 16:32:30 ART 2019
>>      After  getImage: Thu Sep 26 16:32:30 ART 2019
>> With    temp file after saving with another name
>>      Before getImage: Thu Sep 26 16:32:30 ART 2019
>>      After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"
>>
>> And here the source code for TestExtractImage.java :
>>
>> import java.io.File;
>> import java.io.IOException;
>> import java.util.Date;
>> import org.apache.pdfbox.cos.COSName;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.pdmodel.PDResources;
>> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
>> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
>> import org.apache.pdfbox.io.MemoryUsageSetting;
>>
>> public class TestExtractImage {
>>      public static void main(String[] args) throws IOException {
>>          System.out.println("With    temp file");
>>          PDDocument d = PDDocument.load(new File(args[0]),
>> MemoryUsageSetting.setupTempFileOnly() );
>>          getImage(d);
>>          d.close();
>>
>>          System.out.println("Without temp file");
>>          d = PDDocument.load(new File(args[0]));
>>          getImage(d);
>>          d.save("other.pdf");
>>          d.close();
>>          System.out.println("With    temp file after saving with
>> another name");
>>          d = PDDocument.load(new File("other.pdf"),
>> MemoryUsageSetting.setupTempFileOnly() );
>>          getImage(d);
>>      }
>>
>>      static void getImage(PDDocument d) throws IOException {
>>          PDResources res = d.getPage(0).getResources();
>>          for (COSName n: res.getXObjectNames()) {
>>              PDXObject o = res.getXObject(n);
>>              if (o instanceof PDImageXObject){
>>                  PDImageXObject i = (PDImageXObject) o;
>>                  if ("png".equals(i.getSuffix())){
>>                      System.out.println("    Before getImage: "+new
>> Date());
>>                      i.getImage();
>>                      System.out.println("    After  getImage: "+new
>> Date());
>>                  }
>>              }
>>          }
>>      }
>> }
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

RE: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Esteban R <er...@hotmail.com>.

Yes, it is. I have reverted ScratchFile.java and processing now takes less than 2 secs.

I have noticed that the problem may be related to the call to file.length(): even if I add only the LOGs (or even more: if I only add a line: file.length(); to ScratchFile.java) then performance drops.

I have also tested with another +48000 pdf files: I have not found another pdf file with a so big processing time. These are the results for the most time-consuming pdfs. I have noticed that processing time in v2.0.17 seems to be bigger for these files even if I don't use temp files (maybe the machine was more loaded during the tests?)

Columns:
v2.0.16_with_tempfile v2.0.16_without_tempfile  v2.0.17_with_tempfile v2_0_17_without_tempfile
1635 1410 4304 2581
1642 1528 4304 2585
1652 1373 4483 2662
1652 1461 4556 2727
1684 1441 4626 2835
1833 1339 4362 2610
1863 1641 4436 2689
4961 4350 18605 4296


Esteban




________________________________
De: Tilman Hausherr <TH...@t-online.de>
Enviado: jueves, 26 de septiembre de 2019 19:45
Para: users@pdfbox.apache.org <us...@pdfbox.apache.org>
Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Ouch... Might be the workaround implemented for this issue:
https://issues.apache.org/jira/browse/PDFBOX-4601

Tilman

Am 26.09.2019 um 21:39 schrieb Esteban R:
> Hello. I'm getting a timeout in one of my tests after upgrading to v
> 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead
> of less than 2 seconds with previous release 2.0.16.
>
> I cannot provide the sample PDF because it contains sensitive
> information. I have tried to simplify it but the issue (almost)
> dissappears even if I save the file without changing anything.
>
> Some more facts:
>
>   * The issue happens if the load is done with
>     MemoryUsageSetting.setupTempFileOnly (without that flag the issue
>     doesn't happen)
>   * The performance is OK for v 2.0.16
>   * Please find attached the stack trace of the timeout of my original
>     test
>   * PDF structure is quite simple and contains a single image (find
>     attached the relevant data from pdfdebugger)
>   * I have created a sample program to demonstrate the issue: it
>     simply loads the PDF file with the setupTempFileOnly flag and does
>     getImage for all the images (only one in the PDF). It then does
>     the same thing without the flag and after that it does the same
>     thing with another PDF: it is simply the same file saved by PDFBOX
>     with another name.
>
> Output with pdfbox 2.0.16:
> java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;."
> TestExtractImage sample.pdf
> With    temp file
>     Before getImage: Thu Sep 26 16:32:46 ART 2019
>     After  getImage: Thu Sep 26 16:32:46 ART 2019
> Without temp file
>     Before getImage: Thu Sep 26 16:32:46 ART 2019
>     After  getImage: Thu Sep 26 16:32:46 ART 2019
> With    temp file after saving with another name
>     Before getImage: Thu Sep 26 16:32:46 ART 2019
>     After  getImage: Thu Sep 26 16:32:47 ART 2019
> (i.e.: less than 2 seconds in all cases)
>
> Output with pdfbox 2.0.17:
> java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;."
> TestExtractImage sample.pdf
> With    temp file
>     Before getImage: Thu Sep 26 16:31:09 ART 2019
>     After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
> Without temp file
>     Before getImage: Thu Sep 26 16:32:30 ART 2019
>     After  getImage: Thu Sep 26 16:32:30 ART 2019
> With    temp file after saving with another name
>     Before getImage: Thu Sep 26 16:32:30 ART 2019
>     After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"
>
> And here the source code for TestExtractImage.java :
>
> import java.io.File;
> import java.io.IOException;
> import java.util.Date;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDResources;
> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> import org.apache.pdfbox.io.MemoryUsageSetting;
>
> public class TestExtractImage {
>     public static void main(String[] args) throws IOException {
>         System.out.println("With    temp file");
>         PDDocument d = PDDocument.load(new File(args[0]),
> MemoryUsageSetting.setupTempFileOnly() );
>         getImage(d);
>         d.close();
>
>         System.out.println("Without temp file");
>         d = PDDocument.load(new File(args[0]));
>         getImage(d);
>         d.save("other.pdf");
>         d.close();
>         System.out.println("With    temp file after saving with
> another name");
>         d = PDDocument.load(new File("other.pdf"),
> MemoryUsageSetting.setupTempFileOnly() );
>         getImage(d);
>     }
>
>     static void getImage(PDDocument d) throws IOException {
>         PDResources res = d.getPage(0).getResources();
>         for (COSName n: res.getXObjectNames()) {
>             PDXObject o = res.getXObject(n);
>             if (o instanceof PDImageXObject){
>                 PDImageXObject i = (PDImageXObject) o;
>                 if ("png".equals(i.getSuffix())){
>                     System.out.println("    Before getImage: "+new
> Date());
>                     i.getImage();
>                     System.out.println("    After  getImage: "+new
> Date());
>                 }
>             }
>         }
>     }
> }
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org

Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Posted by Tilman Hausherr <TH...@t-online.de>.

Ouch... Might be the workaround implemented for this issue:
https://issues.apache.org/jira/browse/PDFBOX-4601

Tilman

Am 26.09.2019 um 21:39 schrieb Esteban R:
> Hello. I'm getting a timeout in one of my tests after upgrading to v 
> 2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead 
> of less than 2 seconds with previous release 2.0.16.
>
> I cannot provide the sample PDF because it contains sensitive 
> information. I have tried to simplify it but the issue (almost) 
> dissappears even if I save the file without changing anything.
>
> Some more facts:
>
>   * The issue happens if the load is done with
>     MemoryUsageSetting.setupTempFileOnly (without that flag the issue
>     doesn't happen)
>   * The performance is OK for v 2.0.16
>   * Please find attached the stack trace of the timeout of my original
>     test
>   * PDF structure is quite simple and contains a single image (find
>     attached the relevant data from pdfdebugger)
>   * I have created a sample program to demonstrate the issue: it
>     simply loads the PDF file with the setupTempFileOnly flag and does
>     getImage for all the images (only one in the PDF). It then does
>     the same thing without the flag and after that it does the same
>     thing with another PDF: it is simply the same file saved by PDFBOX
>     with another name.
>
> Output with pdfbox 2.0.16:
> java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;." 
> TestExtractImage sample.pdf
> With    temp file
>     Before getImage: Thu Sep 26 16:32:46 ART 2019
>     After  getImage: Thu Sep 26 16:32:46 ART 2019
> Without temp file
>     Before getImage: Thu Sep 26 16:32:46 ART 2019
>     After  getImage: Thu Sep 26 16:32:46 ART 2019
> With    temp file after saving with another name
>     Before getImage: Thu Sep 26 16:32:46 ART 2019
>     After  getImage: Thu Sep 26 16:32:47 ART 2019
> (i.e.: less than 2 seconds in all cases)
>
> Output with pdfbox 2.0.17:
> java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;." 
> TestExtractImage sample.pdf
> With    temp file
>     Before getImage: Thu Sep 26 16:31:09 ART 2019
>     After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
> Without temp file
>     Before getImage: Thu Sep 26 16:32:30 ART 2019
>     After  getImage: Thu Sep 26 16:32:30 ART 2019
> With    temp file after saving with another name
>     Before getImage: Thu Sep 26 16:32:30 ART 2019
>     After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"
>
> And here the source code for TestExtractImage.java :
>
> import java.io.File;
> import java.io.IOException;
> import java.util.Date;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDResources;
> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> import org.apache.pdfbox.io.MemoryUsageSetting;
>
> public class TestExtractImage {
>     public static void main(String[] args) throws IOException {
>         System.out.println("With    temp file");
>         PDDocument d = PDDocument.load(new File(args[0]), 
> MemoryUsageSetting.setupTempFileOnly() );
>         getImage(d);
>         d.close();
>
>         System.out.println("Without temp file");
>         d = PDDocument.load(new File(args[0]));
>         getImage(d);
>         d.save("other.pdf");
>         d.close();
>         System.out.println("With    temp file after saving with 
> another name");
>         d = PDDocument.load(new File("other.pdf"), 
> MemoryUsageSetting.setupTempFileOnly() );
>         getImage(d);
>     }
>
>     static void getImage(PDDocument d) throws IOException {
>         PDResources res = d.getPage(0).getResources();
>         for (COSName n: res.getXObjectNames()) {
>             PDXObject o = res.getXObject(n);
>             if (o instanceof PDImageXObject){
>                 PDImageXObject i = (PDImageXObject) o;
>                 if ("png".equals(i.getSuffix())){
>                     System.out.println("    Before getImage: "+new 
> Date());
>                     i.getImage();
>                     System.out.println("    After  getImage: "+new 
> Date());
>                 }
>             }
>         }
>     }
> }
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org