Posted to users@pdfbox.apache.org by Arthur Wang <ar...@hotmail.com> on 2018/04/16 07:21:35 UTC

Performance issue with PDFBox 2.0.8

Hi, everyone,



I am using PDFBox 2.0.8 and Java 8, running in Tomcat 8 in production, to convert PDFs into images for display. It works well for PDF files smaller than 5 MB, taking about 3,800 milliseconds. However, it slows down considerably when the file size grows to 50 MB: it then takes about 70,000 milliseconds. After setting the system property "sun.java2d.cmm" to "sun.java2d.cmm.kcms.KcmsServiceProvider", the time drops to 32,550 milliseconds, which roughly doubles the speed. But 32 seconds to load a web page is still too slow. Is there any other way to speed this up? Would adding a GPU to the server help? Or is there any other software or hardware solution that could improve the processing speed? My current server has 32 GB of RAM and has never used more than half of it.
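
For readers who want to try the property mentioned above, here is a minimal sketch of setting it once at application startup rather than on every request (the production code later in this thread sets it on every call). The property name and value come from the post; the class name RenderingSetup is hypothetical, and the sketch assumes the property should be in place before Java 2D performs its first colour conversion.

```java
// Minimal sketch: select the KCMS colour management module once, at startup.
// Property name and value are taken from the post; RenderingSetup is a
// hypothetical helper, not part of PDFBox.
final class RenderingSetup {
    static final String CMM_KEY = "sun.java2d.cmm";
    static final String KCMS = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    /** Sets the property unless it was already supplied, e.g. with -D on the JVM command line. */
    static String selectKcms() {
        if (System.getProperty(CMM_KEY) == null) {
            System.setProperty(CMM_KEY, KCMS);
        }
        return System.getProperty(CMM_KEY);
    }

    public static void main(String[] args) {
        System.out.println(CMM_KEY + " = " + selectKcms());
    }
}
```

In a web application this call could live in a startup listener; the same effect can be had by passing the property with -D when the JVM is launched.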


thanks,


Arthur

Re: Fw: Performance issue with PDFBox 2.0.8

Posted by Arthur Wang <ar...@hotmail.com>.
Tilman,


Yes, converting the PDF in the background, either at night or right after the upload finishes, is a good option for improving performance. Regardless of when it is processed, once the image has been produced and saved to storage, every access to the image file will be very fast. I will probably ask the designers to verify their files after they finish uploading, so that the image file gets generated by PDFBox at that point; the rest of the users would then not notice any performance issue.
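
The "produce once, then serve from storage" idea can be sketched as follows. This is only an illustration under assumptions: ThumbnailStore and its Renderer interface are hypothetical names, and Renderer stands in for the actual PDFBox rendering step (for example the processPdf method quoted further down).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: renders a thumbnail only when it is not already on disk. */
final class ThumbnailStore {

    /** Stand-in for the PDFBox rendering step; not a PDFBox type. */
    interface Renderer {
        void render(Path pdf, Path image) throws IOException;
    }

    private final Renderer renderer;

    ThumbnailStore(Renderer renderer) {
        this.renderer = renderer;
    }

    /** Returns the thumbnail path, rendering it first if it does not exist yet. */
    synchronized Path thumbnailFor(Path pdf, Path thumbnail) throws IOException {
        if (Files.exists(thumbnail)) {
            return thumbnail; // already rendered: the slow step is skipped
        }
        // Render to a temporary sibling (same extension) and move it into place,
        // so readers never see a half-written image.
        Path tmp = thumbnail.resolveSibling("tmp-" + thumbnail.getFileName());
        renderer.render(pdf, tmp);
        Files.move(tmp, thumbnail, StandardCopyOption.REPLACE_EXISTING);
        return thumbnail;
    }
}
```

The first request (or the designer's verification step) pays the rendering cost; every later request only checks that the file exists.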


I was trying to test pdf.js as a comparison; however, after a few hours of configuration I am still not able to view helloworld.pdf on a web page served by my test Apache server, so I will stick with PDFBox for now. Thanks for the recommendation.


Arthur

________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Tuesday, April 17, 2018 3:17 PM
To: users@pdfbox.apache.org
Subject: Re: Fw: Performance issue with PDFBox 2.0.8

Hi,

Yeah, for thumbnails / previews the subsampling option is definitely
for you.

Can you calculate the preview in the background? I.e. at the time the
PDFs are uploaded, instead of when the download page is requested?
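
One way to compute the preview at upload time without blocking the upload request is a small background queue. The sketch below is an assumption-laden illustration: PreviewQueue and onUploadFinished are hypothetical names, and the Runnable passed in would wrap the actual PDFBox rendering call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical upload hook that renders previews off the request thread. */
final class PreviewQueue implements AutoCloseable {
    // A single worker keeps large renderings from competing with each other for CPU.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Called when an upload finishes; returns immediately while rendering runs in the background. */
    Future<?> onUploadFinished(Runnable renderPreview) {
        return worker.submit(renderPreview);
    }

    /** Stops accepting new work; already queued renderings still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

A single worker thread is a deliberate choice here so that several large uploads are rendered one after another; a larger pool would be a reasonable alternative on a multi-core server.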

Re pdf.js you can test it here:
https://mozilla.github.io/pdf.js/web/viewer.html

I tried the Herman file; to me it seemed slower with PDF.js
than with PDFBox, which is a bit surprising because PDF.js is usually faster.

With PDFDebugger with subsampling enabled it is rendered in 4409ms on my
system. I don't know if my graphics card plays any role in this.
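
To compare timings like these between machines, it helps to keep one-time initialization out of the measurement, as discussed further down in this thread. A minimal sketch, assuming a hypothetical RenderBenchmark helper whose Runnable wraps the rendering call:

```java
import java.util.Arrays;

/** Hypothetical timing helper: untimed warm-up runs first, then the median of the timed runs. */
final class RenderBenchmark {

    /** Runs the task 'warmups' times untimed, then 'runs' times timed (runs must be at least 1). */
    static long medianMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // absorbs class loading and other one-time initialization
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(samples);
        return samples[runs / 2]; // middle sample (upper median for even counts)
    }
}
```

The median is used instead of the mean so that a single slow run, for example one interrupted by garbage collection, does not dominate the result.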

Tilman

Am 17.04.2018 um 23:40 schrieb Arthur Wang:
> Arthur Wang has shared OneDrive files with you:
>
> Herman & hiss - PPHI101201 - FV 1.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
> fileListPage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
> downloadpage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>
>
>
> Tilman,
>
>
> Since my email was rejected due to the 1 MB size limit of the Apache mail server, I am sending it again here.
>
>
> First, thank you very much for the extra information and update.
>
>
> My application is an internal web-based production system. Designers in our graphics department upload print-ready files to the system every hour, and other users (prepress, press, shipping, and customers) log into the system to download them. The print-ready PDF files are sometimes extremely large: 5 MB to 1 GB is most common, and 2 GB to 5 GB is rare but does happen. Please refer to the two attached screenshots (fileListPage, downloadPage). What I am trying to do is show a thumbnail on the file download page. We used to show a download icon there instead of a thumbnail, but users then had to download the file to their local computer before actually seeing it. Sometimes the file list page shows a long list of files and people get confused, so it would be more convenient for users to have a peek at the file before actually downloading it. As for pdf.js, I have never tried it. Do you think it can load a 40 MB or 50 MB file in one or two seconds from the Apache server?
>
>
> I have copied my code below for your reference (one version is for testing, the other is for production).
>
>
> Attached you will also find a PDF file named Herman..pdf. It has only two pages, but even when converting only the first page, the best I can do is 7 seconds. That would be very slow for a web application. If adding a GPU could improve the performance, I would certainly like to try it; I am just not sure whether it is going to work.
>
>
> ******************below is testing code running on the Eclipse platform**************
>
>
> package com.test;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.ImageType;
> import org.apache.pdfbox.rendering.PDFRenderer;
> import org.apache.pdfbox.tools.imageio.ImageIOUtil;
>
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import org.apache.commons.lang3.time.StopWatch;
> import org.apache.commons.lang3.StringUtils;
>
> public class PdfToImage {
>
>      private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>
>      public static void main(String[] args) throws Exception{
>
>          System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>
>          StopWatch stopwatch = new StopWatch();
>
>          stopwatch.start();
>
>          try (final PDDocument document = PDDocument.load(new File("/Users/someone/Desktop/Herman & hiss - PPHI101201 - FV.pdf"))){
>              PDFRenderer pdfRenderer = new PDFRenderer(document);
>              pdfRenderer.setSubsamplingAllowed(true);
>              //for (int page = 0; page < document.getNumberOfPages(); ++page)
>              for (int page = 0; page < 1; ++page)
>              {
>                  BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB); //<--this number have performance impact
>                  String fileName = OUTPUT_DIR + "Herman & hiss - PPHI101201 - FV" + page + ".jpg";
>                  ImageIOUtil.writeImage(bim, fileName, 72); //<---this number
>              }
>              document.close();
>          } catch (IOException e){
>              System.err.println("Exception while trying to create pdf document - " + e);
>          }
>
>           stopwatch.stop(); // optional
>          System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>
>
>      }
>      // Test file: Ashley NJ_HHL101125_FV.pdf, 40 MB, 4 pages
>      // Ashley without the property set: 4 pages @ 70074 milliseconds
>      // Ashley with the property set: 4 pages @ 32552 milliseconds
>      // with subsampling allowed: 4 pages @ 9481 milliseconds
>      // Herman & hiss - PPHI101201 - FV.png: two pages @ 14050 milliseconds
>      // Herman & hiss - PPHI101201 - FV.jpg: two pages @ 13612 milliseconds
>      // Herman: 1 page @ 7625 milliseconds
>      // Ashley: 1 page @ 3237 milliseconds
>      // Ashley with 72 dpi: 1 page @ 2807 milliseconds
>      // Herman with 72 dpi: 1 page @ 6788 milliseconds
>      // Herman without subsampling allowed: 1 page @ 7087 milliseconds
>
> }
>
>
>
> *****************below is production code running as an action class of struts *********
>
>
> public void processPdf(String pdfFilePath, String imageFilePath){
>
>          System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>
>          try (final PDDocument document = PDDocument.load(new File(pdfFilePath))){
>              PDFRenderer pdfRenderer = new PDFRenderer(document);
>              pdfRenderer.setSubsamplingAllowed(true);
>              //for (int page = 0; page < document.getNumberOfPages(); ++page)
>              for (int page = 0; page < 1; ++page)
>              {
>                  BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB);
>
>                  ImageIOUtil.writeImage(bim, imageFilePath, 72);
>              }
>              document.close();
>          } catch (IOException e){
>                  log.info("Exception while trying to create pdf document - " + e);
>          }
>
>
>      }
>
>
> *********************
>
>
>
> ________________________________
> From: Tilman Hausherr <TH...@t-online.de>
> Sent: Tuesday, April 17, 2018 10:39 AM
> To: users@pdfbox.apache.org
> Subject: Re: Performance issue with PDFBox 2.0.8
>
> Hi,
>
> I ran the Ashley file through the profiler; most of the time is spent
> decoding the JPEG images within and converting some of them from CMYK to
> RGB. Nothing to optimize. I also found another one-time initialization
> that takes 100-300ms, which I will add to the next version of PDFDebugger.
>
>       FilterFactory.INSTANCE.getFilter(COSName.FLATE_DECODE);
>
> I also tested the UsePureJavaCMYKConversion, it made rendering much
> slower. IIRC, that option only helps with files with many tiny CMYK images.
>
> I have committed a change that adds the subsampling option to
> PDFToImage, that version will be available within a few hours at
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.10-SNAPSHOT/
> look for today's date.
>
> Or get the source code here:
> https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?revision=1829374&view=markup
>
> What type of application are you creating? If you want to show a PDF in
> the browser, PDF.js works nicely, is free and included in firefox. If
> you want to do thumbnails, then you should use a smaller dpi value. In
> that case using subsampling would help even more.
>
> Tilman
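
The effect of the dpi value on thumbnail size can be worked out with simple arithmetic, since PDF page dimensions are expressed in points of 1/72 inch. The sketch below is an illustration only: ThumbnailDpi is a hypothetical helper, and rounding down to whole pixels is an assumption rather than a statement about PDFBox internals.

```java
/** Hypothetical helper relating page size in points, dpi and rendered pixel size. */
final class ThumbnailDpi {
    static final double POINTS_PER_INCH = 72.0;

    /** Approximate pixel length of a page side of 'points' rendered at 'dpi' (rounded down). */
    static int pixelsAt(double points, double dpi) {
        return (int) Math.floor(points * dpi / POINTS_PER_INCH);
    }

    /** The dpi value that renders a page of the given width at the target pixel width. */
    static double dpiForWidth(double pageWidthPoints, int targetWidthPixels) {
        return targetWidthPixels * POINTS_PER_INCH / pageWidthPoints;
    }

    public static void main(String[] args) {
        double letterWidth = 612.0; // US Letter is 8.5 inches wide, i.e. 612 points
        System.out.println(pixelsAt(letterWidth, 72.0));   // width in pixels at 72 dpi
        System.out.println(dpiForWidth(letterWidth, 204)); // dpi needed for a 204-pixel-wide thumbnail
    }
}
```

For a US Letter page, 72 dpi gives a 612-pixel-wide image, while a 204-pixel-wide thumbnail needs only 24 dpi, a ninth of the pixels to fill.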
>
>
>
> Am 17.04.2018 um 07:26 schrieb Tilman Hausherr:
>> Hi,
>>
>> I have a Ryzen 1700 cpu and for tests I'm running it on max energy
>> settings. It is unclear if a mac has a similar setting.  This url
>> http://www.macos.utah.edu/documentation/administration/pmset.html
>> shows there is a setting for "better performance" but I don't know if
>> that does the same as on Windows where I get a performance doubling.
>> Try PDFDebugger, it has a built-in benchmark feature, it shows the
>> rendering speed in the status line.
>>
>> I'm also avoiding that one-time initializations are part of the
>> benchmark results with this code that is also in PDFDebugger:
>>
>>          // trigger premature initializations for more accurate rendering benchmarks
>>          // See discussion in PDFBOX-3988
>>          if (PDType1Font.COURIER.isStandard14())
>>          {
>>              // Yes this is always true
>>              PDDeviceCMYK.INSTANCE.toRGB(new float[] { 0, 0, 0, 0} );
>>              PDDeviceRGB.INSTANCE.toRGB(new float[] { 0, 0, 0 } );
>>              IIORegistry.getDefaultInstance();
>>          }
>>
>> I see you're using the PDFToImage utility. That one doesn't support
>> subsampling yet, it has been on my "todo" list for a few days, I'll
>> try to do it tonight... But PDFToImage is really just a command line
>> utility.
>>
>> Args 7, 8 and 11 don't work that way. Re arg 7 and 8, you need to call
>> System.setProperty(). Re arg 11, you need to have a PDFRenderer object.
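
The point about args 7 and 8 can be illustrated in isolation: a string that merely looks like a -D option, passed as a program argument, does not become a JVM system property. The property name below is the one from the thread; the class ArgsVersusProperties is hypothetical, and reading the flag with Boolean.getBoolean is an assumption made for the demonstration, not a claim about how PDFBox reads it.

```java
/** Hypothetical demonstration: "-D..." inside a String[] is just text, not a system property. */
final class ArgsVersusProperties {
    static final String KEY = "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion";

    /** True only if the JVM-wide system property is set to "true". */
    static boolean enabled() {
        return Boolean.getBoolean(KEY);
    }

    public static void main(String[] args) {
        System.clearProperty(KEY);
        // Plain strings, as in args_1[7] and args_1[8] of the quoted code below.
        String[] toolArgs = {"-D" + KEY, "true"};
        System.out.println("args built (" + toolArgs.length + "), enabled: " + enabled());

        // What actually changes the property from inside a running JVM.
        System.setProperty(KEY, "true");
        System.out.println("after setProperty, enabled: " + enabled());
    }
}
```

The -D form only has an effect when it is given to the java launcher itself, before the main class name.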
>>
>> Another way to convert to images is explained here:
>> https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images
>>
>>
>> In that code, call pdfRenderer.setSubsamplingAllowed(true) to activate
>> subsampling. PDFDebugger also supports it in the menu.
>>
>> Tilman
>>
>> Am 17.04.2018 um 01:20 schrieb Arthur Wang:
>>> Tilman,
>>>
>>>
>>> Thanks for the quick response and for testing my case. Below is my
>>> Java code and my test results after allowing subsampling. For
>>> the first page of the Ashley file, it took 3362 milliseconds.
>>>
>>> For the Gill file, the elapsed time is 2456 milliseconds.
>>>
>>> My tests were conducted on my Mac with a 2.2 GHz Core i7 processor.
>>> How come your PC runs so fast? 1.4 seconds is fast enough for web
>>> access. Maybe there is something wrong with my code? I would
>>> appreciate it if you took a look at my code.
>>>
>>>
>>> Best,
>>>
>>>
>>> Arthur
>>>
>>>
>>> *******************
>>>
>>> import org.apache.pdfbox.tools.PDFToImage;
>>> //import java.awt.image.BufferedImage;
>>> import java.io.File;
>>> //import java.io.IOException;
>>> //import java.io.OutputStream;
>>> import org.apache.commons.lang3.time.StopWatch;
>>>
>>>
>>> public class PdfToImage2 {
>>>
>>>       private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>>
>>>       public static void main(String[] args) throws Exception{
>>>
>>>           String pdfPath = "/Users/someone/Desktop/Ashley NJ_HHL101125_FV.pdf";
>>>           //config option 2:convert page 1 in pdf to image
>>>           String [] args_1 =  new String[13];
>>>           args_1[0] = "-startPage";
>>>           args_1[1] = "1";
>>>           args_1[2] = "-endPage";
>>>           args_1[3] = "1";
>>>           args_1[4] = "-outputPrefix";
>>>           args_1[5] = OUTPUT_DIR+"Ashley NJ_HHL101125_FV1";
>>>           args_1[6] = pdfPath;
>>>           args_1[7] = "-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
>>>           args_1[8] = "true";
>>>           args_1[9] = "-dpi";
>>>           args_1[10] = "72"; // @48 --> 3283 milliseconds, @96 --> 3545 milliseconds, @72 --> 3362 milliseconds
>>>           args_1[11] = "-PDFRenderer.setSubsamplingAllowed";
>>>           args_1[12] = "true";
>>>
>>>           File f = new File(args_1[5]+"1.jpg");
>>>           if(f.exists() && !f.isDirectory()) {
>>>               System.out.println("file exists already");
>>>           }
>>>           else{
>>>
>>>               StopWatch stopwatch = new StopWatch();
>>>
>>>               stopwatch.start();
>>>
>>>                 try {
>>>
>>>                   System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>>                   PDFToImage.main(args_1);
>>>                   System.out.println("Done!");
>>>                 } catch (Exception e) {
>>>                     System.err.println("Exception while trying to create pdf document - " + e);
>>>                 }
>>>
>>>                    stopwatch.stop(); // optional
>>>                   System.out.println("Time elapsed is " + stopwatch.getTime() + " milliseconds");
>>>
>>>
>>>           }//else
>>>
>>>           //first try without setting property: 3779 milliseconds
>>>           //second try with the property set: 3852 milliseconds
>>>           //third try with subsamplingAllowed: 3362 milliseconds
>>>
>>>       }
>>> }
>>>
>>> *******************************
>>>
>>> ________________________________
>>> From: Tilman Hausherr <TH...@t-online.de>
>>> Sent: Monday, April 16, 2018 10:55 AM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: Performance issue with PDFBox 2.0.8
>>>
>>> The java code didn't get through, most attachments get deleted. Call
>>> PDFRenderer.setSubsamplingAllowed(true) to activate subsampling.
>>>
>>> I had a look at your files... These are not extremely slow renderings. 4
>>> seconds for such a page is pretty good.
>>>
>>> On my PC, the first page of the Ashley file is rendered in PDFDebugger
>>> in 1.4 seconds at 72dpi. The Gill file is done in less than a second.
>>>
>>> Tilman
>>>
>>> Am 16.04.2018 um 19:05 schrieb Arthur Wang:
>>>> Arthur Wang has shared OneDrive files with you:
>>>>
>>>> Ashley NJ_HHL101125_FV.pdf <https://1drv.ms/b/s%21AhA_REgBppCpgQluAoJe28B935ru>
>>>> Gill1-1356_KM102685-INS_FV.pdf <https://1drv.ms/b/s%21AhA_REgBppCpgQpdnBIl_hmK6Wt0>
>>>> Screen Shot 2018-04-16 at 9.23.52 AM.png <https://1drv.ms/u/s%21AhA_REgBppCpgQvygYjm2eaJQmSH>
>>>>
>>>> I just tried 2.0.9, and it works almost the same. Processing all 4 pages
>>>> takes 32 seconds; processing only the first page takes about 4
>>>> seconds.
>>>>
>>>>
>>>> My server is an HP DL380 with dual Xeon processors and 32 GB of RAM; the hard
>>>> drive is an Intel Optane NVMe SSD.
>>>>
>>>> Once the JPG image is produced, access to the image is almost
>>>> instant regardless of the size of the image file, so the time to open and
>>>> close the image file is insignificant and can be ignored.
>>>>
>>>>
>>>> By enabling subsampling, do you mean setting the dpi option? Do you
>>>> have sample code for PDFRenderer? The attached file
>>>> PdfToImage2.java is my testing code. Ashley...pdf is a file of
>>>> about 45 MB, and Gill...pdf is a file of about 5 MB. At
>>>> 1/10th the size of the other one, the processing time drops to
>>>> 2657 milliseconds compared to 3779 milliseconds, so it seems that size
>>>> does matter.
>>>>
>>>>
>>>> thanks,
>>>>
>>>>
>>>> Arthur
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> *From:* Tilman Hausherr <TH...@t-online.de>
>>>> *Sent:* Monday, April 16, 2018 8:57 AM
>>>> *To:* users@pdfbox.apache.org
>>>> *Subject:* Re: Performance issue with PDFBox 2.0.8
>>>> Please
>>>> - retry with the current version 2.0.9
>>>> - share your file for a profiler analysis
>>>> - as said by Itai (who implemented it) try enabling subsampling in
>>>> PDFRenderer (read the javadoc first). Compare the results and decide
>>>> whether the quality is OK for you.
>>>> - set the energy settings of your computer to maximum or at least to
>>>> "balanced", not to "energy save"
>>>> - don't know if adding GPU will help;
>>>> - try also the
>>>> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true option
>>>>
>>>> The speed is not related to the size but to the complexity. 32 seconds
>>>> may sound disappointing but it's not the worst I've ever seen. "Nice
>>>> illustrations" with nested patterns or large shadings may be slow.
>>>>
>>>> Tilman
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Fw: Performance issue with PDFBox 2.0.8

Posted by Arthur Wang <ar...@hotmail.com>.
Thank you very much for that information; it's very helpful. I will probably try changing the energy settings of the server to see if it runs faster.


Arthur

________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Wednesday, April 18, 2018 8:22 PM
To: users@pdfbox.apache.org
Subject: Re: Fw: Performance issue with PDFBox 2.0.8

Am 19.04.2018 um 00:18 schrieb Arthur Wang:
> Hi, Tilman,
>
>
> From your last email: "I don't know if my graphics card plays any role in this". By any chance, would it be possible for you to benchmark whether the GPU really helps PDFBox processing or not? I cannot plug a GPU into my Mac to do the testing, but if you already have a GPU in your PC, I would be very interested to know whether performance decreases after you unplug it (if it is unpluggable). If it proves to be helpful, I would like to buy a nice GPU and put it into our production server to improve the performance.

Sorry, no, I can't... My GPU is a $100 model (I don't play games), and I
didn't find any setting to turn it on/off, and I don't have a different
replacement adapter that I could switch. In java there is
"-Dsun.java2d.opengl=true" but it has no effect.
https://docs.oracle.com/javase/8/docs/technotes/guides/2d/flags.html#opengl



The only things that make a difference on my system (Windows) are
enabling subsampling and setting the energy settings to max performance.
With both, the first page of the Ashley file is rendered in 922ms.

Tilman
>
>
> If this takes some effort or time, never mind,
>
>
> thanks for all the help,
>
>
> Arthur
>
>
> ________________________________
> From: Tilman Hausherr <TH...@t-online.de>
> Sent: Tuesday, April 17, 2018 3:17 PM
> To: users@pdfbox.apache.org
> Subject: Re: Fw: Performance issue with PDFBox 2.0.8
>
> Hi,
>
> Yeah, for thumbnails / previews the subsampling option is definitively
> for you.
>
> Can you calculate the preview in the background? I.e. at the time the
> PDFs are uploaded, instead of when the download page is requested?
>
> Re pdf.js you can test it here:
> https://mozilla.github.io/pdf.js/web/viewer.html
> PDF.js viewer - GitHub Pages<https://mozilla.github.io/pdf.js/web/viewer.html>
> mozilla.github.io
> Enter the password to open this PDF file: Cancel OK. File name:-
>
>
>
> I tried the Herman file, to me it seemed that it is slower with PDF.js
> than PDFBox, which is a bit surprising because usually they're faster.
>
> With PDFDebugger with subsampling enabled it is rendered in 4409ms on my
> system. I don't know if my graphics card plays any role in this.
>
> Tilman
>
> Am 17.04.2018 um 23:40 schrieb Arthur Wang:
>> Arthur Wang has shared OneDrive files with you. To view them, click the links below.
>>
>>
>> <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>> [https://r1.res.office365.com/owa/prem/images/dc-pdf_20.png]<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>>
>> Herman & hiss - PPHI101201 - FV 1.pdf<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>>
>> <https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
>> [https://r1.res.office365.com/owa/prem/images/dc-png_20.png]<https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
>>
>> fileListPage.png<https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
>>
>> <https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>
>> [https://r1.res.office365.com/owa/prem/images/dc-png_20.png]<https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>
>>
>> downloadpage.png<https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>
>>
>>
>>
>>
>> Arthur Wang has shared a OneDrive file with you. To view it, click the link below.
>>
>>
>> <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>> [https://r1.res.office365.com/owa/prem/images/dc-pdf_20.png]<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>>
>> Herman & hiss - PPHI101201 - FV 1.pdf<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>>
>>
>> [https://ipodlq.by.files.1drv.com/y4mV0VxHK2D4uxyC4TQL_Pm0kYsbjNHGUAguMLxlVa4ykUtL6nwg19za0G74IoFusv2FwwJzohKoGdIKXg_MF26eavOY6hXbddC36qMI8vALhyNSmU8cAlpuAsMwah5b5__skoa2koVvs5wP2tekcxmBSQE-KPRahIsVu6ZtVnLV_I?width=800&height=800&cropmode=none]<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>>
>> [https://r1.res.office365.com/owa/prem/images/dc-pdf_40.png]<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>> Herman & hiss - PPHI101201 - FV 1.pdf<https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>> Shared via OneDrive
>>
>>
>>
>>
>> Tilman,
>>
>>
>> Since my email got rejected due to the size limit of apache mail server--1m. I send it again here.
>>
>>
>> First, thank you very much for the extra information and udpate.
>>
>>
>> My application is an internal web based production system. Many designers in our graphic department uploaded the print-ready file to the system every hours, and other users include prepress, press, shipping, customers will log into the system to download the files. The print-ready pdf file sometimes are extremely big in terms of the size, 5 M to 1 G are most popular, 2G to 5G are rare, but do happen sometimes. Please refer to the attached two screenshots(fileListPage, downloadPage). What I am trying to do is to show a thumbnail on the fileDownloadPage. we used to show a download icon on the download page instead of the thumbnail, but users have to download the file to their local computer before actually seeing it. Sometimes the fileListPage show a long list of files, people get confused, it would be more convenient for the user to have a peek of the file before actually download it. so it's better to have a thumbnail on the download page. In terms of the pdf.js, I had never tried, do you think it can load 40M or 50 M file in one or two seconds by the apache server?
>>
>>
>> I copied my code below for you reference.(one is for testing, the other one is for production .)
>>
>>
>> Attached you will also find a pdf file named Herman..pdf. it only has two pages, but by only convert the first page, the best I can do it 7 seconds. that would be very slow for web application. If by adding a GPU, the performance could get better, I would certainly like to try, just not sure if it's going to work.
>>
>>
>> ******************below are testing code running on eclilpse platform**************
>>
>>
>> package com.test;
>>
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.rendering.ImageType;
>> import org.apache.pdfbox.rendering.PDFRenderer;
>> import org.apache.pdfbox.tools.imageio.ImageIOUtil;
>>
>> import java.awt.image.BufferedImage;
>> import java.io.File;
>> import java.io.IOException;
>> import org.apache.commons.lang3.time.StopWatch;
>> import org.apache.commons.lang3.StringUtils;
>>
>> public class PdfToImage {
>>
>>       private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>
>>       public static void main(String[] args) throws Exception{
>>
>>           System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>
>>           StopWatch stopwatch = new StopWatch();
>>
>>           stopwatch.start();
>>
>>           try (final PDDocument document = PDDocument.load(new File("/Users/someone/Desktop/Herman & hiss - PPHI101201 - FV.pdf"))){
>>               PDFRenderer pdfRenderer = new PDFRenderer(document);
>>               pdfRenderer.setSubsamplingAllowed(true);
>>               //for (int page = 0; page < document.getNumberOfPages(); ++page)
>>               for (int page = 0; page < 1; ++page)
>>               {
>>                   BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB); //<--this number have performance impact
>>                   String fileName = OUTPUT_DIR + "Herman & hiss - PPHI101201 - FV" + page + ".jpg";
>>                   ImageIOUtil.writeImage(bim, fileName, 72); //<---this number
>>               }
>>               document.close();
>>           } catch (IOException e){
>>               System.err.println("Exception while trying to create pdf document - " + e);
>>           }
>>
>>            stopwatch.stop(); // optional
>>           System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>>
>>
>>       }
>>       //test Files: Ashley NJ_HHL101125_FV.pdf, 40M, 4 pages
>>       //try Ashley without set property: 4 pages@70074 milliseconds
>>       //try Ashley with property set:   4 pagesQ@32552 milliseconds
>>       //try have subSampling true set: 4 pages@9481 milliseconds
>>       //try Herman & hiss - PPHI101201 - FV.png: two pages@14050 milliseconds
>>       //try Herman & hiss - PPHI101201 - FV.jpg: two pages@13612 milliseconds
>>       //try Herman: 1 page@7625
>>       //try Ashley: 1 page@3237
>>       //try Ashely with 72 dpi: 1 page@2807
>>       //try Herman with 72 dpi: 1 page@6788
>>       //try herman without subSampling true setting: 1 page@7087
>>
>> }
>>
>>
>>
>> *****************below is production code running as an action class of struts *********
>>
>>
>> public void processPdf(String pdfFilePath, String imageFilePath){
>>
>>           System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>
>>           try (final PDDocument document = PDDocument.load(new File(pdfFilePath))){
>>               PDFRenderer pdfRenderer = new PDFRenderer(document);
>>               pdfRenderer.setSubsamplingAllowed(true);
>>               //for (int page = 0; page < document.getNumberOfPages(); ++page)
>>               for (int page = 0; page < 1; ++page)
>>               {
>>                   BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB);
>>
>>                   ImageIOUtil.writeImage(bim, imageFilePath, 72);
>>               }
>>               document.close();
>>           } catch (IOException e){
>>                   log.info("Exception while trying to create pdf document - " + e);
>>           }
>>
>>
>>       }
>>
>>
>> *********************
>>
>>
>>
>> ________________________________
>> From: Tilman Hausherr <TH...@t-online.de>
>> Sent: Tuesday, April 17, 2018 10:39 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: Performance issue with PDFBox 2.0.8
>>
>> Hi,
>>
>> I ran the Ashley file through the profiler. Most of the time is spent
>> decoding the JPEG images inside it and converting some of them from CMYK to
>> RGB. Nothing to optimize. I also found another one-time initialization
>> that takes 100-300ms, which I will add to the next version of PDFDebugger.
>>
>>        FilterFactory.INSTANCE.getFilter(COSName.FLATE_DECODE);
>>
>> I also tested the UsePureJavaCMYKConversion, it made rendering much
>> slower. IIRC, that option only helps with files with many tiny CMYK images.
>>
>> I have committed a change that adds the subsampling option to
>> PDFToImage. That version will be available within a few hours at
>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.10-SNAPSHOT/
>> (look for today's date).
>>
>> Or get the source code here:
>> https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?revision=1829374&view=markup
>>
>> What type of application are you creating? If you want to show a PDF in
>> the browser, PDF.js works nicely, is free and is included in Firefox. If
>> you want to do thumbnails, then you should use a smaller dpi value. In
>> that case using subsampling would help even more.
>>
>> Tilman
>>
>>
>>
>> Am 17.04.2018 um 07:26 schrieb Tilman Hausherr:
>>> Hi,
>>>
>>> I have a Ryzen 1700 CPU and for tests I'm running it on max energy
>>> settings. I don't know whether a Mac has a similar setting. This URL
>>> http://www.macos.utah.edu/documentation/administration/pmset.html
>>> shows there is a setting for "better performance" but I don't know if
>>> that does the same as on Windows, where I get a performance doubling.
>>> Try PDFDebugger; it has a built-in benchmark feature that shows the
>>> rendering speed in the status line.
>>>
>>> I also keep one-time initializations out of the
>>> benchmark results with this code, which is also in PDFDebugger:
>>>
>>>           // trigger premature initializations for more accurate rendering benchmarks
>>>           // See discussion in PDFBOX-3988
>>>           if (PDType1Font.COURIER.isStandard14())
>>>           {
>>>               // Yes this is always true
>>>               PDDeviceCMYK.INSTANCE.toRGB(new float[] { 0, 0, 0, 0} );
>>>               PDDeviceRGB.INSTANCE.toRGB(new float[] { 0, 0, 0 } );
>>>               IIORegistry.getDefaultInstance();
>>>           }
>>>
>>> I see you're using the PDFToImage utility. That one doesn't support
>>> subsampling yet, it has been on my "todo" list for a few days, I'll
>>> try to do it tonight... But PDFToImage is really just a command line
>>> utility.
>>>
>>> Args 7, 8 and 11 don't work that way. Re arg 7 and 8, you need to call
>>> System.setProperty(). Re arg 11, you need to have a PDFRenderer object.
>>>
>>> Another way to convert to images is explained here:
>>> https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images
>>>
>>>
>>> There, call pdfRenderer.setSubsamplingAllowed(true) to activate
>>> subsampling. PDFDebugger also supports it in the menu.
>>>
>>> Tilman
>>>
>>> Am 17.04.2018 um 01:20 schrieb Arthur Wang:
>>>> Tilman,
>>>>
>>>>
>>>> Thanks for the quick response and for testing my case. Below is my
>>>> Java code and my test result after enabling subsampling. For
>>>> the first page of the Ashley file, it took 3362 milliseconds.
>>>>
>>>> For the Gill file, the elapsed time is 2456 milliseconds.
>>>>
>>>> My tests were run on my Mac with a 2.2 GHz Core i7 processor.
>>>> How come your PC runs so fast? 1.4 seconds is fast enough for web
>>>> access. Maybe there is something wrong with my code? I would
>>>> appreciate it if you could take a look.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Arthur
>>>>
>>>>
>>>> *******************
>>>>
>>>> import org.apache.pdfbox.tools.PDFToImage;
>>>> //import java.awt.image.BufferedImage;
>>>> import java.io.File;
>>>> //import java.io.IOException;
>>>> //import java.io.OutputStream;
>>>> import org.apache.commons.lang3.time.StopWatch;
>>>>
>>>>
>>>> public class PdfToImage2 {
>>>>
>>>>        private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>>>
>>>>        public static void main(String[] args) throws Exception{
>>>>
>>>>            String pdfPath = "/Users/someone/Desktop/Ashley NJ_HHL101125_FV.pdf";
>>>>            //config option 2: convert page 1 of the pdf to an image
>>>>            String [] args_1 =  new String[13];
>>>>            args_1[0] = "-startPage";
>>>>            args_1[1] = "1";
>>>>            args_1[2] = "-endPage";
>>>>            args_1[3] = "1";
>>>>            args_1[4] = "-outputPrefix";
>>>>            args_1[5] = OUTPUT_DIR+"Ashley NJ_HHL101125_FV1";
>>>>            args_1[6] = pdfPath;
>>>>            args_1[7] = "-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
>>>>            args_1[8] = "true";
>>>>            args_1[9] = "-dpi";
>>>>            args_1[10] = "72"; // 48 dpi: 3283 ms, 96 dpi: 3545 ms, 72 dpi: 3362 ms
>>>>            args_1[11] = "-PDFRenderer.setSubsamplingAllowed";
>>>>            args_1[12] = "true";
>>>>
>>>>            File f = new File(args_1[5]+"1.jpg");
>>>>            if(f.exists() && !f.isDirectory()) {
>>>>                System.out.println("file exists already");
>>>>            }
>>>>            else{
>>>>
>>>>                StopWatch stopwatch = new StopWatch();
>>>>
>>>>                stopwatch.start();
>>>>
>>>>                try {
>>>>                    System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>>>                    PDFToImage.main(args_1);
>>>>                    System.out.println("Done!");
>>>>                } catch (Exception e) {
>>>>                    System.err.println("Exception while trying to create pdf document - " + e);
>>>>                }
>>>>
>>>>                stopwatch.stop(); // optional
>>>>                System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>>>>
>>>>
>>>>            }//else
>>>>
>>>>            //first try without setting property: 3779 milliseconds
>>>>            //second try with the property set: 3852 milliseconds
>>>>            //third try with subsamplingAllowed: 3362 milliseconds
>>>>
>>>>        }
>>>> }
>>>>
>>>> *******************************
>>>>
>>>> ________________________________
>>>> From: Tilman Hausherr <TH...@t-online.de>
>>>> Sent: Monday, April 16, 2018 10:55 AM
>>>> To: users@pdfbox.apache.org
>>>> Subject: Re: Performance issue with PDFBox 2.0.8
>>>>
>>>> The Java code didn't get through; most attachments get deleted. Call
>>>> PDFRenderer.setSubsamplingAllowed(true) to activate subsampling.
>>>>
>>>> I had a look at your files... These are not extremely slow renderings. 4
>>>> seconds for such a page is pretty good.
>>>>
>>>> On my PC, the first page of the Ashley file is rendered in PDFDebugger
>>>> in 1.4 seconds at 72dpi. The Gill file is done in less than a second.
>>>>
>>>> Tilman
>>>>
>>>> Am 16.04.2018 um 19:05 schrieb Arthur Wang:
>>>>> Arthur Wang has shared OneDrive files with you. To view them, click
>>>>> the links below.
>>>>>
>>>>> Ashley NJ_HHL101125_FV.pdf
>>>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQluAoJe28B935ru>
>>>>>
>>>>> Gill1-1356_KM102685-INS_FV.pdf
>>>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQpdnBIl_hmK6Wt0>
>>>>>
>>>>> Screen Shot 2018-04-16 at 9.23.52 AM.png
>>>>> <https://1drv.ms/u/s%21AhA_REgBppCpgQvygYjm2eaJQmSH>
>>>>>
>>>>> I just tried 2.0.9 and it behaves almost the same: processing all 4 pages
>>>>> takes 32 seconds, and processing only the first page takes about 4
>>>>> seconds.
>>>>>
>>>>>
>>>>> My server is an HP DL380 with dual Xeon processors and 32 GB RAM; the
>>>>> drive is an Intel Optane NVMe SSD.
>>>>>
>>>>> Once the JPG image is produced, access to the image is almost
>>>>> instant regardless of the size of the image file, so the time to open
>>>>> and close the image file is insignificant and can be ignored.
>>>>>
>>>>>
>>>>> By enabling subsampling, do you mean setting the dpi option? Do you
>>>>> have sample code for PDFRenderer? The attached file
>>>>> PdfToImage2.java is my test code. Ashley...pdf is about 45 MB
>>>>> and Gill...pdf is about 5 MB. At 1/10th the size of the other
>>>>> file, the processing time drops to 2657 milliseconds compared to
>>>>> 3779 milliseconds, so it seems the size does matter.
>>>>>
>>>>>
>>>>> thanks,
>>>>>
>>>>>
>>>>> Arthur
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:* Tilman Hausherr <TH...@t-online.de>
>>>>> *Sent:* Monday, April 16, 2018 8:57 AM
>>>>> *To:* users@pdfbox.apache.org
>>>>> *Subject:* Re: Performance issue with PDFBox 2.0.8
>>>>> Please
>>>>> - retry with the current version 2.0.9
>>>>> - share your file for a profiler analysis
>>>>> - as said by Itai (who implemented it) try enabling subsampling in
>>>>> PDFRenderer (read the javadoc first). Compare the results and decide
>>>>> whether the quality is OK for you.
>>>>> - set the energy settings of your computer to maximum or at least to
>>>>> "balanced", not to "energy save"
>>>>> - don't know if adding GPU will help;
>>>>> - try also the
>>>>> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true option
>>>>>
>>>>> The speed is not related to the size but to the complexity. 32 seconds
>>>>> may sound disappointing but it's not the worst I've ever seen. "Nice
>>>>> illustrations" with nested patterns or large shadings may be slow.
>>>>>
>>>>> Tilman
>>>>>
>>>>> Am 16.04.2018 um 09:21 schrieb Arthur Wang:
>>>>>> Hi, everyone,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am using PDFBox 2.0.8 and Java 8 running in Tomcat 8 in
>>>>>> production to convert PDFs into images for display. It works very well
>>>>>> for PDF files smaller than 5 MB, taking about 3800 milliseconds.
>>>>>> However, it slows down considerably when the file size increases to
>>>>>> 50 MB: it takes about 70,000 milliseconds. After setting the system
>>>>>> property "sun.java2d.cmm" to "sun.java2d.cmm.kcms.KcmsServiceProvider",
>>>>>> the time drops to 32550 milliseconds, which is roughly double the
>>>>>> speed, but 32 seconds to load a web page is still too slow. Is
>>>>>> there any other way to speed this up? Would adding a GPU
>>>>>> to the server help? Or is there any other software or
>>>>>> hardware solution that could improve the processing speed? My current
>>>>>> server has 32 GB RAM, and it has never used more than half
>>>>>> of it.
>>>>>> thanks,
>>>>>>
>>>>>>
>>>>>> Arthur
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Fw: Performance issue with PDFBox 2.0.8

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 19.04.2018 um 00:18 schrieb Arthur Wang:
> Hi, Tilman,
>
>
> From your last email: "I don't know if my graphics card plays any role in this". By any chance, could you benchmark whether the GPU really helps PDFBox processing? I cannot plug a GPU into my Mac to test this, but if you already have a GPU in your PC, I would be very interested to know whether performance drops after you unplug it (if it's unpluggable). If it proves to be helpful, I would like to buy a nice GPU and put it into our production server to improve performance.

Sorry, no, I can't... My GPU is a $100 model (I don't play games), I
didn't find any setting to turn it on or off, and I don't have a different
replacement adapter that I could swap in. In Java there is
"-Dsun.java2d.opengl=true" but it has no effect.
https://docs.oracle.com/javase/8/docs/technotes/guides/2d/flags.html#opengl
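
A minimal sketch for checking which of the rendering-related settings from this thread the running JVM actually sees. The class name Java2dSettings is made up for illustration and is not part of PDFBox. Since -D options only take effect as JVM options or through System.setProperty(), not as program arguments, printing them from inside the application is a quick sanity check:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: collects the system
// properties mentioned in this thread and their current values.
class Java2dSettings {

    static final String[] KEYS = {
        "sun.java2d.opengl",   // OpenGL pipeline flag from the Oracle 2D flags page
        "sun.java2d.cmm",      // color management module, e.g. the KCMS provider
        "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion",
        "java.awt.headless"
    };

    // Returns each property with its current value, or "(not set)".
    static Map<String, String> report() {
        Map<String, String> result = new LinkedHashMap<>();
        for (String key : KEYS) {
            result.put(key, System.getProperty(key, "(not set)"));
        }
        return result;
    }

    public static void main(String[] args) {
        report().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Started with "java -Dsun.java2d.opengl=true Java2dSettings", the first line should show the flag as "true"; passed after the class name instead, it stays "(not set)".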

The only things that make a difference on my system (Windows) are
enabling subsampling and setting the energy settings to max performance.
With both, the first page of the Ashley file is rendered in 922ms.
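
Timings like these are only comparable if one-time initialization is kept out of the measurement. A rough sketch of such a timing harness follows. RenderTimer is a hypothetical name, and the Runnable stands in for the actual render call (which would have to wrap PDFBox's checked IOException); neither is PDFBox API:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing helper: runs the task a few times untimed (warm-up),
// then reports the fastest of the timed runs in milliseconds.
class RenderTimer {

    static long bestMillis(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns < 1) {
            throw new IllegalArgumentException("timedRuns must be at least 1");
        }
        // Warm-up: class loading, color profile setup and JIT compilation
        // happen here and are not counted.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return TimeUnit.NANOSECONDS.toMillis(best);
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the real render call.
        Runnable work = () -> {
            double sum = 0;
            for (int i = 1; i < 5_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                System.out.println(sum); // keeps the loop from being optimized away
            }
        };
        System.out.println("Best of 3 runs: " + bestMillis(work, 1, 3) + " ms");
    }
}
```

Taking the best of several runs rather than the average is a design choice that filters out interference from other processes; an average or median would be just as easy to report.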

Tilman
>
>
> If this takes too much effort or time, never mind.
>
>
> thanks for all the help,
>
>
> Arthur
>
>
> ________________________________
> From: Tilman Hausherr <TH...@t-online.de>
> Sent: Tuesday, April 17, 2018 3:17 PM
> To: users@pdfbox.apache.org
> Subject: Re: Fw: Performance issue with PDFBox 2.0.8
>
> Hi,
>
> Yeah, for thumbnails / previews the subsampling option is definitely
> for you.
>
> Can you calculate the preview in the background? I.e. at the time the
> PDFs are uploaded, instead of when the download page is requested?
>
> Re pdf.js you can test it here:
> https://mozilla.github.io/pdf.js/web/viewer.html
>
> I tried the Herman file, to me it seemed that it is slower with PDF.js
> than PDFBox, which is a bit surprising because usually they're faster.
>
> With PDFDebugger with subsampling enabled it is rendered in 4409ms on my
> system. I don't know if my graphics card plays any role in this.
>
> Tilman
>
> Am 17.04.2018 um 23:40 schrieb Arthur Wang:
>> Arthur Wang has shared OneDrive files with you. To view them, click the links below.
>>
>>
>> Herman & hiss - PPHI101201 - FV 1.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
>>
>> fileListPage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
>>
>> downloadpage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>
>>
>>
>> Tilman,
>>
>>
>> Since my email was rejected due to the 1 MB size limit of the Apache mail server, I am sending it again here.
>>
>>
>> First, thank you very much for the extra information and update.
>>
>>
>> My application is an internal web-based production system. Designers in our graphics department upload print-ready files to the system every hour, and other users (prepress, press, shipping, customers) log into the system to download the files. The print-ready PDF files are sometimes extremely large: 5 MB to 1 GB is most common, and 2 GB to 5 GB is rare but does happen. Please refer to the two attached screenshots (fileListPage, downloadPage). What I am trying to do is show a thumbnail on the download page. We used to show a download icon there instead of a thumbnail, but users had to download the file to their local computer before actually seeing it. Sometimes the file list page shows a long list of files and people get confused; it would be more convenient for users to have a peek at the file before actually downloading it, so it's better to have a thumbnail on the download page. As for pdf.js, I have never tried it. Do you think it can load a 40 MB or 50 MB file in one or two seconds from the Apache server?
>>
>>
>> I copied my code below for your reference (one is for testing, the other is for production).
>>
>>
>> Attached you will also find a PDF file named Herman..pdf. It has only two pages, but even when converting only the first page, the best I can do is 7 seconds. That would be very slow for a web application. If adding a GPU could improve the performance, I would certainly like to try it; I'm just not sure it's going to work.
>>
>>
>> ******************below is the test code, run in Eclipse**************
>>
>>
>> package com.test;
>>
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.rendering.ImageType;
>> import org.apache.pdfbox.rendering.PDFRenderer;
>> import org.apache.pdfbox.tools.imageio.ImageIOUtil;
>>
>> import java.awt.image.BufferedImage;
>> import java.io.File;
>> import java.io.IOException;
>> import org.apache.commons.lang3.time.StopWatch;
>> import org.apache.commons.lang3.StringUtils;
>>
>> public class PdfToImage {
>>
>>       private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>
>>       public static void main(String[] args) throws Exception{
>>
>>           System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>
>>           StopWatch stopwatch = new StopWatch();
>>
>>           stopwatch.start();
>>
>>           try (final PDDocument document = PDDocument.load(new File("/Users/someone/Desktop/Herman & hiss - PPHI101201 - FV.pdf"))){
>>               PDFRenderer pdfRenderer = new PDFRenderer(document);
>>               pdfRenderer.setSubsamplingAllowed(true);
>>               //for (int page = 0; page < document.getNumberOfPages(); ++page)
>>               for (int page = 0; page < 1; ++page)
>>               {
>>                   BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB); //<--this number have performance impact
>>                   String fileName = OUTPUT_DIR + "Herman & hiss - PPHI101201 - FV" + page + ".jpg";
>>                   ImageIOUtil.writeImage(bim, fileName, 72); //<---this number
>>               }
>>               document.close();
>>           } catch (IOException e){
>>               System.err.println("Exception while trying to create pdf document - " + e);
>>           }
>>
>>            stopwatch.stop(); // optional
>>           System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>>
>>
>>       }
>>       //test file: Ashley NJ_HHL101125_FV.pdf, 40M, 4 pages
>>       //try Ashley without the property set: 4 pages@70074 milliseconds
>>       //try Ashley with the property set:    4 pages@32552 milliseconds
>>       //try with subsampling enabled: 4 pages@9481 milliseconds
>>       //try Herman & hiss - PPHI101201 - FV.png: two pages@14050 milliseconds
>>       //try Herman & hiss - PPHI101201 - FV.jpg: two pages@13612 milliseconds
>>       //try Herman: 1 page@7625
>>       //try Ashley: 1 page@3237
>>       //try Ashley with 72 dpi: 1 page@2807
>>       //try Herman with 72 dpi: 1 page@6788
>>       //try Herman without subsampling enabled: 1 page@7087
>>
>> }
>>
>>
>>
>> *****************below is production code running as an action class of struts *********
>>
>>
>> public void processPdf(String pdfFilePath, String imageFilePath){
>>
>>           System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>
>>           try (final PDDocument document = PDDocument.load(new File(pdfFilePath))){
>>               PDFRenderer pdfRenderer = new PDFRenderer(document);
>>               pdfRenderer.setSubsamplingAllowed(true);
>>               //for (int page = 0; page < document.getNumberOfPages(); ++page)
>>               for (int page = 0; page < 1; ++page)
>>               {
>>                   BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB);
>>
>>                   ImageIOUtil.writeImage(bim, imageFilePath, 72);
>>               }
>>               document.close();
>>           } catch (IOException e){
>>                   log.info("Exception while trying to create pdf document - " + e);
>>           }
>>
>>
>>       }
>>
>>
>> *********************
>>
>>
>>
>>>>> On 16.04.2018 at 09:21, Arthur Wang wrote:
>>>>>> Hi, everyone,
>>>>>>
>>>>>> I am using PDFBox 2.0.8 and Java 8 running in Tomcat 8 in
>>>>>> production to convert PDFs into images for display. It works very
>>>>>> well for PDF files smaller than 5 MB, taking about 3800
>>>>>> milliseconds. However, it slows down a lot when the file size
>>>>>> grows to 50 MB: it takes about 70,000 milliseconds. After setting
>>>>>> the system property "sun.java2d.cmm" to
>>>>>> "sun.java2d.cmm.kcms.KcmsServiceProvider", the time drops to 32550
>>>>>> milliseconds, which is roughly twice as fast. But 32 seconds to
>>>>>> load a web page is still too slow. Is there any other way to speed
>>>>>> this up? Would adding a GPU to the server help? Or is there any
>>>>>> other software or hardware solution that could improve the
>>>>>> processing speed? My current server has 32 GB of RAM and never
>>>>>> uses more than half of it.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>>
>>>>>> Arthur
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Fw: Performance issue with PDFBox 2.0.8

Posted by Arthur Wang <ar...@hotmail.com>.
Hi, Tilman,


From your last email ("I don't know if my graphics card plays any role in this"): by any chance, could you benchmark whether the GPU really helps PDFBox processing? I cannot plug a GPU into my Mac to test this, but if you already have a GPU in your PC, I would be very interested to know whether performance drops after you unplug it (if it is unpluggable). If it proves to be helpful, I would like to buy a good GPU and put it into our production server to improve performance.


If this takes much effort or time, never mind,


thanks for all the help,


Arthur


________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Tuesday, April 17, 2018 3:17 PM
To: users@pdfbox.apache.org
Subject: Re: Fw: Performance issue with PDFBox 2.0.8

Hi,

Yeah, for thumbnails / previews the subsampling option is definitely
for you.

Can you calculate the preview in the background? I.e. at the time the
PDFs are uploaded, instead of when the download page is requested?

Re pdf.js you can test it here:
https://mozilla.github.io/pdf.js/web/viewer.html



I tried the Herman file, to me it seemed that it is slower with PDF.js
than PDFBox, which is a bit surprising because usually they're faster.

With PDFDebugger with subsampling enabled it is rendered in 4409ms on my
system. I don't know if my graphics card plays any role in this.

Tilman

On 17.04.2018 at 23:40, Arthur Wang wrote:
> Files shared via OneDrive:
>
> Herman & hiss - PPHI101201 - FV 1.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
> fileListPage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
> downloadpage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>
>
>
> Tilman,
>
>
> Since my email was rejected due to the Apache mail server's size limit (1 MB), I am sending it again here.
>
>
> First, thank you very much for the extra information and update.
>
>
> My application is an internal web-based production system. Designers in our graphics department upload print-ready files to the system every hour, and other users (prepress, press, shipping, customers) log in to download them. The print-ready PDF files are sometimes extremely large: 5 MB to 1 GB is most common, and 2 GB to 5 GB is rare but does happen. Please refer to the two attached screenshots (fileListPage, downloadPage). What I am trying to do is show a thumbnail on the download page. We used to show a download icon there instead, but then users have to download the file to their local computer before they can see it. The fileListPage sometimes shows a long list of files and people get confused, so it would be more convenient to let users peek at a file before actually downloading it. As for pdf.js, I have never tried it; do you think it can load a 40 MB or 50 MB file in one or two seconds from the Apache server?
>
>
> I copied my code below for your reference (one version is for testing, the other is for production).
>
>
> Attached you will also find a PDF file named Herman..pdf. It has only two pages, but even when converting only the first page, the best I can do is 7 seconds. That is very slow for a web application. If adding a GPU would improve performance, I would certainly like to try it; I am just not sure it is going to work.
>
>
> ****************** below is the test code, run in Eclipse **************
>
>
> package com.test;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.ImageType;
> import org.apache.pdfbox.rendering.PDFRenderer;
> import org.apache.pdfbox.tools.imageio.ImageIOUtil;
>
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import org.apache.commons.lang3.time.StopWatch;
>
> public class PdfToImage {
>
>      private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>
>      public static void main(String[] args) throws Exception {
>
>          // Select the KCMS color management module (set before any rendering starts)
>          System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>
>          StopWatch stopwatch = new StopWatch();
>          stopwatch.start();
>
>          // try-with-resources closes the document, so no explicit close() is needed
>          try (final PDDocument document = PDDocument.load(new File("/Users/someone/Desktop/Herman & hiss - PPHI101201 - FV.pdf"))) {
>              PDFRenderer pdfRenderer = new PDFRenderer(document);
>              pdfRenderer.setSubsamplingAllowed(true);
>              // Render only the first page; use document.getNumberOfPages() as the bound to render all pages
>              for (int page = 0; page < 1; ++page)
>              {
>                  BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB); // the dpi value affects performance
>                  String fileName = OUTPUT_DIR + "Herman & hiss - PPHI101201 - FV" + page + ".jpg";
>                  ImageIOUtil.writeImage(bim, fileName, 72); // dpi written to the image metadata
>              }
>          } catch (IOException e) {
>              System.err.println("Exception while trying to create pdf document - " + e);
>          }
>
>          stopwatch.stop(); // optional
>          System.out.println("Time elapsed is " + stopwatch.getTime() + " milliseconds");
>      }
>
>      // Test file: Ashley NJ_HHL101125_FV.pdf, 40 MB, 4 pages
>      // Ashley without the property set: 4 pages @ 70074 milliseconds
>      // Ashley with the property set: 4 pages @ 32552 milliseconds
>      // with subsampling allowed: 4 pages @ 9481 milliseconds
>      // Herman & hiss - PPHI101201 - FV.png: two pages @ 14050 milliseconds
>      // Herman & hiss - PPHI101201 - FV.jpg: two pages @ 13612 milliseconds
>      // Herman: 1 page @ 7625
>      // Ashley: 1 page @ 3237
>      // Ashley with 72 dpi: 1 page @ 2807
>      // Herman with 72 dpi: 1 page @ 6788
>      // Herman without subsampling allowed: 1 page @ 7087
>
> }
>
>
>
> *****************below is production code running as an action class of struts *********
>
>
> public void processPdf(String pdfFilePath, String imageFilePath) {
>
>          // Ideally set once at application startup rather than on every call
>          System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>
>          // try-with-resources closes the document, so no explicit close() is needed
>          try (final PDDocument document = PDDocument.load(new File(pdfFilePath))) {
>              PDFRenderer pdfRenderer = new PDFRenderer(document);
>              pdfRenderer.setSubsamplingAllowed(true);
>              // Only the first page (index 0) is rendered for the thumbnail
>              BufferedImage bim = pdfRenderer.renderImageWithDPI(0, 72, ImageType.RGB);
>              ImageIOUtil.writeImage(bim, imageFilePath, 72);
>          } catch (IOException e) {
>              // Log the failure together with its stack trace
>              log.error("Exception while trying to create pdf document", e);
>          }
>      }
>
>
> *********************
>
>
>
> ________________________________
> From: Tilman Hausherr <TH...@t-online.de>
> Sent: Tuesday, April 17, 2018 10:39 AM
> To: users@pdfbox.apache.org
> Subject: Re: Performance issue with PDFBox 2.0.8
>
> Hi,
>
> I ran the Ashley file through the profiler. Most of the time is spent
> decoding the JPEG images within and converting some of them from CMYK to
> RGB. Nothing to optimize. I also found another one-time initialization
> that takes 100-300ms, which I will add to the next version of PDFDebugger.
>
>       FilterFactory.INSTANCE.getFilter(COSName.FLATE_DECODE);
>
> I also tested the UsePureJavaCMYKConversion, it made rendering much
> slower. IIRC, that option only helps with files with many tiny CMYK images.
>
> I have committed a change that adds the subsampling option to
> PDFToImage. That version will be available within a few hours at
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.10-SNAPSHOT/
> (look for today's date).
>
> Or get the source code here:
> https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?revision=1829374&view=markup
>
> What type of application are you creating? If you want to show a PDF in
> the browser, PDF.js works nicely, is free and included in firefox. If
> you want to do thumbnails, then you should use a smaller dpi value. In
> that case using subsampling would help even more.
>
> Tilman
>
>
>
> On 17.04.2018 at 07:26, Tilman Hausherr wrote:
>> Hi,
>>
>> I have a Ryzen 1700 cpu and for tests I'm running it on max energy
>> settings. It is unclear if a mac has a similar setting.  This url
>> http://www.macos.utah.edu/documentation/administration/pmset.html
>> shows there is a setting for "better performance" but I don't know if
>> that does the same as on Windows where I get a performance doubling.
>> Try PDFDebugger, it has a built-in benchmark feature, it shows the
>> rendering speed in the status line.
>>
>> I'm also avoiding that one-time initializations are part of the
>> benchmark results with this code that is also in PDFDebugger:
>>
>>          // trigger premature initializations for more accurate rendering benchmarks
>>          // See discussion in PDFBOX-3988
>>          if (PDType1Font.COURIER.isStandard14())
>>          {
>>              // Yes this is always true
>>              PDDeviceCMYK.INSTANCE.toRGB(new float[] { 0, 0, 0, 0} );
>>              PDDeviceRGB.INSTANCE.toRGB(new float[] { 0, 0, 0 } );
>>              IIORegistry.getDefaultInstance();
>>          }
>>
>> I see you're using the PDFToImage utility. That one doesn't support
>> subsampling yet, it has been on my "todo" list for a few days, I'll
>> try to do it tonight... But PDFToImage is really just a command line
>> utility.
>>
>> Args 7, 8 and 11 don't work that way. Re arg 7 and 8, you need to call
>> System.setProperty(). Re arg 11, you need to have a PDFRenderer object.
>>
>> Another way to convert to images is explained here:
>> https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images
>>
>>
>> there call pdfRenderer.setSubsamplingAllowed(true) to activate
>> subsampling. PDFDebugger also supports it in the menu.
>>
>> Tilman
>>
>> On 17.04.2018 at 01:20, Arthur Wang wrote:
>>> Tilman,
>>>
>>>
>>> Thanks for the quick response and for testing my case. Below is my
>>> Java code and my test results after allowing subsampling. For
>>> the first page of the Ashley file, it takes 3362 milliseconds.
>>>
>>> For the Gill file, the elapsed time is 2456 milliseconds.
>>>
>>> My tests were conducted on my Mac with a 2.2 GHz Core i7 processor.
>>> How come your PC runs so fast? 1.4 seconds is fast enough for web
>>> access. Maybe there is something wrong with my code? I would
>>> appreciate it if you could take a look at my code.
>>>
>>>
>>> Best,
>>>
>>>
>>> Arthur
>>>
>>>
>>> *******************
>>>
>>> import org.apache.pdfbox.tools.PDFToImage;
>>> import java.io.File;
>>> import org.apache.commons.lang3.time.StopWatch;
>>>
>>>
>>> public class PdfToImage2 {
>>>
>>>       private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>>
>>>       public static void main(String[] args) throws Exception{
>>>
>>>           String pdfPath = "/Users/someone/Desktop/Ashley NJ_HHL101125_FV.pdf";
>>>           //config option 2: convert page 1 of the pdf to an image
>>>           String [] args_1 =  new String[13];
>>>           args_1[0] = "-startPage";
>>>           args_1[1] = "1";
>>>           args_1[2] = "-endPage";
>>>           args_1[3] = "1";
>>>           args_1[4] = "-outputPrefix";
>>>           args_1[5] = OUTPUT_DIR+"Ashley NJ_HHL101125_FV1";
>>>           args_1[6] = pdfPath;
>>>           args_1[7] = "-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
>>>           args_1[8] = "true";
>>>           args_1[9] = "-dpi";
>>>           args_1[10] = "72";//@48-->3283 milliseconds, @96-->3545 milliseconds, @72-->3362 milliseconds
>>>           args_1[11] = "-PDFRenderer.setSubsamplingAllowed";
>>>           args_1[12] = "true";
>>>
>>>           // skip the conversion when the output image already exists
>>>           File f = new File(args_1[5]+"1.jpg");
>>>           if(f.exists() && !f.isDirectory()) {
>>>               System.out.println("file exist already");
>>>           }
>>>           else{
>>>
>>>               StopWatch stopwatch = new StopWatch();
>>>
>>>               stopwatch.start();
>>>
>>>               try {
>>>                   System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>>                   PDFToImage.main(args_1);
>>>                   System.out.println("Done!");
>>>               } catch (Exception e) {
>>>                   System.err.println("Exception while trying to create pdf document - " + e);
>>>               }
>>>
>>>               stopwatch.stop(); // optional
>>>               System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>>>
>>>           }//else
>>>
>>>           //first try without setting property: 3779 milliseconds
>>>           //second try with the property set: 3852 milliseconds
>>>           //third try with subsamplingAllowed: 3362 milliseconds
>>>
>>>       }
>>> }
>>>
>>> *******************************
>>>
>>> ________________________________
>>> From: Tilman Hausherr <TH...@t-online.de>
>>> Sent: Monday, April 16, 2018 10:55 AM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: Performance issue with PDFBox 2.0.8
>>>
>>> The java code didn't get through, most attachments get deleted. Call
>>> PDFRenderer.setSubsamplingAllowed(true) to activate subsampling.
>>>
>>> I had a look at your files... These are not extremely slow renderings. 4
>>> seconds for such a page is pretty good.
>>>
>>> On my PC, the first page of the Ashley file is rendered in PDFDebugger
>>> in 1.4 seconds at 72dpi. The Gill file is done in less than a second.
>>>
>>> Tilman


Re: Fw: Performance issue with PDFBox 2.0.8

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

Yeah, for thumbnails / previews the subsampling option is definitely
for you.
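
To illustrate the general idea behind subsampling, here is a minimal sketch using only the JDK's javax.imageio API. It decodes an image while keeping only every n-th pixel, which is cheaper than decoding at full size and scaling down afterwards. This is not PDFBox code and is not meant to describe how PDFBox implements the option internally; the class name SubsampledDecode and its helper methods are made up for this example.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;

// Hypothetical helper for illustration; not part of PDFBox.
final class SubsampledDecode {

    // Decodes an encoded image, keeping only every step-th pixel in x and y.
    static BufferedImage decode(byte[] encoded, int step) {
        try (ImageInputStream in =
                 new MemoryCacheImageInputStream(new ByteArrayInputStream(encoded))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("no ImageIO reader found for this data");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                ImageReadParam param = reader.getDefaultReadParam();
                // Same step horizontally and vertically, no offset.
                param.setSourceSubsampling(step, step, 0, 0);
                return reader.read(0, param);
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes an image as PNG in memory so the example needs no files.
    static byte[] encodePng(BufferedImage image) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ImageIO.write(image, "png", bytes);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] png = encodePng(new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB));
        BufferedImage small = decode(png, 4);
        System.out.println(small.getWidth() + " x " + small.getHeight());
    }
}
```

The trade-off is the same one to check with PDFRenderer: less work per image, possibly lower quality, so compare the output before relying on it.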

Can you calculate the preview in the background? I.e. at the time the 
PDFs are uploaded, instead of when the download page is requested?
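
A rough sketch of what "calculate the preview in the background" could look like, assuming the upload handler can call a hook when a file has been stored. All names here (ThumbnailService, FirstPageRenderer, onUpload) are hypothetical; in a real setup the renderer would wrap the PDFRenderer.renderImageWithDPI and ImageIOUtil.writeImage calls already used in this thread. The sketch uses only the JDK so that it stays self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical service: renders each uploaded PDF's thumbnail once, off the request thread.
final class ThumbnailService implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final ConcurrentMap<Path, CompletableFuture<Path>> jobs = new ConcurrentHashMap<>();
    private final FirstPageRenderer renderer;
    private final Path thumbnailDir;

    ThumbnailService(FirstPageRenderer renderer, Path thumbnailDir) {
        this.renderer = renderer;
        this.thumbnailDir = thumbnailDir;
    }

    // Call when an upload finishes. Returns immediately; repeated calls for the
    // same PDF share one job. Note that a failed job stays cached in this sketch.
    CompletableFuture<Path> onUpload(Path pdf) {
        return jobs.computeIfAbsent(pdf,
                p -> CompletableFuture.supplyAsync(() -> renderOnce(p), pool));
    }

    private Path renderOnce(Path pdf) {
        Path image = thumbnailDir.resolve(pdf.getFileName() + ".jpg");
        try {
            if (!Files.exists(image)) {
                renderer.render(pdf, image); // the slow part runs on a pool thread
            }
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("thumbnails");
        // Stand-in renderer that writes a placeholder file instead of a real image.
        FirstPageRenderer fake = (pdf, image) -> Files.write(image, new byte[] {0});
        try (ThumbnailService service = new ThumbnailService(fake, dir)) {
            Path thumbnail = service.onUpload(dir.resolve("example.pdf")).join();
            System.out.println("thumbnail ready: " + thumbnail.getFileName());
        }
    }
}

// Hypothetical abstraction over "render the first page of this PDF into that image file".
interface FirstPageRenderer {
    void render(Path pdf, Path image) throws IOException;
}
```

The download page then only has to serve the stored image, or show a placeholder while the returned future is not yet complete.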

Re pdf.js you can test it here:
https://mozilla.github.io/pdf.js/web/viewer.html

I tried the Herman file, to me it seemed that it is slower with PDF.js 
than PDFBox, which is a bit surprising because usually they're faster.

With PDFDebugger with subsampling enabled it is rendered in 4409ms on my 
system. I don't know if my graphics card plays any role in this.
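
To check on your own hardware whether a change (another machine, a graphics card, a JVM option) makes a measurable difference, the timings can be compared with a small harness like the sketch below. It runs the task a few times untimed first, so that one-time initializations do not distort the result, and then reports the median of several timed runs. The class and method names are made up for this example, and the placeholder task in main stands in for a real render call.

```java
import java.util.Arrays;

// Hypothetical timing helper for comparing two configurations.
final class RenderBenchmark {

    // Runs the task `warmups` times untimed, then `runs` times timed,
    // and returns the median duration in milliseconds.
    static double medianMillis(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run(); // absorbs class loading and other one-time setup
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        int mid = runs / 2;
        long median = (runs % 2 == 1) ? nanos[mid] : (nanos[mid - 1] + nanos[mid]) / 2;
        return median / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the rendering call to be measured.
        Runnable placeholder = () -> {
            double sum = 0;
            for (int i = 0; i < 2_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                System.out.println(sum); // keeps the loop from being optimized away
            }
        };
        System.out.printf("median: %.2f ms%n", medianMillis(placeholder, 3, 5));
    }
}
```

Running the same harness with and without the change under test, on the same file and dpi, gives numbers that can be compared directly.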

Tilman




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Fw: Performance issue with PDFBox 2.0.8

Posted by Arthur Wang <ar...@hotmail.com>.
Arthur Wang has shared OneDrive files with you. To view them, click the links below.

Herman & hiss - PPHI101201 - FV 1.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQ062sb5LoKlZkC4>
fileListPage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ7jWRqI5BtoKiMx>
downloadpage.png <https://1drv.ms/u/s!AhA_REgBppCpgQ9zgx9cBhmI2DfH>




Tilman,


My email was rejected because it exceeded the Apache mail server's 1 MB size limit, so I am sending it again here.


First, thank you very much for the extra information and update.


My application is an internal web-based production system. Designers in our graphics department upload print-ready files to the system every hour, and other users, including prepress, press, shipping, and customers, log into the system to download them. The print-ready PDF files are sometimes extremely large: 5 MB to 1 GB is most common, and 2 GB to 5 GB is rare but does happen. Please refer to the two attached screenshots (fileListPage, downloadPage). What I am trying to do is show a thumbnail on the download page. We used to show a download icon there instead of a thumbnail, so users had to download the file to their local computer before they could actually see it. The fileListPage sometimes shows a long list of files and people get confused, so it would be more convenient for users to get a peek at a file before downloading it. That is why a thumbnail on the download page would be better. As for pdf.js, I have never tried it. Do you think it can load a 40 MB or 50 MB file in one or two seconds when served by the Apache server?


I copied my code below for your reference (one is for testing, the other is for production).


Attached you will also find a PDF file named Herman..pdf. It has only two pages, but even when converting only the first page, the best I can do is 7 seconds. That would be very slow for a web application. If adding a GPU would improve the performance, I would certainly like to try it; I am just not sure whether it is going to work.


******************below is the testing code, running on the Eclipse platform**************


package com.test;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import org.apache.commons.lang3.time.StopWatch;

public class PdfToImage {

    private static final String OUTPUT_DIR = "/Users/someone/Desktop/";

    public static void main(String[] args) throws Exception{

        // select the KCMS color management module before any rendering happens
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");

        StopWatch stopwatch = new StopWatch();

        stopwatch.start();

        // try-with-resources closes the document, so no explicit close() is needed
        try (final PDDocument document = PDDocument.load(new File("/Users/someone/Desktop/Herman & hiss - PPHI101201 - FV.pdf"))){
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            pdfRenderer.setSubsamplingAllowed(true);
            //for (int page = 0; page < document.getNumberOfPages(); ++page)
            for (int page = 0; page < 1; ++page)
            {
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB); //<--this number has a performance impact
                String fileName = OUTPUT_DIR + "Herman & hiss - PPHI101201 - FV" + page + ".jpg";
                ImageIOUtil.writeImage(bim, fileName, 72); //<---this number
            }
        } catch (IOException e){
            System.err.println("Exception while trying to create pdf document - " + e);
        }

        stopwatch.stop(); // optional
        System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");


    }
    // test file: Ashley NJ_HHL101125_FV.pdf, 40M, 4 pages
    // Ashley without the property set: 4 pages @ 70074 milliseconds
    // Ashley with the property set:    4 pages @ 32552 milliseconds
    // with subsampling set to true:    4 pages @ 9481 milliseconds
    // Herman & hiss - PPHI101201 - FV.png: two pages @ 14050 milliseconds
    // Herman & hiss - PPHI101201 - FV.jpg: two pages @ 13612 milliseconds
    // Herman: 1 page @ 7625
    // Ashley: 1 page @ 3237
    // Ashley with 72 dpi: 1 page @ 2807
    // Herman with 72 dpi: 1 page @ 6788
    // Herman without subsampling set to true: 1 page @ 7087

}



*****************below is the production code, running in a Struts action class *********


public void processPdf(String pdfFilePath, String imageFilePath){

        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");

        // try-with-resources closes the document, so no explicit close() is needed
        try (final PDDocument document = PDDocument.load(new File(pdfFilePath))){
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            pdfRenderer.setSubsamplingAllowed(true);
            //for (int page = 0; page < document.getNumberOfPages(); ++page)
            // only the first page is rendered; looping over all pages would overwrite imageFilePath each time
            for (int page = 0; page < 1; ++page)
            {
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 72, ImageType.RGB);

                ImageIOUtil.writeImage(bim, imageFilePath, 72);
            }
        } catch (IOException e){
                log.info("Exception while trying to create pdf document - " + e);
        }


    }


*********************
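
Since the thumbnail only has to be produced once per uploaded file, the rendering cost can be kept off the page-view path by checking for an existing image first. The following is only a sketch under stated assumptions: ThumbnailCache, Renderer, isStale and ensureThumbnail are hypothetical names, and the Renderer hook stands in for a call such as processPdf above. It uses only the JDK, not PDFBox.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ThumbnailCache {

    /** Hypothetical hook for the actual rendering, e.g. a wrapper around processPdf. */
    @FunctionalInterface
    public interface Renderer {
        void render(Path pdf, Path image) throws IOException;
    }

    /** A thumbnail is stale if it does not exist or is older than its PDF. */
    public static boolean isStale(boolean imageExists, long imageMillis, long pdfMillis) {
        return !imageExists || imageMillis < pdfMillis;
    }

    /**
     * Invokes the renderer only when the thumbnail is missing or outdated.
     * Returns true if the renderer was invoked, false if the cached image was reused.
     */
    public static boolean ensureThumbnail(Path pdf, Path image, Renderer renderer) throws IOException {
        boolean exists = Files.isRegularFile(image);
        long imageMillis = exists ? Files.getLastModifiedTime(image).toMillis() : 0L;
        long pdfMillis = Files.getLastModifiedTime(pdf).toMillis();
        if (!isStale(exists, imageMillis, pdfMillis)) {
            return false; // cached image is up to date, skip the expensive render
        }
        renderer.render(pdf, image);
        return true;
    }
}
```

A call could look like ThumbnailCache.ensureThumbnail(pdfPath, jpgPath, (p, i) -> processPdf(p.toString(), i.toString())), run right after upload so that later page views only read the saved image.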



________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Tuesday, April 17, 2018 10:39 AM
To: users@pdfbox.apache.org
Subject: Re: Performance issue with PDFBox 2.0.8

Hi,

I ran the Ashley file through the profiler, most time is used for
decoding the jpeg files within and converting some of them from CMYK to
RGB. Nothing to optimize. I also found another one-time initialization
that takes 100-300ms, which I will add to the next version of PDFDebugger.

     FilterFactory.INSTANCE.getFilter(COSName.FLATE_DECODE);

I also tested the UsePureJavaCMYKConversion, it made rendering much
slower. IIRC, that option only helps with files with many tiny CMYK images.

I have committed a change that adds the subsampling option to
PDFToImage, that version will be available within a few hours at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.10-SNAPSHOT/
look for today's date.

Or get the source code here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?revision=1829374&view=markup

What type of application are you creating? If you want to show a PDF in
the browser, PDF.js works nicely, is free and included in firefox. If
you want to do thumbnails, then you should use a smaller dpi value. In
that case using subsampling would help even more.

Tilman



On 17.04.2018 at 07:26, Tilman Hausherr wrote:
> Hi,
>
> I have a Ryzen 1700 cpu and for tests I'm running it on max energy
> settings. It is unclear if a mac has a similar setting.  This url
> http://www.macos.utah.edu/documentation/administration/pmset.html
> shows there is a setting for "better performance" but I don't know if
> that does the same as on Windows where I get a performance doubling.
> Try PDFDebugger, it has a built-in benchmark feature, it shows the
> rendering speed in the status line.
>
> I'm also avoiding that one-time initializations are part of the
> benchmark results with this code that is also in PDFDebugger:
>
>         // trigger premature initializations for more accurate rendering benchmarks
>         // See discussion in PDFBOX-3988
>         if (PDType1Font.COURIER.isStandard14())
>         {
>             // Yes this is always true
>             PDDeviceCMYK.INSTANCE.toRGB(new float[] { 0, 0, 0, 0} );
>             PDDeviceRGB.INSTANCE.toRGB(new float[] { 0, 0, 0 } );
>             IIORegistry.getDefaultInstance();
>         }
>
> I see you're using the PDFToImage utility. That one doesn't support
> subsampling yet, it has been on my "todo" list for a few days, I'll
> try to do it tonight... But PDFToImage is really just a command line
> utility.
>
> Args 7, 8 and 11 don't work that way. Re arg 7 and 8, you need to call
> System.setProperty(). Re arg 11, you need to have a PDFRenderer object.
>
> Another way to convert to images is explained here:
> https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images
>
>
> there call pdfRenderer.setSubsamplingAllowed(true) to activate
> subsampling. PDFDebugger also supports it in the menu.
>
> Tilman
>
> On 17.04.2018 at 01:20, Arthur Wang wrote:
>> Tilman,
>>
>>
>> Thanks for the quick response and for testing my case. Below is my
>> Java code and my test results after allowing subsampling. For
>> the first page of the Ashley file, it took 3362 milliseconds.
>>
>> For the Gill file, the elapsed time is 2456 milliseconds.
>>
>> My tests were run on my Mac with a 2.2 GHz Core i7 processor.
>> How come your PC runs so fast? 1.4 seconds is fast enough for web
>> access. Maybe there is something wrong with my code? I would
>> appreciate it if you took a look at it.
>>
>>
>> Best,
>>
>>
>> Arthur
>>
>>
>> *******************
>>
>> import org.apache.pdfbox.tools.PDFToImage;
>> //import java.awt.image.BufferedImage;
>> import java.io.File;
>> //import java.io.IOException;
>> //import java.io.OutputStream;
>> import org.apache.commons.lang3.time.StopWatch;
>>
>>
>> public class PdfToImage2 {
>>
>>      private static final String OUTPUT_DIR = "/Users/someone/Desktop/";
>>
>>      public static void main(String[] args) throws Exception{
>>
>>          String pdfPath = "/Users/someone/Desktop/Ashley NJ_HHL101125_FV.pdf";
>>          //config option 2:convert page 1 in pdf to image
>>          String [] args_1 =  new String[13];
>>          args_1[0] = "-startPage";
>>          args_1[1] = "1";
>>          args_1[2] = "-endPage";
>>          args_1[3] = "1";
>>          args_1[4] = "-outputPrefix";
>>          args_1[5] = OUTPUT_DIR+"Ashley NJ_HHL101125_FV1";
>>          args_1[6] = pdfPath;
>>          args_1[7] = "-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
>>          args_1[8] = "true";
>>          args_1[9] = "-dpi";
>>          args_1[10] = "72";//@48-->3283 milliseconds, @96-->3545 milliseconds, @72-->3362 milliseconds
>>          args_1[11] = "-PDFRenderer.setSubsamplingAllowed";
>>          args_1[12] = "true";
>>
>>          File f = new File(args_1[5]+"1.jpg");
>>          if(f.exists() && !f.isDirectory()) {
>>              System.out.println("file exist already");
>>          }
>>          else{
>>
>>              StopWatch stopwatch = new StopWatch();
>>
>>              stopwatch.start();
>>
>>              try {
>>
>>                  System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
>>                  PDFToImage.main(args_1);
>>                  System.out.println("Done!");
>>              } catch (Exception e) {
>>                  System.err.println("Exception while trying to create pdf document - " + e);
>>              }
>>
>>              stopwatch.stop(); // optional
>>              System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");
>>
>>
>>          }//else
>>
>>          //first try without setting property: 3779 milliseconds
>>          //second try with the property set: 3852 milliseconds
>>          //third try with subsamplingAllowed: 3362 milliseconds
>>
>>      }
>>
>> *******************************
>>
>> ________________________________
>> From: Tilman Hausherr <TH...@t-online.de>
>> Sent: Monday, April 16, 2018 10:55 AM
>> To: users@pdfbox.apache.org
>> Subject: Re: Performance issue with PDFBox 2.0.8
>>
>> The java code didn't get through, most attachments get deleted. Call
>> PDFRenderer.setSubsamplingAllowed(true) to activate subsampling.
>>
>> I had a look at your files... These are not extremely slow renderings. 4
>> seconds for such a page is pretty good.
>>
>> On my PC, the first page of the Ashley file is rendered in PDFDebugger
>> in 1.4 seconds at 72dpi. The Gill file is done in less than a second.
>>
>> Tilman
>>
>> On 16.04.2018 at 19:05, Arthur Wang wrote:
>>> Arthur Wang has shared OneDrive files with you. To view them, click
>>> the links below.
>>>
>>> Ashley NJ_HHL101125_FV.pdf
>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQluAoJe28B935ru>
>>>
>>> Gill1-1356_KM102685-INS_FV.pdf
>>> <https://1drv.ms/b/s%21AhA_REgBppCpgQpdnBIl_hmK6Wt0>
>>>
>>> Screen Shot 2018-04-16 at 9.23.52 AM.png
>>> <https://1drv.ms/u/s%21AhA_REgBppCpgQvygYjm2eaJQmSH>
>>>
>>> I just tried 2.0.9 and it works almost the same. Processing all 4 pages
>>> takes 32 seconds; processing only the first page takes about 4
>>> seconds.
>>>
>>>
>>> My server is an HP DL380 with dual Xeon processors and 32 GB RAM; the
>>> drive is an Intel Optane NVMe SSD.
>>>
>>> Once the JPG image is produced, access to the image is almost
>>> instant regardless of the size of the image file, so the time to open
>>> and close the image file is insignificant and can be ignored.
>>>
>>>
>>> By enabling subsampling, do you mean setting the dpi option? Do you
>>> have sample code for PDFRenderer? The attached file
>>> PdfToImage2.java is my test code. Ashley...pdf is a file of
>>> about 45 MB, and Gill...pdf is a file of about 5 MB. At
>>> 1/10th the size of the other one, the processing time drops to
>>> 2657 milliseconds compared to 3779 milliseconds, so it seems the size
>>> does matter.
>>>
>>>
>>> thanks,
>>>
>>>
>>> Arthur
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> *From:* Tilman Hausherr <TH...@t-online.de>
>>> *Sent:* Monday, April 16, 2018 8:57 AM
>>> *To:* users@pdfbox.apache.org
>>> *Subject:* Re: Performance issue with PDFBox 2.0.8
>>> Please
>>> - retry with the current version 2.0.9
>>> - share your file for a profiler analysis
>>> - as said by Itai (who implemented it) try enabling subsampling in
>>> PDFRenderer (read the javadoc first). Compare the results and decide
>>> whether the quality is OK for you.
>>> - set the energy settings of your computer to maximum or at least to
>>> "balanced", not to "energy save"
>>> - don't know if adding GPU will help;
>>> - try also the
>>> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true option
>>>
>>> The speed is not related to the size but to the complexity. 32 seconds
>>> may sound disappointing but it's not the worst I've ever seen. "Nice
>>> illustrations" with nested patterns or large shadings may be slow.
>>>
>>> Tilman
>>>
>>> On 16.04.2018 at 09:21, Arthur Wang wrote:
>>>> Hi, everyone,
>>>>
>>>>
>>>>
>>>> I am using PDFBox 2.0.8 and Java 8 running in Tomcat 8 in
>>>> production to convert PDFs into images for display. It works very well
>>>> for PDF files smaller than 5 MB, taking about 3800 milliseconds.
>>>> However, it slows down a lot when the file size increases to 50 MB:
>>>> it takes about 70,000 milliseconds. After setting the system property
>>>> "sun.java2d.cmm" to "sun.java2d.cmm.kcms.KcmsServiceProvider", the
>>>> time drops to 32550 milliseconds, roughly doubling the speed. But
>>>> 32 seconds to load a web page is still too slow. Is
>>>> there any other way to speed this up? Would adding a GPU
>>>> to the server help? Or is there any other software or
>>>> hardware solution that could improve the processing speed? My current
>>>> server has 32 GB RAM, and it has never used more than half
>>>> of it.
>>>>
>>>> thanks,
>>>>
>>>>
>>>> Arthur
>>>>
>>>
>>
>>
>
>
>




Re: Performance issue with PDFBox 2.0.8

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

I ran the Ashley file through the profiler, most time is used for 
decoding the jpeg files within and converting some of them from CMYK to
RGB. Nothing to optimize. I also found another one-time initialization 
that takes 100-300ms, which I will add to the next version of PDFDebugger.

     FilterFactory.INSTANCE.getFilter(COSName.FLATE_DECODE);

I also tested the UsePureJavaCMYKConversion option; it made rendering 
much slower. IIRC, that option only helps with files that contain many 
tiny CMYK images.

I have committed a change that adds the subsampling option to 
PDFToImage. That version will be available within a few hours at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.10-SNAPSHOT/
(look for today's date).

Or get the source code here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?revision=1829374&view=markup

What type of application are you creating? If you want to show a PDF in 
the browser, PDF.js works nicely, is free, and is included in Firefox. 
If you want to create thumbnails, you should use a smaller dpi value. 
In that case subsampling would help even more.
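
To put numbers on the dpi advice: PDF page sizes are given in points 
(1/72 inch), so a page side of p points rendered at a given dpi comes out 
at roughly p * dpi / 72 pixels, and the pixel count grows with the square 
of the dpi. The sketch below only does that arithmetic with the JDK. The 
class and method names are made up for illustration, and the 612 x 792 
point (US Letter) page is just an example, not the size of the files in 
this thread.

```java
// Illustrative arithmetic only: estimates the raster size of a rendered page.
final class DpiMath {

    /** Approximate pixels along one side for a length in points at the given dpi. */
    static int pixels(float points, float dpi) {
        return Math.round(points * dpi / 72f); // 72 points per inch
    }

    /** Approximate total number of pixels for a page of w x h points. */
    static long pixelCount(float w, float h, float dpi) {
        return (long) pixels(w, dpi) * pixels(h, dpi);
    }

    public static void main(String[] args) {
        float w = 612f, h = 792f; // example page: US Letter, in points
        for (float dpi : new float[] {24f, 72f, 300f}) {
            System.out.printf("%.0f dpi -> %d x %d px (%d pixels)%n",
                    dpi, pixels(w, dpi), pixels(h, dpi), pixelCount(w, h, dpi));
        }
    }
}
```

A thumbnail at 24 dpi has about a ninth of the pixels of a 72 dpi 
rendering, which is why a low dpi combined with subsampling of the 
embedded images pays off there.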

Tilman



Re: Performance issue with PDFBox 2.0.8

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

I have a Ryzen 1700 CPU, and for tests I run it on the maximum energy 
settings. I don't know whether a Mac has a similar setting. This URL 
http://www.macos.utah.edu/documentation/administration/pmset.html shows 
there is a setting for "better performance", but I don't know if it has 
the same effect as on Windows, where I get a performance doubling. Try 
PDFDebugger: it has a built-in benchmark feature that shows the 
rendering speed in the status line.

I also keep one-time initializations out of the benchmark results with 
this code, which is also in PDFDebugger:

         // trigger premature initializations for more accurate rendering benchmarks
         // See discussion in PDFBOX-3988
         if (PDType1Font.COURIER.isStandard14())
         {
             // Yes this is always true
             PDDeviceCMYK.INSTANCE.toRGB(new float[] { 0, 0, 0, 0} );
             PDDeviceRGB.INSTANCE.toRGB(new float[] { 0, 0, 0 } );
             IIORegistry.getDefaultInstance();
         }
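
The same idea carries over to any timing code of your own: run the 
one-time work once before starting the clock. Here is a minimal JDK-only 
sketch of that pattern. `WarmupBench`, `timeMillis` and the dummy 
workload are hypothetical stand-ins, not PDFBox or PDFDebugger code; in a 
real test the two Runnables would wrap the initialization calls above and 
the render call being measured.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: times a task after running an untimed warm-up step.
final class WarmupBench {

    /** Runs warmUp once (untimed), then returns the wall-clock time of task in milliseconds. */
    static long timeMillis(Runnable warmUp, Runnable task) {
        warmUp.run();                        // one-time initialization, kept out of the measurement
        long start = System.nanoTime();
        task.run();                          // only this call is measured
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Runnable work = calls::incrementAndGet; // stand-in for rendering a page
        long ms = timeMillis(work, work);
        System.out.println("Timed run took " + ms + " ms; workload ran " + calls.get() + " times");
    }
}
```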

I see you're using the PDFToImage utility. That one doesn't support 
subsampling yet; it has been on my "todo" list for a few days, and I'll 
try to do it tonight. But PDFToImage is really just a command line 
utility.

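Args 7, 8 and 11 don't work that way. Args 7 and 8 are a -D option, i.e. 
a JVM system property and not a PDFToImage argument, so you need to call 
System.setProperty() (or pass it on the java command line). Arg 11 is a 
method name, so you need a PDFRenderer object to call it on.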
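
A minimal sketch of the System.setProperty() route, using only the JDK. 
The two property names are the ones mentioned in this thread; the class 
name `RenderProps` is made up, and the calls are assumed to run before 
any PDFBox or Java 2D code that reads the properties. It only shows the 
mechanism; whether either property helps depends on the file.

```java
// Illustrative only: sets the rendering-related system properties programmatically.
final class RenderProps {

    static void apply() {
        // Same effect as -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        // Same effect as -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true
        System.setProperty("org.apache.pdfbox.rendering.UsePureJavaCMYKConversion", "true");
    }

    public static void main(String[] args) {
        apply();
        System.out.println(System.getProperty("sun.java2d.cmm"));
        System.out.println(Boolean.getBoolean("org.apache.pdfbox.rendering.UsePureJavaCMYKConversion"));
    }
}
```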

Another way to convert to images is explained here:
https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images

In that code, call pdfRenderer.setSubsamplingAllowed(true) to activate 
subsampling. PDFDebugger also supports it in the menu.

Tilman


Re: Performance issue with PDFBox 2.0.8

Posted by Arthur Wang <ar...@hotmail.com>.
Tilman,


Thanks for the quick response and for testing my case. Below are my Java code and my test results after adding the subsampling option. For the first page of the Ashley file, it took 3362 milliseconds.

For the Gill file, the elapsed time was 2456 milliseconds.

My tests were run on my Mac with a 2.2 GHz Core i7 processor. How come your PC runs so fast? 1.4 seconds is fast enough for web access. Maybe there is something wrong with my code? I would appreciate it if you could take a look at it.


Best,


Arthur


*******************

import org.apache.pdfbox.tools.PDFToImage;
//import java.awt.image.BufferedImage;
import java.io.File;
//import java.io.IOException;
//import java.io.OutputStream;
import org.apache.commons.lang3.time.StopWatch;


public class PdfToImage2 {

    private static final String OUTPUT_DIR = "/Users/someone/Desktop/";

    public static void main(String[] args) throws Exception{

        String pdfPath = "/Users/someone/Desktop/Ashley NJ_HHL101125_FV.pdf";
        //config option 2:convert page 1 in pdf to image
        String [] args_1 =  new String[13];
        args_1[0] = "-startPage";
        args_1[1] = "1";
        args_1[2] = "-endPage";
        args_1[3] = "1";
        args_1[4] = "-outputPrefix";
        args_1[5] = OUTPUT_DIR+"Ashley NJ_HHL101125_FV1";
        args_1[6] = pdfPath;
        args_1[7] = "-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
        args_1[8] = "true";
        args_1[9] = "-dpi";
        args_1[10] = "72";//@48-->3283 milliseconds, @96>3545 milliseconds, @72--> 3362milliseconds
        args_1[11] = "-PDFRenderer.setSubsamplingAllowed";
        args_1[12] = "true";

        File f = new File(args_1[5]+"1.jpg");
        if(f.exists() && !f.isDirectory()) {
            System.out.println("file already exists");
        }
        else{

            StopWatch stopwatch = new StopWatch();

            stopwatch.start();

              try {

                System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
                PDFToImage.main(args_1);
                System.out.println("Done!");
              } catch (Exception e) {
                  System.err.println("Exception while trying to create pdf document - " + e);
              }

                 stopwatch.stop(); // optional
                System.out.println("Time elapsed is "+ stopwatch.getTime() + " milliseconds");


        }//else

        //first try without setting property: 3779 milliseconds
        //second try with the property set: 3852 milliseconds
        //third try with subsamplingAllowed: 3362 milliseconds

    }
}

*******************************


Re: Performance issue with PDFBox 2.0.8

Posted by Tilman Hausherr <TH...@t-online.de>.
The Java code didn't get through; most attachments get deleted. Call 
setSubsamplingAllowed(true) on your PDFRenderer object to activate 
subsampling.

I had a look at your files... These are not extremely slow renderings. 4 
seconds for such a page is pretty good.

On my PC, the first page of the Ashley file is rendered in PDFDebugger 
in 1.4 seconds at 72dpi. The Gill file is done in less than a second.

Tilman

Am 16.04.2018 um 19:05 schrieb Arthur Wang:
> Arthur Wang has shared OneDrive files with you. To view them, click 
> the links below.
>
> Ashley NJ_HHL101125_FV.pdf
> <https://1drv.ms/b/s%21AhA_REgBppCpgQluAoJe28B935ru>
>
> Gill1-1356_KM102685-INS_FV.pdf
> <https://1drv.ms/b/s%21AhA_REgBppCpgQpdnBIl_hmK6Wt0>
>
> Screen Shot 2018-04-16 at 9.23.52 AM.png
> <https://1drv.ms/u/s%21AhA_REgBppCpgQvygYjm2eaJQmSH>
>
> I just tried 2.0.9; it performs almost the same. Processing all 4 
> pages takes 32 seconds; processing only the first page takes about 4 
> seconds.
>
>
> My server is an HP DL380 with dual Xeon processors and 32 GB of RAM; 
> the drive is an Intel Optane NVMe SSD.
>
> Once the JPG image is produced, access to it is almost instant 
> regardless of the file size, so the time to open and close the image 
> file is insignificant and can be ignored.
>
>
> By enabling subsampling, do you mean setting the dpi option? Do you 
> have sample code for PDFRenderer? The attached file PdfToImage2.java 
> is my test code. Ashley...pdf is about 45 MB and Gill...pdf is about 
> 5 MB. At 1/10th the size, the processing time drops to 2657 
> milliseconds compared to 3779 milliseconds, so size does seem to 
> matter.
>
>
> thanks,
>
>
> Arthur
>
>
>
> ------------------------------------------------------------------------
> *From:* Tilman Hausherr <TH...@t-online.de>
> *Sent:* Monday, April 16, 2018 8:57 AM
> *To:* users@pdfbox.apache.org
> *Subject:* Re: Performance issue with PDFBox 2.0.8
> Please
> - retry with the current version 2.0.9
> - share your file for a profiler analysis
> - as said by Itai (who implemented it) try enabling subsampling in
> PDFRenderer (read the javadoc first). Compare the results and decide
> whether the quality is OK for you.
> - set the energy settings of your computer to maximum or at least to
> "balanced", not to "energy save"
> - don't know if adding GPU will help;
> - try also the
> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true option
>
> The speed is not related to the size but to the complexity. 32 seconds
> may sound disappointing but it's not the worst I've ever seen. "Nice
> illustrations" with nested patterns or large shadings may be slow.
>
> Tilman
>
> Am 16.04.2018 um 09:21 schrieb Arthur Wang:
> > Hi, everyone,
> >
> >
> >
> > I am using PDFBox 2.0.8 and java 8 running in tomcat 8 for 
> production to convert pdf into image for display. it works very well 
> for pdf file size less than 5 M, it takes about 3800 milliseconds. 
> however, it slows down very much when the file size is increased to 50 
> M. it takes about 70,000 milliseconds, after setting system property 
> of sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider", it does 
> increase the performance to 32550 milliseconds, which almost double 
> the speed. but for 32 seconds to load a web page still too slow. Is 
> there any other way to speed up the performance? would adding a GPU 
> into the server help the performance? or any other software or 
> hardware solution could help on the processing speed? My current 
> server come with 32 G RAM, and the server never used more than half of it.
> >
> >
> > thanks,
> >
> >
> > Arthur
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org



Re: Performance issue with PDFBox 2.0.8

Posted by Arthur Wang <ar...@hotmail.com>.
Arthur Wang has shared OneDrive files with you. To view them, click the links below.

Ashley NJ_HHL101125_FV.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQluAoJe28B935ru>
Gill1-1356_KM102685-INS_FV.pdf <https://1drv.ms/b/s!AhA_REgBppCpgQpdnBIl_hmK6Wt0>
Screen Shot 2018-04-16 at 9.23.52 AM.png <https://1drv.ms/u/s!AhA_REgBppCpgQvygYjm2eaJQmSH>


I just tried 2.0.9 and it performs almost the same: processing all 4 pages takes 32 seconds, while processing only the first page takes about 4 seconds.


My server is an HP DL380 with dual Xeon processors and 32 GB of RAM; the drive is an Intel Optane NVMe SSD.

Once the JPG image is produced, access to it is almost instant regardless of the size of the image file, so the time to open and close the image file is insignificant and can be ignored.


By enabling subsampling, do you mean setting the DPI option? Do you have sample code for PDFRenderer? The attached file, PdfToImage2.java, is my test code. Ashley...pdf is about 45 MB and Gill...pdf is about 5 MB. With a file 1/10th the size of the other one, the processing time drops to 2657 milliseconds compared to 3779 milliseconds, so it seems the size does matter.


thanks,


Arthur




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Performance issue with PDFBox 2.0.8

Posted by Tilman Hausherr <TH...@t-online.de>.
Please
- retry with the current version 2.0.9
- share your file for a profiler analysis
- as said by Itai (who implemented it) try enabling subsampling in 
PDFRenderer (read the javadoc first). Compare the results and decide 
whether the quality is OK for you.
- set the energy settings of your computer to maximum or at least to 
"balanced", not to "energy save"
- don't know if adding GPU will help;
- try also the 
-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true option
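
Both JVM settings named in this thread are ordinary system properties, so they can be passed with -D on the command line or set in code before the first rendering call. The following is a minimal sketch, assuming the property names exactly as quoted in the thread; the class name RenderingProperties is made up for illustration, and whether either setting helps depends on the file and the JVM in use.

```java
public class RenderingProperties {

    /** Sets the two properties discussed in this thread. Call before any rendering happens. */
    public static void apply() {
        // Color management setting that Arthur reports using on Java 8.
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        // Option suggested in the list above for CMYK conversion in PDFBox.
        System.setProperty("org.apache.pdfbox.rendering.UsePureJavaCMYKConversion", "true");
    }

    public static void main(String[] args) {
        apply();
        System.out.println(System.getProperty("sun.java2d.cmm"));
        System.out.println(
                System.getProperty("org.apache.pdfbox.rendering.UsePureJavaCMYKConversion"));
    }
}
```

Under Tomcat the same values are typically supplied as -D options through CATALINA_OPTS instead, so that they are in place before any web application code runs.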

The speed is not related to the size but to the complexity. 32 seconds 
may sound disappointing but it's not the worst I've ever seen. "Nice 
illustrations" with nested patterns or large shadings may be slow.

Tilman
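
For readers who want to try the subsampling suggestion, here is a rough sketch of rendering with it enabled. It assumes PDFBox 2.0.9 or later on the classpath; the class name SubsampledRender, the 96 DPI value, and the output file names are illustrative choices, not something taken from the attached PdfToImage2.java.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class SubsampledRender {

    /** Renders one page with subsampling allowed (requires PDFBox 2.0.9 or later). */
    public static BufferedImage renderPage(PDDocument document, int pageIndex, float dpi)
            throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        // Allows large embedded images to be decoded at reduced resolution when the
        // output is small. Quality may suffer, so compare the results.
        renderer.setSubsamplingAllowed(true);
        return renderer.renderImageWithDPI(pageIndex, dpi, ImageType.RGB);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the PDF; output goes to page-N.jpg in the working directory.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                long start = System.nanoTime();
                BufferedImage image = renderPage(document, i, 96f); // 96 DPI is an assumption
                ImageIO.write(image, "jpg", new File("page-" + (i + 1) + ".jpg"));
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("page %d: %d ms%n", i + 1, elapsedMs);
            }
        }
    }
}
```

Printing the time per page makes it easier to see whether one complex page dominates the total, which is what the remark about complexity rather than size would predict.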





Re: Performance issue with PDFBox 2.0.8

Posted by Itai <it...@gmail.com>.
If the files in question include very large JPEG or JBIG images, and if you
are rendering at a relatively low DPI, allowing subsampling may help (
https://issues.apache.org/jira/browse/PDFBOX-4137), but you will have to
upgrade to 2.0.9.
In my use case it increased speed by an order of magnitude, but it was an
extreme case (files consisting of one huge 50,000x5,000 JPEG and nothing
else).
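
To illustrate what subsampling means here, the sketch below decodes a JPEG while keeping only every n-th pixel, so the full-size image never has to be held in memory. It uses the standard javax.imageio API and is not PDFBox's internal code; the class name SubsamplingDemo is made up, and the test image is 5000x500 (a scaled-down stand-in for the 50,000x5,000 case) to keep memory use small.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class SubsamplingDemo {

    /** Decodes an encoded image, keeping only every step-th pixel in each direction. */
    public static BufferedImage readSubsampled(byte[] encoded, int step) throws IOException {
        try (ImageInputStream in =
                ImageIO.createImageInputStream(new ByteArrayInputStream(encoded))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for this format");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                ImageReadParam param = reader.getDefaultReadParam();
                // Same step horizontally and vertically, starting at the first pixel.
                param.setSourceSubsampling(step, step, 0, 0);
                return reader.read(0, param);
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a wide test JPEG in memory, then decode it at 1/10 of its size.
        BufferedImage big = new BufferedImage(5000, 500, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(big, "jpg", out);
        BufferedImage small = readSubsampled(out.toByteArray(), 10);
        System.out.println(small.getWidth() + " x " + small.getHeight());
    }
}
```

The gain only shows up when the target image is much smaller than the embedded one, which is why a low DPI is part of the condition.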


Re: Performance issue with PDFBox 2.0.8

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

> Am 16.04.2018 um 09:21 schrieb Arthur Wang <ar...@hotmail.com>:
> 
> Hi, everyone,
> 
> 
> 
> I am using PDFBox 2.0.8 and java 8 running in tomcat 8 for production to convert pdf into image for display. it works very well for pdf file size less than 5 M, it takes about 3800 milliseconds. however, it slows down very much when the file size is increased to 50 M. it takes about 70,000 milliseconds, after setting system property of sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider", it does increase the performance to 32550 milliseconds, which almost double the speed. but for 32 seconds to load a web page still too slow. Is there any other way to speed up the performance? would adding a GPU into the server help the performance? or any other software or hardware solution could help on the processing speed? My current server come with 32 G RAM, and the server never used more than half of it.

What is the time for opening and closing such a document alone, without rendering? Are there multiple pages? The difference in size might be because the larger one uses a lot of images. Would you have a sample PDF?
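
One way to answer this is to time each phase separately. The helper below is only a sketch using the standard library; PhaseTimer and the stand-in phases in main are hypothetical, and in a real test they would be replaced by loading the PDF, rendering its pages, and closing the document.

```java
import java.util.concurrent.Callable;

public class PhaseTimer {

    /** Runs one phase, prints its elapsed time in milliseconds, and returns its result. */
    public static <T> T time(String label, Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        T result = phase.call();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in phases: replace them with the real open, render, and close calls.
        String document = time("open", () -> "document");
        int pages = time("render", () -> {
            Thread.sleep(50); // simulated work
            return 4;
        });
        time("close", () -> null);
        System.out.println(document + " with " + pages + " pages");
    }
}
```

If opening and closing account for only a small share of the total, the time is going into rendering, and the rendering options are the place to look.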

> 
> 
> thanks,
> 
> 
> Arthur

