
Posted to dev@pdfbox.apache.org by Manfred Pock <po...@gmail.com> on 2015/07/14 11:26:28 UTC

Performance of the trunkversion

Hi,

we use the PDFBox trunk version to render PDFs; currently we use the
version from 12 May 2015.

Today I updated to the current version and tested it. It now seems to
need much more time to render PDFs, depending on the size of the PDF.

For example, you can try this one:

http://cloud.directupload.net/15bu

It needs five times longer than the version from May 2015.

Regards, Manfred

Re: Performance of the trunkversion

Posted by Andreas Lehmkühler <an...@lehmi.de>.

> Manfred Pock <po...@gmail.com> wrote on 14 July 2015 at 12:15:
> 
> 
> Yes, the input is an InputStream. I can try it directly from a file.
> 
> But in general we get the PDF from a document management system as a stream.
> Does it make sense for me to save the PDF to a file first?
If possible, yes. As I already said, we need random access to the PDF, and
InputStream doesn't support seek operations, so we have to copy the whole
stream to a file or to memory.
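
The copy step described above can be sketched with plain JDK classes. This is only an illustration under assumptions: the class and method names (StreamSpooler, spoolToTempFile, byteAt) are hypothetical and are not PDFBox API. It shows a stream being written to a temporary file once, after which arbitrary offsets can be reached with seek.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox: spools a stream to disk so that
// the data can be read with random access afterwards.
final class StreamSpooler {

    /** Copies the whole stream into a temp file; the caller deletes the file. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".pdf");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads the single byte at an absolute offset, or -1 past the end. */
    static int byteAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset); // possible on a file, not on a generic InputStream
            return raf.read();
        }
    }
}
```

If the application spools the stream itself like this, the resulting file could then be handed to the file-based load method mentioned earlier in the thread, so the library would not need a second copy.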

> Why is there such a big performance difference between the version from
> May and the current version if we use it with useScratchFiles = true?
I'm not sure, but the reason seems to be the altered scratch file handling. I
have to double-check that.

BR
Andreas

> Regards, Manfred
> 
> On 14.07.2015 at 12:02, Andreas Lehmkühler wrote:
> > Hi,
> >
> >> Manfred Pock <po...@gmail.com> wrote on 14 July 2015 at 11:39:
> >>
> >>
> >> OK, we load the PDF with useScratchFiles = true. If we load it with
> >> false the performance is better, but a little slower than with the old
> >> version.
> > What do you use as input, a stream or a real file? If the latter, you should
> > use the load method with the file parameter.
> >
> > PDFBox needs random access to the PDF, and if a stream is provided PDFBox
> > copies the data to a file (lower memory usage, slower performance) or to
> > memory (higher memory usage, better performance).
> >
> > BR
> > Andreas
> >
> >
> >> But now it needs more memory. I cannot load some PDFs with the current
> >> version with the same Java memory configuration.
> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

Hi Manfred,

yes, I also encountered an error while testing another file now. I will 
check the implementation.

Best,
Timo


On 15.07.2015 at 09:51, Manfred Pock wrote:
> Hi Timo,
>
> I have tried it, but it doesn't work now and I get different exceptions
> or errors.
>
> It looks like there is a problem with images of any kind; the rest
> is shown.
>
> For example:
>
> SCHWERWIEGEND: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
> java.io.IOException: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
>      at org.apache.pdfbox.filter.ccitt.TIFFFaxDecoder.decodeT6(TIFFFaxDecoder.java:1125)
>      at org.apache.pdfbox.filter.CCITTFaxFilter.decode(CCITTFaxFilter.java:94)
>      at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>      at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:278)
>      at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:120)
>      at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:67)
>      at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:340)
>
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> Jul 15, 2015 9:45:05 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
> WARNUNG: java.util.zip.DataFormatException: invalid block type
>
> Jul 15, 2015 9:46:18 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
> WARNUNG: Not a JPEG file: starts with 0xe0 0x00
>
> Jul 15, 2015 9:46:23 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
> WARNUNG: Image stream was not read - filter: DCTDecode
>
> SCHWERWIEGEND: java.util.zip.DataFormatException: invalid distance too far back
> java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>      at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>      at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>      at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>      at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>      at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>      at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>      at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>      at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>      at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>      at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>
> .... Caused by: java.util.zip.DataFormatException: invalid distance too far back
>      at java.util.zip.Inflater.inflateBytes(Native Method)
>      at java.util.zip.Inflater.inflate(Inflater.java:259)
>      at java.util.zip.Inflater.inflate(Inflater.java:280)
>      at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>      at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>
> On 15.07.2015 at 00:35, Timo Boehme wrote:
>> I've created PDFBOX-2882 with a drop-in replacement of the scratch
>> file implementation.
>> @Manfred: Could you please test if this helps in your scenario to
>> increase performance?
>>
>> Best,
>> Timo
>>
>>
>> On 14.07.2015 at 13:47, Timo Boehme wrote:
>>> Hi,
>>>
>>> instead of having a linked page list in ScratchFileBuffer, I would
>>> propose keeping a list of pages with their page numbers (integers) in
>>> memory (this takes 1k for 1MB of data). This would ease page handling,
>>> seeking would not need I/O operations, and caching of pages would be a lot
>>> easier. I may find some time later to come up with such a replacement.
>>>
>>> Best,
>>> Timo
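
The in-memory page list proposed above can be sketched as follows. This is an illustration only, not the code attached to PDFBOX-2882: the class name PageIndex and its methods are hypothetical, and the 4096-byte page size is an assumption inferred from the "1k for 1MB" estimate (256 pages at 4 bytes per page number).

```java
import java.util.Arrays;

// Hypothetical sketch: maps a logical stream offset to a physical file offset
// using only an in-memory array of page numbers, so a seek needs no file I/O.
final class PageIndex {
    static final int PAGE_SIZE = 4096; // assumed page size, not taken from PDFBox

    private int[] pageNumbers = new int[16]; // page i of the stream -> page number in the file
    private int count;

    /** Appends the file page number that holds the next logical page. */
    void addPage(int pageNumberInFile) {
        if (count == pageNumbers.length) {
            pageNumbers = Arrays.copyOf(pageNumbers, count * 2);
        }
        pageNumbers[count++] = pageNumberInFile;
    }

    /** Translates a logical offset by array lookup instead of walking linked pages. */
    long toFileOffset(long logicalOffset) {
        int index = (int) (logicalOffset / PAGE_SIZE);
        if (index >= count) {
            throw new IndexOutOfBoundsException("offset beyond last page");
        }
        return (long) pageNumbers[index] * PAGE_SIZE + logicalOffset % PAGE_SIZE;
    }

    /** Memory needed for the page numbers alone, one int per page. */
    static long indexBytesFor(long dataBytes) {
        long pages = (dataBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        return pages * Integer.BYTES;
    }
}
```

Under these assumptions, indexBytesFor(1024 * 1024) gives 1024 bytes, which matches the estimate in the message.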
>>>
>>>
>>> On 14.07.2015 at 13:02, Timo Boehme wrote:
>>>> Hi,
>>>>
>>>> as I see it (I had only a quick look at the implementation), the
>>>> ScratchFileBuffer implementation is not optimal for fast random access.
>>>> Single-byte writes are not buffered but written directly to the file
>>>> (a lot of I/O operations), and seek operations have to traverse the linked
>>>> page list, reading some bytes of each page (again a lot of seek and read
>>>> I/O operations).
>>>> To speed things up it is crucial to minimize the number of I/O operations
>>>> going directly to the random access file. It is therefore necessary to
>>>> buffer writes, keep the last read page in memory for sequential reads, and
>>>> have an in-memory cache of page metadata (offset, link to previous/next
>>>> page).
>>>>
>>>>
>>>> Best,
>>>> Timo
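
The write-buffering part of the message above can be sketched like this. It is a simplified illustration with hypothetical names (BufferedPageWriter, ioWrites), not the ScratchFileBuffer implementation: single-byte writes are collected in one page-sized array and reach the RandomAccessFile only when the page is full or the writer is closed.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: collects single-byte writes into a page buffer so the
// underlying random access file sees one write per page instead of one per byte.
final class BufferedPageWriter implements Closeable {
    private final RandomAccessFile file;
    private final byte[] page;
    private int used;      // bytes currently held in the page buffer
    private long ioWrites; // number of write calls that reached the file

    BufferedPageWriter(RandomAccessFile file, int pageSize) {
        this.file = file;
        this.page = new byte[pageSize];
    }

    void write(int b) throws IOException {
        page[used++] = (byte) b;
        if (used == page.length) {
            flush();
        }
    }

    /** Writes the buffered bytes in a single I/O operation. */
    void flush() throws IOException {
        if (used > 0) {
            file.write(page, 0, used);
            ioWrites++;
            used = 0;
        }
    }

    long ioWrites() {
        return ioWrites;
    }

    @Override
    public void close() throws IOException {
        flush();
        file.close();
    }
}
```

With a 4096-byte page, writing 10000 single bytes results in three file writes rather than 10000.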


-- 
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: ulf.laube@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber



Re: Performance of the trunkversion

Posted by Manfred Pock <po...@gmail.com>.

Yes, now we have the same performance as the old May version; that will be OK.

However, better parsing/rendering performance would always be welcome.

Thanks and regards, Manfred

On 15.07.2015 at 15:43, Timo Boehme wrote:
> The latest ScratchFile* versions attached to PDFBOX-2882 fix the
> described exception and improve the ScratchFileBuffer.clear() method.
>
> Regarding the speed test of the file provided by Manfred Pock: on my machine
> rendering time is nearly the same when using only the scratch file vs. using
> ScratchFile with main-memory usage allowed vs. not using a scratch file:
> - first run approx. 1.5 sec, further runs (same VM) 0.15 sec
>
> So we might be seeing font or other initialization in the first run, and
> the pure file parsing/rendering time in the subsequent runs.
> @Manfred: are these values what you would expect?
>
>
> Best,
> Timo
>
>
> On 15.07.2015 at 13:03, Timo Boehme wrote:
>> I also wouldn't expect an improvement with 256k bytes of in-memory
>> pages. You should perhaps use a value 1000 times larger, or whatever you
>> would like to use.
>>
>> I will look into the reason for your exception later.
>>
>>
>> Timo
>>
>>
>> On 15.07.2015 at 12:57, Manfred Pock wrote:
>>> Hi Timo,
>>>
>>> I have seen it and tried it again. I set maxInMemoryByteSize to 256000
>>> and I cannot see a real improvement.
>>>
>>> But I got an exception with the attached PDF.
>>>
>>> Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:567)
>>>      at org.apache.pdfbox.rendering.PageDrawer.showTextStrings(PageDrawer.java:297)
>>>      at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>>>      at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>>>      at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>>>      at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>>>      at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>>>
>>> On 15.07.2015 at 12:28, Timo Boehme wrote:
>>>> Hi Manfred,
>>>>
>>>> there is another update of ScratchFile. It is now able to use a
>>>> certain amount of main memory before using the scratch file. Could you
>>>> give it a try? You will have to change the source a bit, since the
>>>> constructor that takes the allowed amount of memory is currently not
>>>> supported by the PDDocument class. Simply change
>>>>
>>>>     public ScratchFile(File scratchFileDirectory) throws IOException
>>>>     {
>>>>         this(scratchFileDirectory, 0);
>>>>     }
>>>>
>>>> to
>>>>     public ScratchFile(File scratchFileDirectory) throws IOException
>>>>     {
>>>>         this(scratchFileDirectory, XXXXXX);
>>>>     }
>>>> where XXXXXX is the amount of main memory, in bytes, to be used for
>>>> buffers.
>>>>
>>>> If you use a larger value and the performance is still not the same as or
>>>> better than the May version, then at least the buffer handling for streams
>>>> is not the problem.
>>>>
>>>>
>>>> Best,
>>>> Timo
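
The idea of using a fixed amount of main memory before falling back to a file can be sketched with JDK classes alone. This is an assumption-laden illustration, not the ScratchFile code from PDFBOX-2882: SpillingBuffer and its maxInMemoryBytes parameter are hypothetical names, and the sketch only covers sequential writes.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: keeps data in memory up to a limit, then moves
// everything written so far to a temp file and continues writing there.
final class SpillingBuffer extends OutputStream {
    private final long maxInMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream fileOut; // null until the limit is exceeded
    private Path file;
    private long size;

    SpillingBuffer(long maxInMemoryBytes) {
        this.maxInMemoryBytes = maxInMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        if (fileOut == null && size + 1 > maxInMemoryBytes) {
            spill();
        }
        if (fileOut != null) {
            fileOut.write(b);
        } else {
            memory.write(b);
        }
        size++;
    }

    /** Moves the in-memory content to a temp file and releases the memory. */
    private void spill() throws IOException {
        file = Files.createTempFile("spill-", ".bin");
        fileOut = new BufferedOutputStream(Files.newOutputStream(file));
        memory.writeTo(fileOut);
        memory = null;
    }

    boolean spilled() {
        return fileOut != null;
    }

    long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
            Files.deleteIfExists(file);
        }
    }
}
```

In this sketch a limit of 0 sends the very first byte to the file, which corresponds to the file-only behaviour, while a larger limit keeps small streams entirely in memory.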
>>>>
>>>>
>>>> On 15.07.2015 at 12:20, Manfred Pock wrote:
>>>>> Hi Timo,
>>>>>
>>>>> I have tested it with different PDFs and the performance is nearly that
>>>>> of the version from May, just a little slower.
>>>>>
>>>>> That is OK, but it would be nice if it performed better ;-)
>>>>>
>>>>> Thanks and regards,
>>>>> Manfred
>>>>>
>>>>> On 15.07.2015 at 10:24, Timo Boehme wrote:
>>>>>> Hi Manfred,
>>>>>>
>>>>>> the issue should be fixed in the updated versions attached to
>>>>>> PDFBOX-2882. Please give them a try.
>>>>>>
>>>>>>
>>>>>> Timo
>>>>>>
>>>>>>

Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

The latest ScratchFile* versions attached to PDFBOX-2882 fix the
described exception and improve the ScratchFileBuffer.clear() method.

Regarding the speed test of the file provided by Manfred Pock: on my machine
rendering time is nearly the same when using only the scratch file vs. using
ScratchFile with main-memory usage allowed vs. not using a scratch file:
- first run approx. 1.5 sec, further runs (same VM) 0.15 sec

So we might be seeing font or other initialization in the first run, and
the pure file parsing/rendering time in the subsequent runs.
@Manfred: are these values what you would expect?


Best,
Timo
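
A measurement like the one above (first run vs. further runs in the same VM) could be taken with a small harness such as the following. It is a generic sketch: WarmupTimer is a hypothetical name, and the task passed in stands for whatever load-and-render call the application uses.

```java
// Hypothetical sketch: times repeated runs of the same task in one JVM so the
// first (cold) run can be compared with the later (warm) runs.
final class WarmupTimer {

    /** Returns the elapsed time of each run in nanoseconds, in run order. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }
}
```

Comparing the first element with the rest separates one-time initialization cost from the steady-state parsing and rendering time.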


Am 15.07.2015 um 13:03 schrieb Timo Boehme:
> I also wouldn't expect an improvement with 256k bytes of in-memory
> pages. You should add possibly an 1000 times larger value - or what you
> would like to use.
>
> I will later have a look into the reason for your exception.
>
>
> Timo
>
>
> Am 15.07.2015 um 12:57 schrieb Manfred Pock:
>> Hi Timo,
>>
>> i have seen and tried it again. I have set maxInMemoryByteSize to 256000
>> and i cannot see a real improvement.
>>
>> But i got an Exception with the appended pdf.
>>
>> Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
>>      at
>> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:567)
>>
>>
>>      at
>> org.apache.pdfbox.rendering.PageDrawer.showTextStrings(PageDrawer.java:297)
>>
>>      at
>> org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>>
>>
>>      at
>> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
>>
>>
>>      at
>> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
>>
>>
>>      at
>> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>>
>>
>>      at
>> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>>
>>
>>      at
>> org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>>      at
>> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>>      at
>> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>>      at
>> org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>>
>>
>>
>> Am 15.07.2015 um 12:28 schrieb Timo Boehme:
>>> Hi Manfred,
>>>
>>> there is another update of ScratchFile. It now is able to use a
>>> certain amount of main memory before using the scratch file. Could you
>>> give it a try? You will have to change the source a bit since the
>>> constructor getting the allowed amount of memory is currently not
>>> supported by PDDocument class. Simply change
>>>
>>>     public ScratchFile(File scratchFileDirectory) throws IOException
>>>     {
>>>         this(scratchFileDirectory, 0);
>>>     }
>>>
>>> to
>>>     public ScratchFile(File scratchFileDirectory) throws IOException
>>>     {
>>>         this(scratchFileDirectory, XXXXXX);
>>>     }
>>> where XXXXXX is the amount of main memory to be used for buffers in
>>> bytes.
>>>
>>> If you use a larger value and the performance still is not same/better
>>> as the May version than at least it is not the problem of the buffer
>>> handling for streams.
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>>> Am 15.07.2015 um 12:20 schrieb Manfred Pock:
>>>> Hi Timo,
>>>>
>>>> i have test it with different pdf's and die performance ist nearly of
>>>> the version from may. Just a little bit slower.
>>>>
>>>> It will be ok, but it will be nice if it will performe better ;-)
>>>>
>>>> thanks and regarts.
>>>> Manfred
>>>>
>>>> Am 15.07.2015 um 10:24 schrieb Timo Boehme:
>>>>> Hi Manfred,
>>>>>
>>>>> the issue should be fixed in the updated versions attached to
>>>>> PDFBOX-2882. Please give them a try.
>>>>>
>>>>>
>>>>> Timo
>>>>>
>>>>>
>>>>> Am 15.07.2015 um 09:51 schrieb Manfred Pock:
>>>>>> Hi Timo,
>>>>>>
>>>>>> i have tried it put it doesn't work now and i get different
>>>>>> exceptions
>>>>>> or Errors
>>>>>>
>>>>>> i looks like that there is a problem with any kind of images, the
>>>>>> rest
>>>>>> will be shown.
>>>>>>
>>>>>> for example:
>>>>>>
>>>>>> SEVERE: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
>>>>>> java.io.IOException: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
>>>>>>      at org.apache.pdfbox.filter.ccitt.TIFFFaxDecoder.decodeT6(TIFFFaxDecoder.java:1125)
>>>>>>      at org.apache.pdfbox.filter.CCITTFaxFilter.decode(CCITTFaxFilter.java:94)
>>>>>>      at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>>>>>>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>>>>>>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>>>>>>      at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:278)
>>>>>>      at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:120)
>>>>>>      at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:67)
>>>>>>      at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:340)
>>>>>>
>>>>>> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
>>>>>> Jul 15, 2015 9:45:05 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
>>>>>> WARNING: java.util.zip.DataFormatException: invalid block type
>>>>>>
>>>>>> Jul 15, 2015 9:46:18 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
>>>>>> WARNING: Not a JPEG file: starts with 0xe0 0x00
>>>>>>
>>>>>> Jul 15, 2015 9:46:23 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
>>>>>> WARNING: Image stream was not read - filter: DCTDecode
>>>>>>
>>>>>> SEVERE: java.util.zip.DataFormatException: invalid distance too far back
>>>>>> java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>>>>>>      at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>>>>>>      at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>>>>>>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>>>>>>      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>>>>>>      at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>>>>>>      at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>>>>>>      at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>>>>>>      at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>>>>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>>>>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>>>>>>      at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>>>>>>      at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>>>>>>      at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>>>>>>      at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>>>>>>      at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>>>>>> .... Caused by: java.util.zip.DataFormatException: invalid distance too far back
>>>>>>      at java.util.zip.Inflater.inflateBytes(Native Method)
>>>>>>      at java.util.zip.Inflater.inflate(Inflater.java:259)
>>>>>>      at java.util.zip.Inflater.inflate(Inflater.java:280)
>>>>>>      at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>>>>>>      at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>>>>>>
>>>>>> On 15.07.2015 at 00:35, Timo Boehme wrote:
>>>>>>> I've created PDFBOX-2882 with a drop-in replacement of the scratch
>>>>>>> file implementation.
>>>>>>> @Manfred: Could you please test if this helps in your scenario to
>>>>>>> increase performance?
>>>>>>>
>>>>>>> Best,
>>>>>>> Timo
>>>>>>>
>>>>>>>
>>>>>>> On 14.07.2015 at 13:47, Timo Boehme wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> instead of having a linked page list in ScratchFileBuffer, I would
>>>>>>>> propose keeping a list of the page numbers (integers) in memory
>>>>>>>> (takes 1k for 1MB of data). This would ease page handling, seeking
>>>>>>>> would not need any I/O operations, and caching of pages would be a
>>>>>>>> lot easier.
>>>>>>>> I may find some time later to come up with such a replacement.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Timo
>>>>>>>>
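
A minimal sketch of the page-list idea quoted above, not the code attached to PDFBOX-2882. It assumes a fixed page size of 4096 bytes; the thread does not state the page size, and 4096 is merely consistent with the "1k for 1MB" estimate at 4 bytes per page number. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory page index for one buffer stored in a shared scratch file. */
final class PageIndexSketch {
    static final int PAGE_SIZE = 4096; // assumed page size, not taken from PDFBox

    // pageNumbers.get(i) is the scratch-file page that holds logical page i of this buffer.
    // A real implementation would probably use an int[] to avoid boxing.
    private final List<Integer> pageNumbers = new ArrayList<>();

    /** Registers the next logical page of this buffer as stored in the given file page. */
    void addPage(int filePageNumber) {
        pageNumbers.add(filePageNumber);
    }

    /** Translates a position inside the buffer into a scratch-file offset, without any I/O. */
    long fileOffset(long position) {
        int logicalPage = (int) (position / PAGE_SIZE);
        int offsetInPage = (int) (position % PAGE_SIZE);
        return (long) pageNumbers.get(logicalPage) * PAGE_SIZE + offsetInPage;
    }

    /** Index size in bytes for the given amount of data, at 4 bytes per page number. */
    static long indexBytesFor(long dataBytes) {
        long pages = (dataBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        return pages * Integer.BYTES;
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addPage(7); // logical page 0 lives in file page 7
        index.addPage(2); // logical page 1 lives in file page 2
        System.out.println(index.fileOffset(5000));     // 2 * 4096 + 904 = 9096
        System.out.println(indexBytesFor(1024 * 1024)); // 256 pages * 4 bytes = 1024
    }
}
```

A seek then becomes one list lookup plus arithmetic, where a linked page list needs one file read per page that has to be skipped.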
>>>>>>>>
>>>>>>>> On 14.07.2015 at 13:02, Timo Boehme wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> as I see it (I had only a quick look at the implementation), the
>>>>>>>>> ScratchFileBuffer implementation is not optimal for fast random
>>>>>>>>> access. Single-byte writes are not buffered but written directly
>>>>>>>>> to the file (a lot of I/O operations), and seek operations have to
>>>>>>>>> traverse the linked page list, reading some bytes of each page
>>>>>>>>> (again a lot of seek and read I/O operations).
>>>>>>>>> To speed things up it is crucial to minimize the number of I/O
>>>>>>>>> operations going directly to the random access file. Therefore we
>>>>>>>>> need to buffer writes, keep the last read page in memory for
>>>>>>>>> sequential reads, and have an in-memory cache of page metadata
>>>>>>>>> (offset, link to previous/next page).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Timo
>>>>>>>>>
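
The write-buffering point above can be illustrated with a small sketch. It is not the ScratchFileBuffer code; the class name, the 4096-byte page size and the write counter are assumptions made for the illustration. Single-byte writes are collected in a page buffer and only whole pages reach the RandomAccessFile.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical sketch: buffer single-byte writes and write whole pages to the file. */
final class BufferedPageWriterSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096; // assumed page size

    private final RandomAccessFile file;
    private final byte[] page = new byte[PAGE_SIZE];
    private int used = 0;       // bytes currently waiting in the page buffer
    private int fileWrites = 0; // number of write calls that actually reached the file

    BufferedPageWriterSketch(File target) {
        try {
            file = new RandomAccessFile(target, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Buffers one byte; the file is only touched when a page is full. */
    void write(int b) {
        page[used++] = (byte) b;
        if (used == PAGE_SIZE) {
            flush();
        }
    }

    /** Writes the buffered bytes to the file in a single call. */
    void flush() {
        if (used == 0) {
            return;
        }
        try {
            file.write(page, 0, used);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        fileWrites++;
        used = 0;
    }

    int fileWrites() {
        return fileWrites;
    }

    @Override
    public void close() {
        flush();
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes count bytes one by one to a temp file and returns the number of file writes. */
    static int demo(int count) {
        File tmp;
        try {
            tmp = File.createTempFile("scratch", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        BufferedPageWriterSketch writer = new BufferedPageWriterSketch(tmp);
        for (int i = 0; i < count; i++) {
            writer.write(i);
        }
        writer.close();
        tmp.delete();
        return writer.fileWrites();
    }

    public static void main(String[] args) {
        System.out.println(demo(10000)); // 3 file writes instead of 10000
    }
}
```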
>>>>>>>>>
>>>>>>>>> On 14.07.2015 at 12:15, Manfred Pock wrote:
>>>>>>>>>> Yes, the input is an InputStream. I can try loading it directly
>>>>>>>>>> from a file.
>>>>>>>>>>
>>>>>>>>>> But in general we get the PDF from a document management system
>>>>>>>>>> as a stream. Does it make sense to save the PDF to a file first?
>>>>>>>>>>
>>>>>>>>>> Why is there such a big performance difference between the
>>>>>>>>>> version from May and the current version when we use
>>>>>>>>>> useScratchFiles = true?
>>>>>>>>>>
>>>>>>>>>> Regards, Manfred
>>>>>>>>>>
>>>>>>>>>> On 14.07.2015 at 12:02, Andreas Lehmkühler wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>> On 14 July 2015 at 11:39, Manfred Pock <po...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> OK, we load the PDF with useScratchFiles = true. If we load it
>>>>>>>>>>>> with false, the performance is better, but still a little bit
>>>>>>>>>>>> slower than the old version.
>>>>>>>>>>> What do you use as input, a stream or a real file? If the latter,
>>>>>>>>>>> you should use the load method with the file parameter.
>>>>>>>>>>>
>>>>>>>>>>> PDFBox needs random access to the PDF, and if a stream is
>>>>>>>>>>> provided, PDFBox copies the data to a file (lower memory usage,
>>>>>>>>>>> slower performance) or to memory (higher memory usage, better
>>>>>>>>>>> performance).
>>>>>>>>>>>
>>>>>>>>>>> BR
>>>>>>>>>>> Andreas
>>>>>>>>>>>
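
A rough sketch of why a stream has to be copied before it can be parsed, using only the JDK and no PDFBox calls. The helper names are hypothetical. The stream is spooled to a temporary file once, after which arbitrary positions can be read with seek, which a plain InputStream does not support.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: spool a stream to a temp file to get random access. */
final class StreamToFileSketch {

    /** Copies the whole stream to a temporary file and returns its path. */
    static Path spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("document", ".pdf");
            // createTempFile already creates the file, so it has to be replaced
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one byte at an arbitrary position of the spooled file. */
    static int byteAt(Path file, long position) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(position);
            return raf.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 example bytes".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(data));
        System.out.println((char) byteAt(tmp, 1)); // P
        Files.delete(tmp);
    }
}
```

If the application saves the stream itself like this, the resulting file can be handed to the load method with the file parameter, as suggested above.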
>>>>>>>>>>>
>>>>>>>>>>>> But now it needs more memory. I cannot load some PDFs with the
>>>>>>>>>>>> current version using the same Java memory configuration.
>>>>>>>>>>>>
>>>>>>>>>>>> On 14.07.2015 at 11:26, Manfred Pock wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> we use the PDFBox trunk version to render PDFs; currently we
>>>>>>>>>>>>> use the version from 12 May 2015.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Today I updated to the current version and tested it. It
>>>>>>>>>>>>> seems that it now needs much more time to render PDFs,
>>>>>>>>>>>>> depending on the size of the PDF.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, you can try this one:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://cloud.directupload.net/15bu
>>>>>>>>>>>>>
>>>>>>>>>>>>> It takes five times longer than the version from May 2015.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards, Manfred


-- 
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: ulf.laube@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

I wouldn't expect an improvement with only 256k bytes of in-memory 
pages either. You should probably try a value about 1000 times larger, 
or whatever amount you would like to use.
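
For a sense of scale, here is a small worked calculation. It assumes 4096-byte pages, which is an assumption for illustration only; the page size is not stated in this message, and the class name is hypothetical.

```java
/** Hypothetical calculation: how many pages fit into a given in-memory budget. */
final class MemoryBudgetSketch {
    static final int PAGE_SIZE = 4096; // assumed page size in bytes

    /** Number of whole pages that fit into the budget. */
    static long pagesFor(long budgetBytes) {
        return budgetBytes / PAGE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(pagesFor(256_000L));     // 62 pages
        System.out.println(pagesFor(256_000_000L)); // 62500 pages
    }
}
```

Under that assumption, 256000 bytes covers only 62 pages, so a PDF with a few megabytes of stream data would still spill almost everything to the scratch file.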

I will later have a look into the reason for your exception.


Timo





Re: Performance of the trunkversion

Posted by Manfred Pock <po...@gmail.com>.

Hi Timo,

I have seen the update and tried it again. I set maxInMemoryByteSize to 
256000 and cannot see any real improvement.

But I got an exception with the attached PDF.

Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:567)
     at org.apache.pdfbox.rendering.PageDrawer.showTextStrings(PageDrawer.java:297)
     at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)


Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

Hi Manfred,

there is another update of ScratchFile. It can now use a certain amount 
of main memory before falling back to the scratch file. Could you give it 
a try? You will have to change the source a bit, since the constructor 
that takes the allowed amount of memory is not yet supported by the 
PDDocument class. Simply change

     public ScratchFile(File scratchFileDirectory) throws IOException
     {
         this(scratchFileDirectory, 0);
     }

to
     public ScratchFile(File scratchFileDirectory) throws IOException
     {
         this(scratchFileDirectory, XXXXXX);
     }
where XXXXXX is the amount of main memory, in bytes, to be used for buffers.

If you use a larger value and the performance is still not the same as or 
better than the May version, then at least the buffer handling for 
streams is not the problem.
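
The idea of using main memory first and the scratch file only afterwards can be sketched as follows. This is a minimal illustration and not the PDFBox ScratchFile code: the class MemoryFirstPageStore, its methods and the 4096-byte page size are all assumptions made for the example. The constructor argument maxMainMemoryBytes plays the role of XXXXXX above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; this is NOT the PDFBox ScratchFile class. */
public class MemoryFirstPageStore implements AutoCloseable {
    public static final int PAGE_SIZE = 4096; // assumed page size

    private final int maxPagesInMemory;
    private final List<byte[]> memoryPages = new ArrayList<>();
    private final File scratchFile;
    private RandomAccessFile raf; // opened only once the memory budget is used up
    private int pageCount;

    public MemoryFirstPageStore(File scratchFile, long maxMainMemoryBytes) {
        this.scratchFile = scratchFile;
        long pages = Math.max(0L, maxMainMemoryBytes / PAGE_SIZE);
        this.maxPagesInMemory = (int) Math.min(Integer.MAX_VALUE, pages);
    }

    /** Stores a page and returns its index. */
    public int appendPage(byte[] page) throws IOException {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("page must be " + PAGE_SIZE + " bytes");
        }
        int index = pageCount++;
        if (index < maxPagesInMemory) {
            memoryPages.add(page.clone()); // still within the memory budget
        } else {
            if (raf == null) {
                raf = new RandomAccessFile(scratchFile, "rw");
            }
            // The file offset is computed from the index, so no page chain is walked.
            raf.seek((long) (index - maxPagesInMemory) * PAGE_SIZE);
            raf.write(page);
        }
        return index;
    }

    /** Returns a copy of the page with the given index. */
    public byte[] readPage(int index) throws IOException {
        if (index < 0 || index >= pageCount) {
            throw new IndexOutOfBoundsException("no such page: " + index);
        }
        if (index < maxPagesInMemory) {
            return memoryPages.get(index).clone();
        }
        byte[] page = new byte[PAGE_SIZE];
        raf.seek((long) (index - maxPagesInMemory) * PAGE_SIZE);
        raf.readFully(page);
        return page;
    }

    /** True once at least one page has been written to the scratch file. */
    public boolean usesFile() {
        return raf != null;
    }

    @Override
    public void close() throws IOException {
        if (raf != null) {
            raf.close();
        }
    }
}
```

With a budget of 0 every page goes to the file, which corresponds to the unchanged constructor; with a large budget small documents never touch the disk at all.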


Best,
Timo


-- 
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: ulf.laube@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: Performance of the trunkversion

Posted by Manfred Pock <po...@gmail.com>.

Hi Timo,

I have tested it with different PDFs, and the performance is nearly that 
of the version from May, just a little bit slower.

That is OK, but it would be nice if it performed better ;-)

Thanks and regards,
Manfred


Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

Hi Manfred,

the issue should be fixed in the updated versions attached to 
PDFBOX-2882. Please give them a try.


Timo




Re: Performance of the trunkversion

Posted by Manfred Pock <po...@gmail.com>.

Hi Timo,

I have tried it, but it doesn't work now, and I get various exceptions 
and errors.

It looks like there is a problem with all kinds of images; the rest is 
rendered.

For example:

SEVERE: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
java.io.IOException: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
     at org.apache.pdfbox.filter.ccitt.TIFFFaxDecoder.decodeT6(TIFFFaxDecoder.java:1125)
     at org.apache.pdfbox.filter.CCITTFaxFilter.decode(CCITTFaxFilter.java:94)
     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
     at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:278)
     at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:120)
     at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:67)
     at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:340)

SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 15, 2015 9:45:05 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
WARNING: java.util.zip.DataFormatException: invalid block type

Jul 15, 2015 9:46:18 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
WARNING: Not a JPEG file: starts with 0xe0 0x00

Jul 15, 2015 9:46:23 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
WARNING: Image stream was not read - filter: DCTDecode

SEVERE: java.util.zip.DataFormatException: invalid distance too far back
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
.... Caused by: java.util.zip.DataFormatException: invalid distance too far back
     at java.util.zip.Inflater.inflateBytes(Native Method)
     at java.util.zip.Inflater.inflate(Inflater.java:259)
     at java.util.zip.Inflater.inflate(Inflater.java:280)
     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
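
For readers unfamiliar with this kind of failure: java.util.zip.Inflater throws a DataFormatException whenever the bytes it is given are no longer a valid deflate stream. The log above does not say where the bytes were damaged, so the following is only a sketch of the symptom, using the JDK alone. The class CorruptFlateDemo and its helper methods are made up for this example, and overwriting the two header bytes is just one assumed way for the data to be damaged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical demo class; it uses only java.util.zip and no PDFBox code. */
public class CorruptFlateDemo {

    /** Compresses a small input in one pass (the buffer is sized generously for small data). */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length + 64];
        int length = deflater.deflate(buffer);
        deflater.end();
        return Arrays.copyOf(buffer, length);
    }

    /** Decompresses into a buffer of the expected size. */
    public static byte[] inflate(byte[] compressed, int expectedLength)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[expectedLength];
            int length = inflater.inflate(out);
            return Arrays.copyOf(out, length);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "stream data stream data stream data"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(original);

        // The untouched bytes decompress back to the original.
        System.out.println(Arrays.equals(original, inflate(compressed, original.length)));

        // Simulate a buffer that hands back wrong bytes: overwrite the stream header.
        byte[] damaged = compressed.clone();
        damaged[0] = 0;
        damaged[1] = 0;
        try {
            inflate(damaged, original.length);
        } catch (DataFormatException e) {
            System.out.println("DataFormatException: " + e.getMessage());
        }
    }
}
```

Damage further inside the stream can surface with different messages, so the exact text of the exception says little about where the bytes went wrong.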

On 15.07.2015 at 00:35, Timo Boehme wrote:
> I've created PDFBOX-2882 with a drop-in replacement of the scratch 
> file implementation.
> @Manfred: Could you please test whether this helps to increase 
> performance in your scenario?
>
> Best,
> Timo
>
>
> On 14.07.2015 at 13:47, Timo Boehme wrote:
>> Hi,
>>
>> instead of having a linked page list in ScratchFileBuffer, I would
>> propose keeping a list of pages with the page numbers (integers) in
>> memory (takes 1k for 1MB of data). This would ease page handling, seeking
>> would not need I/O operations, and caching of pages would be a lot easier.
>> I may find some time later to come up with such a replacement.
>>
>> Best,
>> Timo
>>
>>
>> Am 14.07.2015 um 13:02 schrieb Timo Boehme:
>>> Hi,
>>>
>>> as I see it (had only a quick look at the implementation) the
>>> ScratchFileBuffer implementation is not optimal for fast random access.
>>> Single writes of bytes are not buffered but directly written to the 
>>> file
>>> - a lot of I/O-operations) and seek operations have to travel the 
>>> linked
>>> page list reading some bytes of each page - again a lot of seek and 
>>> read
>>> I/O-operations.
>>> To speed things up it is crucial to minimize the number of
>>> I/O-operations directly going to the random access file. Therefore 
>>> it is
>>> needed to buffer writes, keep last read page in memory for sequential
>>> reads and have an in-memory cache of page meta data (offset, link to
>>> previous/next page).
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>>> Am 14.07.2015 um 12:15 schrieb Manfred Pock:
>>>> Yes, the input is a inputstream. I can try it direct from file.
>>>>
>>>> But in general we get the pdf from an document management system as
>>>> stream.
>>>> Does make sense that i save the pdf to file before?
>>>>
>>>> Why is there so an big performance difference beetween the version 
>>>> from
>>>> May and the current version, if we use it with useScratchFiles = 
>>>> true ?
>>>>
>>>> regarts, Manfred
>>>>
>>>> Am 14.07.2015 um 12:02 schrieb Andreas Lehmkühler:
>>>>> Hi,
>>>>>
>>>>>> Manfred Pock <po...@gmail.com> hat am 14. Juli 2015 um 11:39
>>>>>> geschrieben:
>>>>>>
>>>>>>
>>>>>> Ok, we load the pdf with useScratchFiles = true, if we load them 
>>>>>> with
>>>>>> false the performance is better, but a little bit slower than the 
>>>>>> old
>>>>>> one.
>>>>> What do you use as input, a stream or a real file? If the latter you
>>>>> should use
>>>>> the load method with the file parameter.
>>>>>
>>>>> PDFBox needs ramdom access to the pdf and if a stream is provided
>>>>> PDFBox copies
>>>>> the data to a file (lower memory usage, slower performance) or to the
>>>>> memory
>>>>> (higher memory usage, better performance).
>>>>>
>>>>> BR
>>>>> Andreas
>>>>>
>>>>>
>>>>>> But now it need more memory. I cannot load some pdfs with the 
>>>>>> current
>>>>>> version with the same java-memory configuration.
>>>>>>
>>>>>> Am 14.07.2015 um 11:26 schrieb Manfred Pock:
>>>>>>> Hi,
>>>>>>>
>>>>>>> we use the Pdfbox-trunkversion to render pdf's, currently we use 
>>>>>>> the
>>>>>>> version from 12. May 2015.
>>>>>>>
>>>>>>> Today i have done an update to the current version and have test 
>>>>>>> it.
>>>>>>> It seems to be that it need now much more time to render pdf's, it
>>>>>>> depends of the size of the pdf.
>>>>>>>
>>>>>>> for example you can try this one:
>>>>>>>
>>>>>>> http://cloud.directupload.net/15bu
>>>>>>>
>>>>>>> It need five times more then the version from May 2015.
>>>>>>>
>>>>>>> regarts, Manfred
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
>>>>
>>>
>>>
>>
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

I've created PDFBOX-2882 with a drop-in replacement of the scratch file 
implementation.
@Manfred: Could you please test if this helps in your scenario to 
increase performance?

Best,
Timo




-- 
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: ulf.laube@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber



Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

Hi,

Instead of a linked page list in ScratchFileBuffer, I would propose 
keeping a list of the page numbers (integers) in memory (this takes about 
1 KB per 1 MB of data). This would simplify page handling: seeking would 
need no I/O operations, and caching pages would be much easier.
I may find some time later to come up with such a replacement.
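
The proposal above can be illustrated with a minimal sketch. This is not the code attached to PDFBOX-2882; the class name PageIndex, its methods, and the 4096-byte page size are assumptions made for illustration only. With 4-byte page numbers and 4 KB pages, the index for 1 MB of data takes 256 * 4 = 1024 bytes, which matches the figure quoted above.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: maps the logical offsets of one buffer to pages of a
 * shared scratch file using an in-memory array of page numbers.
 * Names and page size are illustrative, not taken from PDFBox.
 */
final class PageIndex {
    private final int pageSize;
    private int[] pageNumbers = new int[16];
    private int pageCount = 0;

    PageIndex(int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.pageSize = pageSize;
    }

    /** Records that the next logical page of this buffer lives in the given scratch-file page. */
    void addPage(int scratchFilePageNumber) {
        if (pageCount == pageNumbers.length) {
            pageNumbers = Arrays.copyOf(pageNumbers, pageCount * 2);
        }
        pageNumbers[pageCount++] = scratchFilePageNumber;
    }

    /** Translates a logical position into an absolute scratch-file offset without any I/O. */
    long fileOffset(long logicalPosition) {
        if (logicalPosition < 0 || logicalPosition >= (long) pageCount * pageSize) {
            throw new IndexOutOfBoundsException("position " + logicalPosition);
        }
        int pageIdx = (int) (logicalPosition / pageSize);
        int offsetInPage = (int) (logicalPosition % pageSize);
        return (long) pageNumbers[pageIdx] * pageSize + offsetInPage;
    }

    /** Memory used by the index itself, at 4 bytes per page number. */
    long indexBytes() {
        return 4L * pageCount;
    }
}
```

A seek then becomes one division and one array lookup, instead of walking a linked list of pages stored in the file itself.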

Best,
Timo




-- 
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: ulf.laube@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber



Re: Performance of the trunkversion

Posted by Timo Boehme <ti...@ontochem.com>.

Hi,

As I see it (I had only a quick look at the implementation), the 
ScratchFileBuffer implementation is not optimal for fast random access. 
Single-byte writes are not buffered but written directly to the file 
(a lot of I/O operations), and seek operations have to traverse the linked 
page list, reading some bytes of each page (again a lot of seek and read 
I/O operations).
To speed things up, it is crucial to minimize the number of I/O operations 
that go directly to the random access file. It is therefore necessary to 
buffer writes, keep the last read page in memory for sequential reads, and 
have an in-memory cache of page metadata (offset, link to previous/next 
page).
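
The write-buffering point can be sketched roughly as follows. This is an illustration only, not the ScratchFileBuffer code: BufferedPageWriter, BufferedPageWriterDemo, and the page size used are hypothetical names and values. Bytes are collected in an in-memory page and only reach the RandomAccessFile when the page is full or the writer is closed.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical sketch: buffers single-byte writes and flushes one page at a time. */
final class BufferedPageWriter implements Closeable {
    private final RandomAccessFile file;
    private final byte[] page;
    private int used = 0;
    private long ioWrites = 0; // number of write calls that actually hit the file

    BufferedPageWriter(File target, int pageSize) throws IOException {
        this.file = new RandomAccessFile(target, "rw");
        this.page = new byte[pageSize];
    }

    /** Stores one byte in memory; touches the file only when the page is full. */
    void write(int b) throws IOException {
        page[used++] = (byte) b;
        if (used == page.length) {
            flush();
        }
    }

    /** Writes the buffered bytes to the file in a single I/O operation. */
    void flush() throws IOException {
        if (used > 0) {
            file.write(page, 0, used);
            ioWrites++;
            used = 0;
        }
    }

    long ioWrites() {
        return ioWrites;
    }

    @Override
    public void close() throws IOException {
        try {
            flush();
        } finally {
            file.close();
        }
    }
}

final class BufferedPageWriterDemo {
    /** Writes totalBytes single bytes to a temp file and reports how many file writes occurred. */
    static long writesNeeded(int totalBytes, int pageSize) {
        try {
            File tmp = File.createTempFile("scratch-", ".bin");
            BufferedPageWriter writer = new BufferedPageWriter(tmp, pageSize);
            try (BufferedPageWriter w = writer) {
                for (int i = 0; i < totalBytes; i++) {
                    w.write(i & 0xFF);
                }
            } finally {
                tmp.delete();
            }
            return writer.ioWrites();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 10,000 single-byte writes with 4096-byte pages: two full pages plus one partial flush.
        System.out.println(writesNeeded(10_000, 4096)); // prints 3
    }
}
```

Without the buffer, the same 10,000 bytes would cause 10,000 separate writes to the random access file.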


Best,
Timo




-- 
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: ulf.laube@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber



Re: Performance of the trunkversion

Posted by Manfred Pock <po...@gmail.com>.

Yes, the input is an InputStream. I can try it directly from a file.

But in general we get the PDF from a document management system as a stream.
Does it make sense to save the PDF to a file first?

Why is there such a big performance difference between the version from 
May and the current version when we use useScratchFiles = true?

Regards, Manfred




Re: Performance of the trunkversion

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

> Manfred Pock <po...@gmail.com> wrote on 14 July 2015 at 11:39:
> 
> 
> OK, we load the PDF with useScratchFiles = true. If we load it with 
> false, the performance is better, but still a little slower than the old one.
What do you use as input, a stream or a real file? If the latter, you should use
the load method with the file parameter.

PDFBox needs random access to the PDF, and if a stream is provided, PDFBox copies
the data to a file (lower memory usage, slower performance) or to memory
(higher memory usage, better performance).
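
If the PDF only arrives as a stream, one way to follow this advice is to spool the stream to a temporary file yourself and then use the file-based load method. The sketch below uses only the JDK; StreamSpooler and spoolToTempFile are hypothetical names, and the call into PDFBox is left as a comment because the exact load signature depends on the PDFBox version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: copies an InputStream to a temp file so it can be opened with random access. */
final class StreamSpooler {
    static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("pdf-input-", ".pdf");
            // Sequential copy; afterwards the data can be read with seek operations.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the stream delivered by a document management system.
        InputStream in = new ByteArrayInputStream(new byte[] {'%', 'P', 'D', 'F'});
        Path file = spoolToTempFile(in);
        // The resulting file would then be passed to the file-based load method,
        // e.g. PDDocument.load(file.toFile()) in PDFBox 2.x (version-dependent).
        System.out.println(file.toFile().length());
        file.toFile().delete();
    }
}
```

The caller is responsible for deleting the temporary file once the document has been closed.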

BR
Andreas




Re: Performance of the trunkversion

Posted by Manfred Pock <po...@gmail.com>.

OK, we load the PDF with useScratchFiles = true. If we load it with 
false, the performance is better, but still a little slower than the old 
version.

But now it needs more memory: I cannot load some PDFs with the current 
version using the same Java memory configuration.
