Posted to users@pdfbox.apache.org by Andrea Vacondio <an...@gmail.com> on 2015/02/27 16:34:28 UTC

Xref parsing performance

Hi,
a few days ago I was profiling PDFBox while loading medium/large
documents and I think I found something.
If you try loading the document
http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
it takes quite some time, and most of that time is spent in
XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
an object contained in an unparsed object stream is found, the
XrefTrailerResolver performs a full scan of the xref entries found in the
document, in this case hundreds of thousands. If there are many object
streams (as in the given doc), it performs many full scans, resulting in poor
performance.
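
To illustrate the cost described above, here is a minimal sketch of one way to avoid the repeated scans: group the compressed entries by their containing object stream once, then answer each lookup from that index. It is not the PDFBox or sambox code. The class name ObjectStreamIndex, its methods, and the simplified input model (a map from object number to the number of the object stream that contains it) are all assumptions made for the example. The syntax is kept JDK6-compatible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the PDFBox or sambox implementation. */
final class ObjectStreamIndex {

    // object stream number -> numbers of the objects stored inside it
    private final Map<Long, List<Long>> byStream = new HashMap<Long, List<Long>>();

    /**
     * Builds the index with a single pass over the compressed xref entries.
     * The input maps each compressed object's number to the number of the
     * object stream that contains it (an assumed, simplified model).
     */
    ObjectStreamIndex(Map<Long, Long> containingStreamByObject) {
        for (Map.Entry<Long, Long> e : containingStreamByObject.entrySet()) {
            List<Long> contained = byStream.get(e.getValue());
            if (contained == null) {
                contained = new ArrayList<Long>();
                byStream.put(e.getValue(), contained);
            }
            contained.add(e.getKey());
        }
    }

    /** Hash lookup per object stream instead of a full scan per object stream. */
    List<Long> containedObjectNumbers(long streamNumber) {
        List<Long> contained = byStream.get(streamNumber);
        if (contained == null) {
            return Collections.<Long>emptyList();
        }
        return Collections.unmodifiableList(contained);
    }

    /** The per-call full scan, kept only for comparison. */
    static List<Long> scan(Map<Long, Long> containingStreamByObject, long streamNumber) {
        List<Long> contained = new ArrayList<Long>();
        for (Map.Entry<Long, Long> e : containingStreamByObject.entrySet()) {
            if (e.getValue().longValue() == streamNumber) {
                contained.add(e.getKey());
            }
        }
        return contained;
    }
}
```

With N xref entries and S object streams, calling scan once per stream costs on the order of N*S entry visits, while building the index visits each entry once and each later lookup is a single map access.
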
I'm trying to get familiar with the PDFBox code and I decided to try and
fix this here: https://github.com/torakiki/sambox/tree/xref
As you can see, I refactored a bit, extracting some classes, and covered the
expected behaviour with unit tests. I tested it with a few random docs, loading
and saving them back, and the output is exactly the same with or without my
changes. The pdf_reference_1-7.pdf doc loads in half the time, and so does
this one:
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Other kinds of docs load in a comparable amount of time, and from profiling,
memory usage seems comparable if not a little lower.
Maybe someone wants to take a look?

I understand my changes look a bit invasive and the issue could probably be
fixed differently. On the other hand, the BaseParser+COSParser pair looks
like a big intimidating monster to a newcomer like me and it's quite
difficult to follow the expected behaviour, so I thought this might be a
chance to start breaking them down into smaller, distilled classes...
something a little more manageable and testable... anyway, grab what you
like, leave what you don't :)

Re: Xref parsing performance

Posted by Andrea Vacondio <an...@gmail.com>.

mmm... are you using the tip of the "xref" branch? Because it shouldn't use
any JDK7 stuff and it compiles and runs fine on my machine. I'm using
Ubuntu and jdk1.6.0_45 and I get:
[INFO] BUILD SUCCESS

I changed the generation number to int because in the xref table it's a
5-digit number, so it fits in an int. According to the spec, object number and
generation number are both integers (as opposed to real numbers), but I
don't think the spec distinguishes between int and long. So, while the gen
number can be at most 99999, I couldn't find any limit on the object number,
so I left it as a long.
I noticed there's currently some work going on in the COSParser, but I was
already playing with these changes so I finished them. I actually posted
them here more as a starting point for discussion... to see what you guys
think of this kind of refactoring/patch. I'm quite new to PDFBox (not to
Java and PDF) and I'm trying to understand what is welcome and what
is not :)
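
The int/long choice explained in this message can be sketched as a small key type. This is only an illustration under the assumptions stated in the message (generation limited to the 5-digit xref table field, no assumed upper bound on the object number). The class ObjectKey and its members are hypothetical names, and PDFBox's own COSObjectKey may look different.

```java
/** Hypothetical key type; PDFBox's own COSObjectKey may differ. */
final class ObjectKey {

    // Largest value that fits the 5-digit generation field of an xref table entry.
    static final int MAX_GENERATION = 99999;

    private final long number;    // no upper bound assumed, so kept as a long
    private final int generation; // at most 5 digits, so an int is enough

    ObjectKey(long number, int generation) {
        if (number < 0) {
            throw new IllegalArgumentException("negative object number: " + number);
        }
        if (generation < 0 || generation > MAX_GENERATION) {
            throw new IllegalArgumentException("generation out of range: " + generation);
        }
        this.number = number;
        this.generation = generation;
    }

    long getNumber() {
        return number;
    }

    int getGeneration() {
        return generation;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ObjectKey)) {
            return false;
        }
        ObjectKey that = (ObjectKey) other;
        return number == that.number && generation == that.generation;
    }

    @Override
    public int hashCode() {
        // Fold the 64-bit number into 32 bits, then mix in the generation.
        int h = (int) (number ^ (number >>> 32));
        return 31 * h + generation;
    }
}
```

Keeping the object number as a long costs a few bytes per key but accepts values above Integer.MAX_VALUE, which is the conservative side of the trade-off discussed in this thread.
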

On Sat, Feb 28, 2015 at 4:47 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi Andrea,
>
> While a speed improvement in parsing of large files would be much
> appreciated (especially by the TIKA users), there are several problems with
> your change:
>
> - don't do changes that need JDK7 or higher even if they are cool. We use
> JDK6 currently.
>
> - regressions:
>
> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> Error converting file PDFBOX-2599.pdf
> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>     at junit.textui.TestRunner.start(TestRunner.java:183)
>     at junit.textui.TestRunner.main(TestRunner.java:137)
>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> - why change only one of the members of the COSObjectKey class to int?
> According to the spec, both are integers. Maybe there's a good reason, but
> I'd like to know.
>
> - even if you get rid of the regressions, a remaining problem is that
>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>    - your change is too big to evaluate (I'm speaking only for myself
> there). It would be better to first submit only small refactorings in
> PDFBOX-2576, and then the optimization you mention (or the other way
> around). The parser is indeed a tricky part of the code (And SonarQube and
> Software Diagnostics have also flagged it as too complex). I did some
> refactorings a few weeks ago there (splitting methods), but stopped because
> I couldn't come up with names for the new methods. I just didn't understand
> what they were doing.
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>

Re: Xref parsing performance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 28.02.2015 at 19:54, Andreas Lehmkuehler wrote:
> On 28.02.2015 at 18:34, Maruan Sahyoun wrote:
>>
>> On 28.02.2015 at 18:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>
>>> On 28.02.2015 at 18:07, Maruan Sahyoun wrote:
>>>> Hi,
>>>>
>>>> On 28.02.2015 at 17:53, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>>>
>>>>> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>>>>>> Hi Andrea,
>>>>>>>>
>>>>>>>> While a speed improvement in parsing of large files would be much
>>>>>>>> appreciated
>>>>>>>> (especially by the TIKA users), there are several problems with your
>>>>>>>> change:
>>>>>>> +1
>>>>>>>
>>>>>>>> - don't do changes that need JDK7 or higher even if they are cool. We
>>>>>>>> use JDK6
>>>>>>>> currently.
>>>>>>>>
>>>>>>>> - regressions:
>>>>>>>>
>>>>>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>>>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> Error converting file PDFBOX-2599.pdf
>>>>>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> - why change only one of the members of the COSObjectKey class to int?
>>>>>>>> According to the spec, both are integers. Maybe there's a good reason,
>>>>>>>> but I'd like to know.
>>>>>>> AFAIK there is no good reason not to change both to int.
>>>>>>
>>>>>> As the offset is a 10-digit number, is that really covered by an int?
>>>>> It's about the object number, not the offset. We are using a long for the
>>>>> offset. The spec is quite clear about those numbers. They have to be
>>>>> integers and the max value for an integer within a PDF is 2^31-1 due to the
>>>>> fact that the assumed default platform for a conforming reader should be
>>>>> 32-bit.
>>>>>
>>>>> BTW, I've changed the object/generation number to int.
>>>>
>>>> Yes, but that's a "should" in the spec and not a "shall", so it's
>>>> recommended but might not be followed.
>>> Hmm, those values shall be integers and integers should be 32 bit. So, do we
>>> really have to be afraid that someone will exceed that limit?
>>
>> I've yet to come across such a file, but Annex C talks about minimum
>> architectural limits. So as we are changing from long to int we might be
>> taking a risk we haven't taken before. BTW, one of my customers is producing
>> PDFs in the (low) GB size range :-)
> OK, I'm going to revert the change for the object number to be on the safe side ...
Done. I've refactored the handling of the object and generation number within
COSObject as well.

BR
Andreas
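
The point in the quoted exchange that a 10-digit offset does not fit an int can be made concrete with a small parser for one classic xref table entry (10-digit field, space, 5-digit generation, space, 'n' or 'f'). This is a minimal sketch, not the PDFBox parser. The class XrefTableEntry and its field names are hypothetical, and it ignores xref streams entirely.

```java
/** Hypothetical parser for a single classic xref table entry; not PDFBox code. */
final class XrefTableEntry {

    // For an in-use entry this is the byte offset of the object. For a free
    // entry the same field holds the number of the next free object.
    final long offset;     // 10 decimal digits, up to 9999999999
    final int generation;  // 5 decimal digits, up to 99999
    final boolean inUse;   // 'n' = in use, 'f' = free

    private XrefTableEntry(long offset, int generation, boolean inUse) {
        this.offset = offset;
        this.generation = generation;
        this.inUse = inUse;
    }

    /** Parses e.g. "0000000017 00000 n" (end-of-line characters are trimmed). */
    static XrefTableEntry parse(String line) {
        String s = line.trim();
        if (s.length() != 18 || s.charAt(10) != ' ' || s.charAt(16) != ' ') {
            throw new IllegalArgumentException("malformed xref entry: " + line);
        }
        requireDigits(s, 0, 10);
        requireDigits(s, 11, 16);
        char type = s.charAt(17);
        if (type != 'n' && type != 'f') {
            throw new IllegalArgumentException("unknown entry type: " + type);
        }
        // A 10-digit value can exceed Integer.MAX_VALUE (2147483647), hence long.
        long offset = Long.parseLong(s.substring(0, 10));
        int generation = Integer.parseInt(s.substring(11, 16));
        return new XrefTableEntry(offset, generation, type == 'n');
    }

    private static void requireDigits(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("digit expected at index " + i);
            }
        }
    }
}
```

For example, parsing "9999999999 00000 n" yields an offset that Integer.parseInt would reject, while the generation field always fits an int.
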
>
> Andreas
>>
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>>
>>>>>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>>>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>>>>>> That's not a problem. For now I'm focused on the parsing process itself
>>>>>>> and am working on one last piece, the rebuild mechanism.
>>>>>>>
>>>>>>>>     - your change is too big to evaluate (I'm speaking only for myself
>>>>>>>> there).
>>>>>>>> It would be better to first submit only small refactorings in
>>>>>>>> PDFBOX-2576, and
>>>>>>>
>>>>>>> I agree. We should try to break up the patch into smaller pieces if
>>>>>>> possible. Let's start with the long -> int change
>>>>>>>
>>>>>>>> then the optimization you mention (or the other way around). The parser is
>>>>>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics
>>>>>>>> have
>>>>>>>> also flagged it as too complex). I did some refactorings a few weeks ago
>>>>>>>> there
>>>>>>>> (splitting methods), but stopped because I couldn't come up with names
>>>>>>>> for the
>>>>>>>> new methods. I just didn't understand what they were doing.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>
>>>>>>> BR
>>>>>>> Andreas Lehmkühler
>>>>>>>

Re: Xref parsing performance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 28.02.2015 at 18:34, Maruan Sahyoun wrote:
>
> On 28.02.2015 at 18:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>
>> On 28.02.2015 at 18:07, Maruan Sahyoun wrote:
>>> Hi,
>>>
>>> On 28.02.2015 at 17:53, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>>
>>>> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>>>>> Hi,
>>>>>
>>>>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>>>>> Hi Andrea,
>>>>>>>
>>>>>>> While a speed improvement in parsing of large files would be much appreciated
>>>>>>> (especially by the TIKA users), there are several problems with your change:
>>>>>> +1
>>>>>>
>>>>>>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>>>>>>> currently.
>>>>>>>
>>>>>>> - regressions:
>>>>>>>
>>>>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>>>> [...]
>>>>>>>
>>>>>>> Error converting file PDFBOX-2599.pdf
>>>>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>>>> [...]
>>>>>>>
>>>>>>> - why change only one of the members of the COSObjectKey class to int?
>>>>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>>>>> like to know.
>>>>>> AFAIK there is no good reason not to change both to int.
>>>>>
>>>>> As the offset is a 10-digit number, is that really covered by an int?
>>>> It's about the object number, not the offset. We are using a long for the offset. The spec is quite clear about those numbers. They have to be integers and the max value for an integer within a PDF is 2^31-1 due to the fact that the assumed default platform for a conforming reader should be 32-bit.
>>>>
>>>> BTW, I've changed the object/generation number to int.
>>>
>>> Yes, but that's a "should" in the spec and not a "shall", so it's recommended but might not be followed.
>> Hmm, those values shall be integers and integers should be 32 bit. So, do we really have to be afraid that someone will exceed that limit?
>
> I've yet to come across such a file, but Annex C talks about minimum architectural limits. So as we are changing from long to int we might be taking a risk we haven't taken before. BTW, one of my customers is producing PDFs in the (low) GB size range :-)
OK, I'm going to revert the change for the object number to be on the safe side ...

Andreas
>
>>
>>>
>>>
>>>>
>>>>>
>>>>> BR
>>>>> Maruan
>>>>>
>>>>>>
>>>>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>>>>> That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.
>>>>>>
>>>>>>>     - your change is too big to evaluate (I'm speaking only for myself there).
>>>>>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>>>>>
>>>>>> I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change
>>>>>>
>>>>>>> then the optimization you mention (or the other way around). The parser is
>>>>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>>>>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>>>>>> (splitting methods), but stopped because I couldn't come up with names for the
>>>>>>> new methods. I just didn't understand what they were doing.
>>>>>>>
>>>>>>> Tilman
>>>>>>
>>>>>> BR
>>>>>> Andreas Lehmkühler
>>>>>>

Re: Xref parsing performance

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

On 28.02.2015 at 18:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> On 28.02.2015 at 18:07, Maruan Sahyoun wrote:
>> Hi,
>> 
>> On 28.02.2015 at 17:53, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>> 
>>> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>>>> Hi,
>>>> 
>>>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>>> 
>>>>> Hi
>>>>> 
>>>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>>>> Hi Andrea,
>>>>>> 
>>>>>> While a speed improvement in parsing of large files would be much appreciated
>>>>>> (especially by the TIKA users), there are several problems with your change:
>>>>> +1
>>>>> 
>>>>>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>>>>>> currently.
>>>>>> 
>>>>>> - regressions:
>>>>>> 
>>>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>>> 
>>>>>> Error converting file PDFBOX-2599.pdf
>>>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>>> 
>>>>>> - why change only one of the members of that cosobjectkey class to int?
>>>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>>>> like to know.
>>>>> AFAIK there is no good reason not to change both to int.
>>>> 
>>>> As the offset is a 10-digit number, is that really covered by an int?
>>> It's about the object number, not the offset. We are using a long for the offset. The spec is quite clear about those numbers: they have to be integers, and the maximum value for an integer within a PDF is 2^31-1, because the assumed default platform for a conforming reader should be 32-bit.
>>> 
>>> BTW, I've changed the object/generation number to int.
>> 
>> Yes, but that's a "should" in the spec and not a "shall", so it's recommended but might not be followed.
> Hmm, those values shall be integers, and integers should be 32-bit. So do we really have to be afraid that someone will exceed that limit?

I have yet to come across such a file, but Annex C talks about minimum architectural limits. So by changing from long to int we might take on a risk that wasn't there before. BTW, one of my customers is producing PDFs in the (low) GB size range :-)
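
To put numbers on the int-versus-long question, here is a minimal Java sketch. The class name LimitsSketch and the sample values are made up for illustration and are not PDFBox code. It only shows that an object number at the 2^31-1 limit still fits an int, while a 10-digit xref offset, or an offset past 2 GiB in a multi-GB file, does not.

```java
public class LimitsSketch {
    // Largest value a 10-digit xref offset field can hold (illustrative constant).
    static final long MAX_TEN_DIGIT_OFFSET = 9999999999L;

    // True if the value can be stored in a Java int without loss.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // 2^31 - 1, the integer limit discussed above.
        System.out.println(Integer.MAX_VALUE);
        // An object number at the limit still fits an int.
        System.out.println(fitsInInt(2147483647L));
        // A 10-digit offset does not, which is why offsets stay long.
        System.out.println(fitsInInt(MAX_TEN_DIGIT_OFFSET));
        // Neither does a byte offset at 3 GiB in a multi-GB file.
        System.out.println(fitsInInt(3L * 1024 * 1024 * 1024));
    }
}
```

So the file size alone is not the problem as long as offsets remain long; only an object or generation number above 2^31-1 would be.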

> 
>> 
>> 
>>> 
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>>> 
>>>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>>>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>>>> That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.
>>>>> 
>>>>>>    - your change is too big to evaluate (I'm speaking only for myself there).
>>>>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>>>> 
>>>>> I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change
>>>>> 
>>>>>> then the optimization you mention (or the other way around). The parser is
>>>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>>>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>>>>> (splitting methods), but stopped because I couldn't come up with names for the
>>>>>> new methods. I just didn't understand what they were doing.
>>>>>> 
>>>>>> Tilman
>>>>> 
>>>>> BR
>>>>> Andreas Lehmkühler
>>>>> 

Re: Xref parsing performance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 28.02.2015 at 18:07, Maruan Sahyoun wrote:
> Hi,
>
> On 28.02.2015 at 17:53, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>
>> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>>> Hi,
>>>
>>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>>
>>>> Hi
>>>>
>>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>>> Hi Andrea,
>>>>>
>>>>> While a speed improvement in parsing of large files would be much appreciated
>>>>> (especially by the TIKA users), there are several problems with your change:
>>>> +1
>>>>
>>>>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>>>>> currently.
>>>>>
>>>>> - regressions:
>>>>>
>>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>>
>>>>> Error converting file PDFBOX-2599.pdf
>>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>>
>>>>> - why change only one of the members of that cosobjectkey class to int?
>>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>>> like to know.
>>>> AFAIK there is no good reason not to change both to int.
>>>
>>> As the offset is a 10-digit number, is that really covered by an int?
>> It's about the object number, not the offset. We are using a long for the offset. The spec is quite clear about those numbers: they have to be integers, and the maximum value for an integer within a PDF is 2^31-1, because the assumed default platform for a conforming reader should be 32-bit.
>>
>> BTW, I've changed the object/generation number to int.
>
> Yes, but that's a "should" in the spec and not a "shall", so it's recommended but might not be followed.
Hmm, those values shall be integers, and integers should be 32-bit. So do we
really have to be afraid that someone will exceed that limit?
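
As a rough illustration of what using int for both values means, here is a minimal sketch of an object key with int object and generation numbers. The class name ObjectKeySketch and its methods are hypothetical and not the actual PDFBox class. The point is only that two ints are enough to identify an object under the 2^31-1 limit, and that such a key works as a map key while the offset stays a long.

```java
import java.util.HashMap;
import java.util.Map;

public final class ObjectKeySketch {
    private final int number;      // object number, assumed to be at most 2^31-1
    private final int generation;  // generation number

    public ObjectKeySketch(int number, int generation) {
        this.number = number;
        this.generation = generation;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof ObjectKeySketch)) {
            return false;
        }
        ObjectKeySketch other = (ObjectKeySketch) obj;
        return number == other.number && generation == other.generation;
    }

    @Override
    public int hashCode() {
        // Simple combination of both fields; any consistent mix would do.
        return 31 * number + generation;
    }

    @Override
    public String toString() {
        // Same "number:generation" notation as in the error messages above.
        return number + ":" + generation;
    }

    public static void main(String[] args) {
        // Offsets stay long; only the key fields are int.
        Map<ObjectKeySketch, Long> offsets = new HashMap<ObjectKeySketch, Long>();
        offsets.put(new ObjectKeySketch(3, 0), 9999999999L);
        System.out.println(offsets.get(new ObjectKeySketch(3, 0)));
    }
}
```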

>
>
>>
>>>
>>> BR
>>> Maruan
>>>
>>>>
>>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>>> That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.
>>>>
>>>>>     - your change is too big to evaluate (I'm speaking only for myself there).
>>>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>>>
>>>> I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change
>>>>
>>>>> then the optimization you mention (or the other way around). The parser is
>>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>>>> (splitting methods), but stopped because I couldn't come up with names for the
>>>>> new methods. I just didn't understand what they were doing.
>>>>>
>>>>> Tilman
>>>>
>>>> BR
>>>> Andreas Lehmkühler
>>>>

Re: Xref parsing performance

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

On 28.02.2015 at 17:53, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>> Hi,
>> 
>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>> 
>>> Hi
>>> 
>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>> Hi Andrea,
>>>> 
>>>> While a speed improvement in parsing of large files would be much appreciated
>>>> (especially by the TIKA users), there are several problems with your change:
>>> +1
>>> 
>>>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>>>> currently.
>>>> 
>>>> - regressions:
>>>> 
>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>> 
>>>> Error converting file PDFBOX-2599.pdf
>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>> 
>>>> - why change only one of the members of that cosobjectkey class to int?
>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>> like to know.
>>> AFAIK there is no good reason not to change both to int.
>> 
>> As the offset is a 10-digit number, is that really covered by an int?
> It's about the object number, not the offset. We are using a long for the offset. The spec is quite clear about those numbers: they have to be integers, and the maximum value for an integer within a PDF is 2^31-1, because the assumed default platform for a conforming reader should be 32-bit.
> 
> BTW, I've changed the object/generation number to int.

Yes, but that's a "should" in the spec and not a "shall", so it's recommended but might not be followed.
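
If a file ever did ignore that recommendation, a parser could at least fail loudly instead of overflowing. The following is only a sketch under that assumption: the class ObjectNumberSketch and the method parseObjectNumber are hypothetical names, not existing PDFBox code. It reads the token as a long first and narrows to int only after a range check.

```java
import java.io.IOException;

public class ObjectNumberSketch {
    // Parses a decimal object number and narrows it to int,
    // rejecting values outside 0..2^31-1 instead of silently overflowing.
    static int parseObjectNumber(String token) throws IOException {
        long value;
        try {
            value = Long.parseLong(token);
        } catch (NumberFormatException e) {
            throw new IOException("Not a valid object number: " + token);
        }
        if (value < 0 || value > Integer.MAX_VALUE) {
            throw new IOException("Object number out of int range: " + token);
        }
        return (int) value;
    }

    public static void main(String[] args) throws IOException {
        // A normal object number parses fine.
        System.out.println(parseObjectNumber("3"));
        try {
            // One past 2^31-1: rejected rather than wrapped around.
            parseObjectNumber("2147483648");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```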


> 
>> 
>> BR
>> Maruan
>> 
>>> 
>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>> That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.
>>> 
>>>>    - your change is too big to evaluate (I'm speaking only for myself there).
>>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>> 
>>> I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change
>>> 
>>>> then the optimization you mention (or the other way around). The parser is
>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>>> (splitting methods), but stopped because I couldn't come up with names for the
>>>> new methods. I just didn't understand what they were doing.
>>>> 
>>>> Tilman
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Xref parsing performance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Am 28.02.2015 um 17:49 schrieb Maruan Sahyoun:
> Hi,
>
> Am 28.02.2015 um 17:32 schrieb Andreas Lehmkuehler <an...@lehmi.de>:
>
>> Hi
>>
>> Am 28.02.2015 um 16:47 schrieb Tilman Hausherr:
>>> Hi Andrea,
>>>
>>> While a speed improvement in parsing of large files would be much appreciated
>>> (especially by the TIKA users), there are several problems with your change:
>> +1
>>
>>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>>> currently.
>>>
>>> - regressions:
>>>
>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>
>>>
>>> Error converting file PDFBOX-2599.pdf
>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>
>>>
>>> - why change only one of the members of that COSObjectKey class to int?
>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>> like to know.
>> AFAIK there is no good reason not to change both to int.
>
> As the offset is a 10-digit number, is that really covered by an int?
It's about the object number, not the offset. We are using a long for the offset.
The spec is quite clear about those numbers: they have to be integers, and the
maximum value for an integer within a PDF is 2^31-1, because the assumed
default platform for a conforming reader is 32-bit.
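
To make the two ranges concrete, here is a minimal sketch. It is not PDFBox
code: the class, constant and method names are made up for illustration. It
only checks that the 2^31-1 limit quoted above is exactly what an int can
hold, while the largest value a 10-digit offset field can carry is not.

```java
class XrefFieldRanges {
    // Largest value a 10-digit decimal offset field can hold.
    static final long MAX_TEN_DIGIT_OFFSET = 9999999999L;

    // Upper limit for integers in a PDF as quoted in this thread: 2^31 - 1.
    static final long MAX_PDF_INTEGER = (1L << 31) - 1;

    // True if the value can be stored in an int without losing information.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("object number limit fits in int: "
                + fitsInInt(MAX_PDF_INTEGER));      // true
        System.out.println("10-digit offset fits in int: "
                + fitsInInt(MAX_TEN_DIGIT_OFFSET)); // false
    }
}
```

So an int is enough for object and generation numbers, but an offset still
needs a long.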

BTW, I've changed the object/generation number to int.

>
> BR
> Maruan
>
>>
>>> - even if you get rid of the regressions, a remaining problem is that
>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>> That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.
>>
>>>     - your change is too big to evaluate (I'm speaking only for myself there).
>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>
>> I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change
>>
>>> then the optimization you mention (or the other way around). The parser is
>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>> (splitting methods), but stopped because I couldn't come up with names for the
>>> new methods. I just didn't understand what they were doing.
>>>
>>> Tilman
>>
>> BR
>> Andreas Lehmkühler
>>

Re: Xref parsing performance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Am 28.02.2015 um 17:58 schrieb Tilman Hausherr:
> Am 28.02.2015 um 17:49 schrieb Maruan Sahyoun:
>>>> - why change only one of the members of that COSObjectKey class to int?
>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>> like to know.
>>> AFAIK there is no good reason not to change both to int.
>> As the offset is a 10-digit number, is that really covered by an int?
>
> I would have waited for his explanation... and now I understand what he may have
> thought - consider this change you did:
>
> -                                        new COSObjectKey(-fileOffset, 0));
> +                                        new COSObjectKey((int)-fileOffset, 0));
> the first parameter has a double usage within PDFBox: if negative, it is an
> offset in an object stream. Do we know that these are always smaller than
> 0x7FFFFFFF?
Yeah, you are right, and I'm not happy with that change either. Maybe I should
have added a comment. We should refactor that part too; I already have something
in mind ...


BR
Andreas

> Tilman

Re: Xref parsing performance

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 28.02.2015 um 17:49 schrieb Maruan Sahyoun:
>>> - why change only one of the members of that COSObjectKey class to int?
>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>> like to know.
>> AFAIK there is no good reason not to change both to int.
> As the offset is a 10-digit number, is that really covered by an int?

I would have waited for his explanation... and now I understand what he 
may have thought - consider this change you did:

-                                        new COSObjectKey(-fileOffset, 0));
+                                        new COSObjectKey((int)-fileOffset, 0));

the first parameter has a double usage within PDFBox: if negative, it is
an offset in an object stream. Do we know that these are always smaller
than 0x7FFFFFFF?
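
A small sketch of what could go wrong with such a cast. The class, the helper
toIntChecked and the offset value are hypothetical and not taken from PDFBox
or from a real file. Math.toIntExact would perform the same check but needs
Java 8, so the range test is written out by hand.

```java
class NarrowingCheck {
    // Converts a long to int, failing loudly instead of wrapping around.
    static int toIntChecked(long value) {
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "value does not fit in int: " + value);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        // Hypothetical offset just beyond 0x7FFFFFFF.
        long fileOffset = 0x80000001L;

        // The plain cast keeps only the low 32 bits, so the negated offset
        // silently turns into a positive number.
        System.out.println((int) -fileOffset); // prints 2147483647

        try {
            toIntChecked(-fileOffset);
        } catch (IllegalArgumentException e) {
            // prints "value does not fit in int: -2147483649"
            System.out.println(e.getMessage());
        }
    }
}
```

With the plain cast, a value meant to be negative would come out positive, so
it could no longer be told apart from a regular object number.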

Tilman


Re: Xref parsing performance

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

Am 28.02.2015 um 17:32 schrieb Andreas Lehmkuehler <an...@lehmi.de>:

> Hi
> 
> Am 28.02.2015 um 16:47 schrieb Tilman Hausherr:
>> Hi Andrea,
>> 
>> While a speed improvement in parsing of large files would be much appreciated
>> (especially by the TIKA users), there are several problems with your change:
> +1
> 
>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>> currently.
>> 
>> - regressions:
>> 
>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>     at junit.framework.TestResult.run(TestResult.java:125)
>>     at junit.framework.TestCase.run(TestCase.java:129)
>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>
>>
>> Error converting file PDFBOX-2599.pdf
>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>     at junit.framework.TestResult.run(TestResult.java:125)
>>     at junit.framework.TestCase.run(TestCase.java:129)
>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>> 
>> 
>> - why change only one of the members of that COSObjectKey class to int?
>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>> like to know.
> AFAIK there is no good reason not to change both to int.

As the offset is a 10-digit number, is that really covered by an int?

BR
Maruan

> 
>> - even if you get rid of the regressions, a remaining problem is that
>>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
> That's not a problem. For now I'm focused on the parsing process itself and am working on one last piece, the rebuild mechanism.
> 
>>    - your change is too big to evaluate (I'm speaking only for myself there).
>> It would be better to first submit only small refactorings in PDFBOX-2576, and
> 
> I agree. We should try to break up the patch into smaller pieces if possible. Let's start with the long -> int change
> 
>> then the optimization you mention (or the other way around). The parser is
>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>> also flagged it as too complex). I did some refactorings a few weeks ago there
>> (splitting methods), but stopped because I couldn't come up with names for the
>> new methods. I just didn't understand what they were doing.
>> 
>> Tilman
> 
> BR
> Andreas Lehmkühler
> 

Re: Xref parsing performance

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi

Am 28.02.2015 um 16:47 schrieb Tilman Hausherr:
> Hi Andrea,
>
> While a speed improvement in parsing of large files would be much appreciated
> (especially by the TIKA users), there are several problems with your change:
+1

> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
> currently.
>
> - regressions:
>
> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:606)
>      at junit.framework.TestCase.runTest(TestCase.java:176)
>      at junit.framework.TestCase.runBare(TestCase.java:141)
>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>      at junit.framework.TestResult.run(TestResult.java:125)
>      at junit.framework.TestCase.run(TestCase.java:129)
>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>      at junit.framework.TestSuite.run(TestSuite.java:250)
>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>      at junit.textui.TestRunner.start(TestRunner.java:183)
>      at junit.textui.TestRunner.main(TestRunner.java:137)
>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> Error converting file PDFBOX-2599.pdf
> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:606)
>      at junit.framework.TestCase.runTest(TestCase.java:176)
>      at junit.framework.TestCase.runBare(TestCase.java:141)
>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>      at junit.framework.TestResult.run(TestResult.java:125)
>      at junit.framework.TestCase.run(TestCase.java:129)
>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>      at junit.framework.TestSuite.run(TestSuite.java:250)
>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>      at junit.textui.TestRunner.start(TestRunner.java:183)
>      at junit.textui.TestRunner.main(TestRunner.java:137)
>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>
>
> - why change only one of the members of that COSObjectKey class to int?
> According to the spec, both are integers. Maybe there's a good reason, but I'd
> like to know.
AFAIK there is no good reason not to change both to int.

> - even if you get rid of the regressions, a remaining problem is that
>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
That's not a problem. For now I'm focused on the parsing process itself and am 
working on one last piece, the rebuild mechanism.

>     - your change is too big to evaluate (I'm speaking only for myself there).
> It would be better to first submit only small refactorings in PDFBOX-2576, and

I agree. We should try to break up the patch into smaller pieces if possible. 
Let's start with the long -> int change
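
A rough sketch of what that change could look like for a key type. ObjectKey
below is a hypothetical class written for illustration, not the actual
COSObjectKey source. Object and generation numbers are held as int, while
file offsets stay long as map values. It avoids Java 7+ features because of
the JDK6 constraint mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: both members are plain ints.
final class ObjectKey {
    private final int number;
    private final int generation;

    ObjectKey(int number, int generation) {
        this.number = number;
        this.generation = generation;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ObjectKey)) {
            return false;
        }
        ObjectKey that = (ObjectKey) other;
        return number == that.number && generation == that.generation;
    }

    @Override
    public int hashCode() {
        return 31 * number + generation;
    }

    @Override
    public String toString() {
        return number + ":" + generation;
    }
}

class ObjectKeyDemo {
    public static void main(String[] args) {
        // Keys are ints, offsets remain longs (made-up offset value).
        Map<ObjectKey, Long> offsets = new HashMap<ObjectKey, Long>();
        offsets.put(new ObjectKey(3, 0), Long.valueOf(9999999999L));
        System.out.println(new ObjectKey(3, 0) + " -> "
                + offsets.get(new ObjectKey(3, 0)));
    }
}
```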

> then the optimization you mention (or the other way around). The parser is
> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
> also flagged it as too complex). I did some refactorings a few weeks ago there
> (splitting methods), but stopped because I couldn't come up with names for the
> new methods. I just didn't understand what they were doing.
>
> Tilman

BR
Andreas Lehmkühler


Re: Xref parsing performance

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi Andrea,

While a speed improvement in parsing of large files would be much 
appreciated (especially by the TIKA users), there are several problems 
with your change:

- don't do changes that need JDK7 or higher even if they are cool. We 
use JDK6 currently.

- regressions:

Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
java.io.IOException: XREF for 3:0 points to wrong object: 1:0
     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at junit.framework.TestCase.runTest(TestCase.java:176)
     at junit.framework.TestCase.runBare(TestCase.java:141)
     at junit.framework.TestResult$1.protect(TestResult.java:122)
     at junit.framework.TestResult.runProtected(TestResult.java:142)
     at junit.framework.TestResult.run(TestResult.java:125)
     at junit.framework.TestCase.run(TestCase.java:129)
     at junit.framework.TestSuite.runTest(TestSuite.java:255)
     at junit.framework.TestSuite.run(TestSuite.java:250)
     at junit.textui.TestRunner.doRun(TestRunner.java:116)
     at junit.textui.TestRunner.start(TestRunner.java:183)
     at junit.textui.TestRunner.main(TestRunner.java:137)
     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)


Error converting file PDFBOX-2599.pdf
java.io.IOException: XREF for 2:0 points to wrong object: 1:0
     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at junit.framework.TestCase.runTest(TestCase.java:176)
     at junit.framework.TestCase.runBare(TestCase.java:141)
     at junit.framework.TestResult$1.protect(TestResult.java:122)
     at junit.framework.TestResult.runProtected(TestResult.java:142)
     at junit.framework.TestResult.run(TestResult.java:125)
     at junit.framework.TestCase.run(TestCase.java:129)
     at junit.framework.TestSuite.runTest(TestSuite.java:255)
     at junit.framework.TestSuite.run(TestSuite.java:250)
     at junit.textui.TestRunner.doRun(TestRunner.java:116)
     at junit.textui.TestRunner.start(TestRunner.java:183)
     at junit.textui.TestRunner.main(TestRunner.java:137)
     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)


- Why change only one of the members of the COSObjectKey class to int?
According to the spec, both are integers. Maybe there's a good reason,
but I'd like to know.

- Even if you get rid of the regressions, two problems remain:
    - Andreas L. is currently working on some parser stuff in PDFBOX-2527.
    - Your change is too big to evaluate (I'm speaking only for myself
here). It would be better to first submit only small refactorings in
PDFBOX-2576, and then the optimization you mention (or the other way
around). The parser is indeed a tricky part of the code (SonarQube and
Software Diagnostics have also flagged it as too complex). I did some
refactorings there a few weeks ago (splitting methods), but stopped
because I couldn't come up with names for the new methods: I just
didn't understand what they were doing.

Tilman




Re: Xref parsing performance

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Looked at it quickly - very nice!
 
Maruan
