Posted to users@pdfbox.apache.org by Rodrigo Caniçali <ro...@yahoo.com.br> on 2013/11/01 22:55:45 UTC

WARNING: Did not found XRef object at specified startxref position

Hi,

I found a mailing list thread from 14 June 2012 where this problem was already discussed, but my case is rather different.

I also get the warning "Did not found XRef object at specified startxref position xxx" when running the main function of the org.apache.pdfbox.ExtractText class. In addition, some of the PDF's text is ignored and not written to the output TXT file. Acrobat Reader displays this same text, and the user can copy it from there as text.

If the "-nonSeq" option is selected, a "java.io.IOException: Error: Expected a long type, actual=..." is thrown, which stops the text extraction.

Please, is there any way to make it work?

Thanks,

Rodrigo

Re: WARNING: Did not found XRef object at specified startxref position

Posted by Thomas Chojecki <in...@rayman2200.de>.
Hi Rodrigo,
as Maruan already told you, we follow ISO 32000. That document isn't
free, so you can use this one instead:

http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

Best regards
Thomas

Quoting Rodrigo Caniçali <ro...@yahoo.com.br>:

> Thomas,
>
> I found several PDF specifications on the net.
>
> Which PDF specification does the PDFBox library follow?
>
> Thanks,
>
> Rodrigo




Re: WARNING: Did not found XRef object at specified startxref position

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

PDFBox targets ISO-32000.

BR

Maruan

> On 20.11.2013 at 19:29, Rodrigo Caniçali <ro...@yahoo.com.br> wrote:
> 
> Thomas,
> 
> I found several PDF specifications on the net.
> 
> Which PDF specification does the PDFBox library follow?
> 
> Thanks,
> 
> Rodrigo

Re: WARNING: Did not found XRef object at specified startxref position

Posted by Rodrigo Caniçali <ro...@yahoo.com.br>.
Thomas,

I found several PDF specifications on the net.

Which PDF specification does the PDFBox library follow?

Thanks,

Rodrigo



On Thursday, 14 November 2013 at 11:30, Rodrigo Caniçali <ro...@yahoo.com.br> wrote:
 
Hi Thomas,

There is no such object anywhere in the document. Searching for "/XRef" or "80 0", the editor finds nothing. However, at the end of the document I found the following:

xref
0 47
0000000000 65535 f 
0000000009 00000 n 
0000052584 00000 n 
0000052633 00000 n 
0000009275 00000 n 
0000000199 00000 n 
0000003543 00000 n 
....
0000052345 0000 n 

trailer
<<

/Size 47
/Root 2 0 R
/Info 1 0 R
>>
startxref
52279
%%EOF

Replacing the reference 52279 with 53730, which is the offset of "xref", seems to have solved the xref table position error.

However, the following messages are still displayed and some text is still not extracted:

Loading PDF D:\Documents and Settings\05215385726\rpf_tributos.pdf
Time for loading: 0.094 seconds
Starting text extraction
Writing to D:\Documents and Settings\05215385726\rpf_tributos.txt
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: o
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: Os
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: a
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: su


Also, with the "-nonSeq" option enabled, the error below is displayed:

Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='K`_'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1183)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1130)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:420)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)


I wonder whether I could write a routine to repair a document like this before parsing it with PDFBox, since Acrobat Reader can parse it.

Thanks,

Rodrigo



On Wednesday, 13 November 2013 at 19:49, Thomas Chojecki <in...@rayman2200.de> wrote:
 
Hi Rodrigo,
it looks like the startxref position (52779) is wrong and points into a
stream instead of at the beginning of an xref table or xref stream. The value
inside the exception looks like compressed data, so it might be part of the
xref stream.
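
Whether startxref points at a cross-reference section can also be checked programmatically. The following is a minimal Java sketch, assuming the whole file fits in memory. StartXrefProbe, readStartXref and describeTarget are hypothetical names chosen for illustration and are not part of PDFBox. It reads the number after the last "startxref" keyword and reports whether that offset lands on an "xref" keyword, on an object header such as "80 0 obj" (which may be an xref stream), or on neither.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of PDFBox): inspects where startxref points. */
final class StartXrefProbe {

    /** Returns the number written after the last "startxref" keyword, or -1 if none is found. */
    static long readStartXref(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    /** Classifies the bytes found at the given offset. */
    static String describeTarget(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return "out of range";
        }
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int end = (int) Math.min(pdf.length, offset + 32);
        String head = text.substring((int) offset, end);
        if (head.startsWith("xref")) {
            return "xref table";
        }
        // "<object number> <generation> obj" starts an indirect object, e.g. an xref stream.
        if (head.matches("(?s)\\d+\\s+\\d+\\s+obj.*")) {
            return "object header (possibly an xref stream)";
        }
        return "neither";
    }
}
```

A caller would load the file with java.nio.file.Files.readAllBytes and pass the result of readStartXref to describeTarget. An answer of "neither" corresponds to the situation described here, where the offset points into the middle of other data.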

You can open a hex editor, jump directly to position 52779, and
look for an object that looks like this:

,---

80 0 obj <<
/Type /XRef
/Index [0 424]
/Size 424
/W [1 3 1]
/Root 421 0 R
/Info 422 0 R
/ID [<14895AE8C3218939710EBBFF5EAD0E28> <14895AE8C3218939710EBBFF5EAD0E28>]
/Length 1073
/Filter /FlateDecode
>>
stream
...
endstream
endobj

`---

If you find this object with /Type /XRef, go to the
beginning of it (in this example, the "80 0 obj") and note the byte offset
of that position. Then go to the end of the file, overwrite
the startxref value 52779 with the offset you noted, and try to parse
the document again.

This should work, and it would indicate that the PDF creator you are using
writes wrong object positions. PDFBox can only parse documents that
provide correct xref tables or streams; otherwise the parser does not
know how to handle the document.
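
For a file that uses a classic xref table rather than an xref stream, the manual fix could be sketched as a small routine. This is only an illustration under several assumptions: the file has a single cross-reference section (no incremental updates), the last "xref" that starts a line is the real table and not bytes inside a stream, and the file fits in memory. StartXrefRepair and its methods are hypothetical names, not PDFBox API. Because the startxref value sits after the xref table at the very end of the file, writing a number with a different digit count does not shift any object offset.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical repair sketch (not part of PDFBox) for a wrong startxref value. */
final class StartXrefRepair {

    /** Offset of the last "xref" keyword that starts a line, or -1 if there is none. */
    static int findLastXrefKeyword(String text) {
        int pos = text.lastIndexOf("xref");
        while (pos >= 0) {
            // The "xref" inside "startxref" is preceded by 't', so it is skipped here.
            boolean atLineStart = pos == 0
                    || text.charAt(pos - 1) == '\n'
                    || text.charAt(pos - 1) == '\r';
            if (atLineStart) {
                return pos;
            }
            pos = text.lastIndexOf("xref", pos - 1);
        }
        return -1;
    }

    /** Returns a copy whose startxref value points at the last "xref" keyword. */
    static byte[] repair(byte[] pdf) {
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int xref = findLastXrefKeyword(text);
        int keyword = text.lastIndexOf("startxref");
        if (xref < 0 || keyword < 0) {
            return pdf; // nothing this sketch can fix
        }
        int digitsStart = keyword + "startxref".length();
        while (digitsStart < text.length() && Character.isWhitespace(text.charAt(digitsStart))) {
            digitsStart++;
        }
        int digitsEnd = digitsStart;
        while (digitsEnd < text.length()
                && text.charAt(digitsEnd) >= '0' && text.charAt(digitsEnd) <= '9') {
            digitsEnd++;
        }
        String fixed = text.substring(0, digitsStart) + xref + text.substring(digitsEnd);
        return fixed.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The result should be written to a new file so that the original stays untouched. A corrected startxref does not repair wrong offsets inside the table itself.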

Best regards
Thomas



Quoting Rodrigo Caniçali <ro...@yahoo.com.br>:

> Hi Thomas,
>
> Below is the stacktrace when the option “-nonSeq” is enabled:
>
> Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
> Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='!@:g8lJLDX5I'H%oMioAqC?O$d[,X]%dZ#a?Wos'
> at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
> at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:460)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:358)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
> at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
> at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>
>
> When that option is disabled, the following warnings are printed on the
> Eclipse console and some of the PDF's text is not extracted:
>
> Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
> Nov 04, 2013 10:16:13 AM org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
> WARNING: Did not found XRef object at specified startxref position 52779
> Time for loading: 0.125 seconds
> Starting text extraction
> Writing to D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.txt
> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: o
> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: Os
> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: a
> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: su
>
> Thanks,
>
> Rodrigo
>
>
>
> On Saturday, 2 November 2013 at 10:24, Rodrigo Caniçali <ro...@yahoo.com.br> wrote:
>
> Hi Thomas,
>
> Thanks for your answer.
>
> I am afraid the document is confidential, but I can provide the
> stack trace and, on Monday when I am back at the office, find out
> whether it is possible to produce a non-confidential example.
>
> Best regards,
> Rodrigo
>
>
>
>
>
> On Saturday, 2 November 2013 at 5:50, Thomas Chojecki <in...@rayman2200.de> wrote:
>
>
> Quoting Rodrigo Caniçali <ro...@yahoo.com.br>:
>
>> Hi,
> Hi Rodrigo,
>
>> I found a mailing list thread from 14 June 2012 where this problem was
>> already discussed, but my case is rather different.
> I think I found the discussion.
>
>> I also get the warning "Did not found XRef object at specified
>> startxref position xxx" when running the main function
>> of the org.apache.pdfbox.ExtractText class. In addition, some of the PDF's
>> text is ignored and not written to the output TXT file. Acrobat Reader
>> displays this same text, and the user can copy it from there as
>> text.
>
> Your document is broken. It works with Acrobat Reader because Acrobat
> Reader is not strict about the specification.
>
> Many developers who write a PDF writer test it against
> Acrobat Reader and do not always follow the specification. Their goal
> becomes producing documents that Acrobat Reader accepts rather than
> documents that conform to the specification. As a result, third-party
> libraries like PDFBox sometimes cannot parse such documents.
>
> In your case the xref table isn't where the parser expects
> it. If you can provide us with such a document, we can try to find the cause
> of the problem and maybe fix it.
>
>>
>> If the "-nonSeq" option is selected, a
>> "java.io.IOException: Error: Expected a long type, actual=..." is thrown,
>> which stops the text extraction.
> Maybe you can post the first three lines of the stack trace; this
> will help with debugging the problem.
>
>> Please, is there any way to make it work?
> Reconstructing such cases is nearly impossible. If you can provide
> us with more information or maybe the document, it will help us improve
> the parser, if possible.
>
> We do our best to support as many documents as we can, but in some
> cases we need to be strict so that documents which currently parse
> fine keep working. This problem is also one item on the agenda for
> PDFBox 2.0.0.
>
>
>>
>> Thanks,
>>
>> Rodrigo
>
> Best regards
> Thomas

Re: WARNING: Did not found XRef object at specified startxref position

Posted by Thomas Chojecki <in...@rayman2200.de>.
Quoting Rodrigo Caniçali <ro...@yahoo.com.br>:

> Hi Thomas,
>
> There is no such object at the whole document. Looking for the  
> keyword "/XRef" or "80 0", the editor cannot find them
The "80 0" was only an example. If you can't find /XRef, there is
no xref stream, only the xref table. The table looks invalid at
first glance. Normally the objects are sorted, so the first object
appears first in the document, the second one next, and so on.
The offsets of the objects should therefore run from
small to large numbers. In your document the objects seem to be
shuffled. I haven't seen such a document in all the time I've worked with PDFs.
To me the file is completely broken. Maybe the file was previously
merged or otherwise processed so that the structure broke; I have no idea.
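
One property of a classic xref table that can be checked mechanically is the entry format. ISO 32000-1 (section 7.5.4) gives each entry a fixed layout: a 10-digit byte offset, a space, a 5-digit generation number, a space, and "n" or "f", followed by a two-character end-of-line. The Java sketch below tests lines against that layout. XrefEntryCheck is a hypothetical name, not PDFBox API, and trimming the line is a simplification that ignores the exact end-of-line bytes. The last entry quoted in this thread, "0000052345 0000 n", would be flagged because its generation number has only four digits, though that may simply be a transcription slip in the mail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical checker (not part of PDFBox) for classic xref table entries. */
final class XrefEntryCheck {

    // 10-digit offset, space, 5-digit generation number, space, 'n' (in use) or 'f' (free).
    private static final Pattern ENTRY = Pattern.compile("\\d{10} \\d{5} [nf]");

    /** True if the line, ignoring surrounding whitespace, has the fixed entry layout. */
    static boolean isWellFormed(String line) {
        return ENTRY.matcher(line.trim()).matches();
    }

    /** Returns, in input order, the lines that do not match the fixed entry layout. */
    static List<String> malformedEntries(List<String> lines) {
        List<String> bad = new ArrayList<>();
        for (String line : lines) {
            if (!isWellFormed(line)) {
                bad.add(line);
            }
        }
        return bad;
    }
}
```

This only validates the shape of each entry. It does not verify that an offset actually points at the start of the matching object.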

There is no way for PDFBox to parse such files, and I can't imagine that
we will be able to parse them in the future.

Sorry that I could not help you with this one.

Best regards
Thomas

> anywhere. However I could find at the end of the document the following code:
>
> xref
> 0 47
> 0000000000 65535 f 
> 0000000009 00000 n 
> 0000052584 00000 n 
> 0000052633 00000 n 
> 0000009275 00000 n 
> 0000000199 00000 n 
> 0000003543 00000 n 
> ....
> 0000052345 0000 n 
>
> trailer
> <<
>
> /Size 47
> /Root 2 0 R
> /Info 1 0 R
>>>
> startxref
> 52279
> %%EOF
>
> Changing the reference 52279 by 53730 which is the address of  
> "xref", it seems that the xref table position error has been solved. 
>
> But the following warning is still been displayed and some text are  
> still not been extracted:
>
> Loading PDF D:\Documents and Settings\05215385726\rpf_tributos.pdf
> Time for loading: 0.094 seconds
> Starting text extraction
> Writing to D:\Documents and Settings\05215385726\rpf_tributos.txt
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine  
> processOperator
> INFO: unsupported/disabled operation: o
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine  
> processOperator
> INFO: unsupported/disabled operation: Os
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine  
> processOperator
> INFO: unsupported/disabled operation: a
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine  
> processOperator
> INFO: unsupported/disabled operation: su
>
>
> Also, with the "-nonSeq" option enabled, the error below is displayed:
>
> Loading PDF D:\Documents and Settings\05215385726\Meus  
> documentos\rpf_tributos.pdf
> Exception in thread "main" java.io.IOException: Error: Expected a  
> long type, actual='K`_'
> at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
> at  
> org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
> at  
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1183)
> at  
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1130)
> at  
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:420)
> at  
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
> at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
> at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>
>
> I wonder if I could write a routine to fix a document like this
> before parsing it with PDFBox, since it can be parsed by Acrobat
> Reader.
>
> Thanks,
>
> Rodrigo
>
>
>




Re: WARNING: Did not found XRef object at specified startxref position

Posted by Rodrigo Caniçali <ro...@yahoo.com.br>.
Hi Thomas,

There is no such object anywhere in the document. When I search for the keywords "/XRef" or "80 0", the editor finds nothing. However, I did find the following code at the end of the document:

xref
0 47
0000000000 65535 f 
0000000009 00000 n 
0000052584 00000 n 
0000052633 00000 n 
0000009275 00000 n 
0000000199 00000 n 
0000003543 00000 n 
....
0000052345 0000 n 

trailer
<<

/Size 47
/Root 2 0 R
/Info 1 0 R
>>
startxref
52279
%%EOF
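
For reference, each entry in a classic cross-reference table is a 10-digit byte offset, a 5-digit generation number and the keyword n (in use) or f (free). Below is a minimal sketch for checking pasted entries against that layout. It uses only the Java standard library; the class and method names are hypothetical and are not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: checks the layout of classic xref entries. */
final class XrefEntryCheck {
    // 10-digit offset, space, 5-digit generation, space, 'n' (in use) or 'f' (free).
    private static final Pattern ENTRY = Pattern.compile("(\\d{10}) (\\d{5}) ([nf])");

    /** Returns the byte offset of an in-use entry, -1 for a free entry; throws if malformed. */
    static long parseEntry(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed xref entry: '" + line + "'");
        }
        return m.group(3).equals("n") ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String[] entries = {
            "0000000000 65535 f ",
            "0000000009 00000 n ",
            "0000052584 00000 n ",
            "0000052345 0000 n "   // copied as pasted above
        };
        for (String e : entries) {
            try {
                System.out.println(e.trim() + " -> offset " + parseEntry(e));
            } catch (IllegalArgumentException ex) {
                System.out.println(ex.getMessage());
            }
        }
    }
}
```

The last entry is rejected because, as pasted above, its generation number has only four digits. Whether that is in the file itself or a slip while copying into the mail cannot be told from here.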

After replacing the reference 52279 with 53730, which is the address of "xref", the xref table position error seems to be solved.

But the following messages are still being displayed and some text is still not being extracted:

Loading PDF D:\Documents and Settings\05215385726\rpf_tributos.pdf
Time for loading: 0.094 seconds
Starting text extraction
Writing to D:\Documents and Settings\05215385726\rpf_tributos.txt
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: o
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: Os
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: a
Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: su


Also, with the "-nonSeq" option enabled, the error below is displayed:

Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='K`_'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1183)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1130)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:420)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)


I wonder if I could write a routine to fix a document like this before parsing it with PDFBox, since it can be parsed by Acrobat Reader.
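
A pre-processing routine along those lines could look roughly like the following. It is only a sketch under assumptions: it handles a single classic "xref" table (no xref streams, no incremental updates), the class and method names are made up, and nothing in it is PDFBox API.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical pre-processing helper, not part of PDFBox. */
final class StartXrefFixer {

    private static boolean isEol(char c) {
        return c == '\n' || c == '\r';
    }

    /** Returns a copy of the PDF whose startxref value points at the last "xref" keyword. */
    static byte[] fix(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        int startxref = text.lastIndexOf("startxref");
        if (startxref < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }

        // Walk backwards to the last "xref" that begins a line; searching from
        // startxref - 1 keeps the "xref" inside "startxref" itself from matching.
        int xref = startxref;
        do {
            xref = text.lastIndexOf("xref", xref - 1);
        } while (xref > 0 && !isEol(text.charAt(xref - 1)));
        if (xref < 0) {
            throw new IllegalArgumentException("no xref table found");
        }

        // Locate the digits that follow the startxref keyword.
        int numStart = startxref + "startxref".length();
        while (numStart < text.length() && Character.isWhitespace(text.charAt(numStart))) {
            numStart++;
        }
        int numEnd = numStart;
        while (numEnd < text.length() && Character.isDigit(text.charAt(numEnd))) {
            numEnd++;
        }

        // The value sits after every object, so changing its digit count moves nothing else.
        String fixed = text.substring(0, numStart) + xref + text.substring(numEnd);
        return fixed.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java StartXrefFixer broken.pdf repaired.pdf
        byte[] repaired = fix(Files.readAllBytes(Paths.get(args[0])));
        Files.write(Paths.get(args[1]), repaired);
    }
}
```

This only corrects the startxref value. If the object offsets inside the table are wrong as well, they would need a separate pass.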

Thanks,

Rodrigo




Re: WARNING: Did not found XRef object at specified startxref position

Posted by Thomas Chojecki <in...@rayman2200.de>.
Hi Rodrigo,
it looks like the startxref position (52779) is wrong and points into a
stream instead of at the beginning of an xref table or xref stream. The
value shown in the exception looks like compressed data, and it might
be the xref stream.

You can open a hex editor, jump directly to position 52779 and look
for an object that looks something like this:

,---

80 0 obj <<
/Type /XRef
/Index [0 424]
/Size 424
/W [1 3 1]
/Root 421 0 R
/Info 422 0 R
/ID [<14895AE8C3218939710EBBFF5EAD0E28> <14895AE8C3218939710EBBFF5EAD0E28>]
/Length 1073
/Filter /FlateDecode
>>
stream
...
endstream
endobj

`---
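
For orientation, the /W [1 3 1] array in the example gives the byte widths of the three fields of each entry in the decoded (inflated) stream data: a type, followed by two fields whose meaning depends on the type (for type 1, the byte offset and the generation number). The fields are stored high-order byte first. Below is a minimal sketch of decoding such entries, assuming the stream has already been inflated; the names are hypothetical and this is not PDFBox API.

```java
/** Hypothetical decoder for xref stream entries, not part of PDFBox. */
final class XrefStreamEntries {

    /** Reads one big-endian unsigned field of the given width starting at pos. */
    static long readField(byte[] data, int pos, int width) {
        long value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 8) | (data[pos + i] & 0xFF);
        }
        return value;
    }

    /**
     * Splits inflated stream data into rows of {type, field2, field3} using the /W widths.
     * A width of 0 means "use the default value", which this sketch does not handle.
     */
    static long[][] decode(byte[] data, int[] w) {
        int rowLen = w[0] + w[1] + w[2];
        long[][] rows = new long[data.length / rowLen][3];
        for (int r = 0; r < rows.length; r++) {
            int pos = r * rowLen;
            for (int f = 0; f < 3; f++) {
                rows[r][f] = readField(data, pos, w[f]);
                pos += w[f];
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two made-up entries for /W [1 3 1]: a free entry, then an object at 0x00CD68 = 52584.
        byte[] data = {0, 0, 0, 0, (byte) 0xFF, 1, 0x00, (byte) 0xCD, 0x68, 0};
        for (long[] row : decode(data, new int[] {1, 3, 1})) {
            System.out.println("type=" + row[0] + " field2=" + row[1] + " field3=" + row[2]);
        }
    }
}
```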

If you find this object with the /Type /XRef, go to the beginning of
it, in this case the "80 0 obj", and write down the position of this
object. Then go to the end of the file, overwrite the startxref value
52779 with the position you noted and try to parse the document again.

This should work, and it would indicate that the PDF creator you are
using writes wrong object positions. PDFBox can only parse documents
that provide correct xref tables / streams; otherwise the parser does
not know how to handle the document.
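
The check done by hand in the hex editor can also be sketched in a few lines: read the last startxref value, jump to that byte offset and see whether an "xref" keyword or an "N G obj" header starts there. The sketch below uses only the Java standard library. The names are hypothetical, and this is not how PDFBox itself does it.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic, not part of PDFBox. */
final class StartXrefProbe {
    private static final Pattern STARTXREF = Pattern.compile("startxref\\s+(\\d+)");
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");

    /** Describes what the last startxref value in the file points at. */
    static String probe(byte[] pdf) {
        // ISO-8859-1 keeps one char per byte, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = STARTXREF.matcher(text);
        long offset = -1;
        while (m.find()) {
            offset = Long.parseLong(m.group(1)); // keep the last one in the file
        }
        if (offset < 0 || offset >= text.length()) {
            return "startxref missing or outside the file";
        }
        String at = text.substring((int) offset);
        if (at.startsWith("xref")) {
            return "xref table at " + offset;
        }
        if (OBJ_HEADER.matcher(at).lookingAt()) {
            return "object header at " + offset + " (possibly an xref stream)";
        }
        return "neither xref table nor object at " + offset;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java StartXrefProbe document.pdf
        System.out.println(probe(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

An object header at the offset is not proof of an xref stream; the dictionary still has to contain /Type /XRef, as in the example above.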

Best regards
Thomas






Re: WARNING: Did not found XRef object at specified startxref position

Posted by Rodrigo Caniçali <ro...@yahoo.com.br>.
Hi Thomas,

Below is the stack trace when the "-nonSeq" option is enabled:

Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='!@:g8lJLDX5I'H%oMioAqC?O$d[,X]%dZ#a?Wos'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:460)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:358)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)


When that option is disabled, the following warnings are printed on the Eclipse console and some text of the PDF document is not extracted:

Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
Nov 04, 2013 10:16:13 AM org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
WARNING: Did not found XRef object at specified startxref position 52779
Time for loading: 0.125 seconds
Starting text extraction
Writing to D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.txt
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: o
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: Os
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: a
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: su

Thanks,

Rodrigo




Re: WARNING: Did not found XRef object at specified startxref position

Posted by Rodrigo Caniçali <ro...@yahoo.com.br>.
Hi Thomas,

Thanks for your answer.

I am afraid the document is confidential, but I can provide the stack trace and find out whether it is possible to generate a non-confidential example on Monday, when I will be at the office again.

Best regards,
Rodrigo





Re: WARNING: Did not found XRef object at specified startxref position

Posted by Thomas Chojecki <in...@rayman2200.de>.
Quoting Rodrigo Caniçali <ro...@yahoo.com.br>:

> Hi,
Hi Rodrigo,

> I found on a mailing list of 2012-jun-14 that this problem has been  
> already discussed, but here is pretty different.
I think I found the discussion.

> I also get the warning "Did not found XRef object at specified
> startxref position xxx" when executing the main function
> of the org.apache.pdfbox.ExtractText class. However, some PDF texts are
> ignored and are not printed to the output TXT file. These same texts
> are displayed by Acrobat Reader and can be copied by the user as
> text from this program.

Your document is broken; it works with Acrobat Reader because Acrobat
is not strict about the specification.

Many developers who write a PDF writer test it against Acrobat Reader
and do not always follow the specification. So the goal becomes
producing documents that Acrobat Reader accepts, not documents that
conform to the specification. This leads to the problem that
third-party libraries like PDFBox sometimes cannot parse such documents.

In your case the xref table is not where the parser expects it. If you
can provide us with such a document, we can try to find the cause of
the problem and maybe fix it.

>
> If the option "-nonSeq" is selected, then appears a  
> "java.io.IOException: Error: Expected a long type, actual=..." which  
> stops the text extraction.
Maybe you can post the first three lines of the stack trace; this
will help with debugging the problem.

> Please, is there any way to make it work?
It is nearly impossible to reconstruct such cases. If you can provide
us with more information or maybe the document, it will help us improve
the parser, if possible.

We do our best to support as many documents as we can, but in some
cases we need to be strict so that documents which already parse fine
keep working. This problem is also on the agenda for PDFBox 2.0.0.

>
> Thanks,
>
> Rodrigo

Best regards
Thomas