Posted to users@pdfbox.apache.org by Malcolm Vincent <ma...@gmail.com> on 2017/11/09 08:53:06 UTC

Adobe InDesign PDF

Hi,

I've been using PDFBox to read and write PDFs successfully for a while
and have started running into a few issues recently.

I seem to be getting the following errors when loading PDFs generated
in Adobe InDesign / Acrobat Distiller (the PDFs render fine in Acrobat
Reader, pdf.js and Chrome).

The first one seems to be a UI thing for the PDFReader function so I'm
ignoring it.

The second and third are the problem. They are both related. I get
them when I use PDFBox in my own code as well as in the app, but since
they are warnings they do not flag up as runtime errors I can catch.

#1
Nov 09, 2017 8:31:45 AM java.util.prefs.WindowsPreferences <init>
WARNING: Could not open/create prefs root node Software\JavaSoft\Prefs
at root 0x80000002. Windows RegCreateKeyEx(...) returned error code 5.

#2
Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
parseCOSDictionaryNameValuePair
WARNING: Bad Dictionary Declaration
org.apache.pdfbox.pdfparser.InputStreamSource@498fa7e0

#3
Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
parseCOSDictionary
WARNING: Invalid dictionary, found: '?' but expected: '/' at offset 2861
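
The warnings above are log records rather than exceptions, so they cannot
be caught with try/catch. A minimal sketch of one way to observe them from
calling code, assuming the logging backend in use is java.util.logging (the
format of the quoted output suggests so, but that is an assumption). The
class name WarningCollector and the method attachTo() are hypothetical
helpers, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Collects WARNING-or-worse log records so calling code can react to them. */
final class WarningCollector extends Handler {
    private final List<String> messages = new ArrayList<>();
    // Strong reference: keeps the logger (and this handler) from being garbage collected.
    private Logger logger;

    /** Attaches a new collector to the logger with the given name. */
    public static WarningCollector attachTo(String loggerName) {
        WarningCollector collector = new WarningCollector();
        collector.logger = Logger.getLogger(loggerName);
        collector.logger.addHandler(collector);
        return collector;
    }

    @Override
    public synchronized void publish(LogRecord record) {
        if (record != null && record.getLevel().intValue() >= Level.WARNING.intValue()) {
            messages.add(record.getMessage());
        }
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    /** Returns a copy of the warning messages collected so far. */
    public synchronized List<String> messages() {
        return new ArrayList<>(messages);
    }

    public static void main(String[] args) {
        // Logger name taken from the warnings quoted above.
        String name = "org.apache.pdfbox.pdfparser.BaseParser";
        WarningCollector collector = attachTo(name);
        // Stand-in for a parse call that would emit the warning.
        Logger.getLogger(name).warning("Bad Dictionary Declaration");
        System.out.println("warnings seen: " + collector.messages().size());
    }
}
```

After a parse, code could check collector.messages() and reject or flag the
file if it is not empty.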

I have traced the problem to the following PDF content at the end of
Page 1 Stream 1.

/Span <</Lang (en-GB)/MCID 8 >>BDC
BT
9 0 0 9 99.3376 555.6879 Tm
(text string here)Tj
ET
EMC
/Span <</Lang
endstream
endobj

The last dictionary entry seems to be incomplete.
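
A rough way to confirm that a stream ends inside a dictionary is to compare
the number of "<<" and ">>" delimiters. The sketch below is a deliberately
naive heuristic: it ignores string literals and comments, and the class and
method names are hypothetical. It is only meant to illustrate the condition
described above, not to replace a real parser.

```java
final class DictBalanceSketch {
    /**
     * Returns the number of "<<" minus the number of ">>" in the text.
     * Naive: delimiters inside string literals or comments are counted too.
     */
    public static int openDictionaries(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            char a = content.charAt(i);
            char b = content.charAt(i + 1);
            if (a == '<' && b == '<') {
                depth++;
                i++; // skip the second character of the delimiter
            } else if (a == '>' && b == '>') {
                depth--;
                i++;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        String tail = "/Span <</Lang (en-GB)/MCID 8 >>BDC\n"
                + "BT\n9 0 0 9 99.3376 555.6879 Tm\n(text string here)Tj\nET\nEMC\n"
                + "/Span <</Lang";
        // A positive result means a dictionary is still open at the end of the stream.
        System.out.println("open dictionaries: " + openDictionaries(tail));
    }
}
```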

When I go on to process the files in my own code (iterating over the
content stream, performing my function and replacing the stream
content), the stream ends up incorrect and the resulting PDFs will not
load in Acrobat Reader (although they do in Chrome).

My options appear to be

(a) grep the file for this and remove or overwrite it with a string
operation before using PDFBox

(b) update the source to cope with this condition

(c) kick the PDF back as invalid - difficult since the file is a
"valid" PDF that is generated in Adobe and reads ok in Adobe

I have verified this by manually overtyping <</Lang with spaces and
then everything works perfectly in my own code and in PDFReader.

Any thoughts?

Best wishes,
Malcolm.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Adobe InDesign PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

Now I got it. Yes, I forgot that PDFDebugger needs to parse individual 
streams when showing streams in the stream window, and then errors can 
happen because streams are incomplete.

How to reproduce the problem:

         PDDocument doc = new PDDocument();
         PDPage page = new PDPage();
         PDPageContentStream cs = new PDPageContentStream(doc, page, AppendMode.APPEND, false);
         OutputStream os = cs.getOutput();
         os.write(("0 g\n"
                 + "/Span <</Lang (en-GB)/MCID 8 >>BDC\n"
                 + "    BT\n"
                 + "    9 0 0 9 99.3376 555.6879 Tm\n"
                 + "    (Some text)Tj\n"
                 + "    ET\n"
                 + "    EMC\n"
                 + "    /Span <</Lang").getBytes());
         os.flush();
         cs.close();
         cs = new PDPageContentStream(doc, page, AppendMode.APPEND, false);
         os = cs.getOutput();
         os.write(("(en-GB)/MCID 9 >>BDC\n"
                 + "    BT\n"
                 + "    9 0 0 9 145.7323 555.6879 Tm\n"
                 + "    (Some more text)Tj\n"
                 + "    ET\n"
                 + "    EMC").getBytes());
         os.flush();
         cs.close();
         doc.addPage(page);
         doc.save(new File("2streams.pdf"));

However, there is no solution; this is by design: to do syntax 
highlighting one needs to parse. It happens in StreamPane.java in the 
debugger subproject.

Tilman



On 16.11.2017 at 11:25, Malcolm Vincent wrote:
> Hi Tilman,
>
> They fail because they parse the stream every time you click on it -
> rather than the whole page.
>
> I haven't checked the code yet to find it, but you can tell because
> every time you click on (an incomplete) stream in PDFReader it throws
> the same exceptions again on the console. From a debugging perspective
> this is a great feature to have and most of the time streams seem to
> be generated as complete entities.
>
> Like I say - now I know you can't parse streams individually in normal
> use I don't have a problem. It was my logic that was at fault.
>
> Best Wishes
> Malcolm.

Re: Adobe InDesign PDF

Posted by Malcolm Vincent <ma...@gmail.com>.

Hi Tilman,

They fail because they parse the stream every time you click on it -
rather than the whole page.

I haven't checked the code yet to find it, but you can tell because
every time you click on (an incomplete) stream in PDFReader it throws
the same exceptions again on the console. From a debugging perspective
this is a great feature to have and most of the time streams seem to
be generated as complete entities.

Like I say - now I know you can't parse streams individually in normal
use I don't have a problem. It was my logic that was at fault.

Best Wishes
Malcolm.






On 10 November 2017 at 16:46, Tilman Hausherr <TH...@t-online.de> wrote:
> Hi,
>
> Yes it is true that page content streams can be split. But
> PDFReader/PDFDebugger should be able to handle that because deep inside,
> PDFStreamParser() is called when a page is rendered. PDFReader/PDFDebugger
> do show the individual streams for debugging purpose but they're not
> rendering them individually. So I'm wondering how it is possible that they
> fail but you succeed.
>
> Re the java warning, I get it too and didn't even bother to fix it. See
> https://stackoverflow.com/questions/5354838/java-java-util-preferences-failing
> https://stackoverflow.com/questions/16428098/groovy-shell-warning-could-not-open-create-prefs-root-node
>
> Tilman

Re: Adobe InDesign PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

Yes it is true that page content streams can be split. But 
PDFReader/PDFDebugger should be able to handle that because deep inside, 
PDFStreamParser() is called when a page is rendered. 
PDFReader/PDFDebugger do show the individual streams for debugging 
purposes but they're not rendering them individually. So I'm wondering 
how it is possible that they fail but you succeed.

Re the Java warning, I get it too and didn't even bother to fix it. See
https://stackoverflow.com/questions/5354838/java-java-util-preferences-failing
https://stackoverflow.com/questions/16428098/groovy-shell-warning-could-not-open-create-prefs-root-node

Tilman


On 10.11.2017 at 10:04, Malcolm Vincent wrote:
> Hi Tilman,
>
> Thanks for replying. I'll see if I can get permission from the client
> to upload the file.
>
> The same behaviour occurs in PDFBox 2.0.x - I am on 2.0.8 currently.
>
> I'm pretty definite now about what's happening.
>
> The "issue" (if it is an issue) is that I was treating the streams the
> same way that PDFReader does, and loading them one at a time.
>
> It appears that this is not a safe thing to do because the streams are
> not fully complete parseable entities in their own right and some
> higher level token constructs - like COSDictionary for example - can
> be split across stream boundaries by the Adobe PDF generators. So
> although atomic tokens like int may generally be ok, more complex
> things are not.
>
> This is why my code that uses PDFbox is throwing the warnings and also
> why it happens with the PDFReader / debug function in the app. Every
> time I click on a stream with a dictionary that is partly in one
> stream and partly in another the parser throws a warning on the
> console.
>
> I am unclear exactly how this fits with the specification - a quick
> "find" has not cleared it up - but I suppose in theory since PDF is a
> binary format the stream could break at any byte and any token could
> be split right in the middle.
>
> Following on from that analysis this appears to be the way to get the
> tokens on a page and process them ... at least it has resolved my
> problem on the PDF files I am currently processing ...
>
>      PDPage page = my_pdf.getPage(i);
>      PDFStreamParser parser = new PDFStreamParser(page);
>      parser.parse();
>      page.setContents(processTokens(parser.getTokens()));
>
> where processTokens() is my worker function.
>
> Of course this assumes that the generator has not broken atomic tokens
> in the middle of the content since the PDFBox doc says streams parsed
> this way are concatenated with a whitespace character between them.
>
> For completeness here is a fragment of one of my PDFs which shows the
> dictionary split across the end of one stream and the start of the
> next ...
>
>      /Span <</Lang (en-GB)/MCID 8 >>BDC
>      BT
>      9 0 0 9 99.3376 555.6879 Tm
>      (Some text)Tj
>      ET
>      EMC
>      /Span <</Lang
>      endstream
>      endobj
>      19 0 obj
>      <<
>      /Length 2852
>      >>
>      stream
>      (en-GB)/MCID 9 >>BDC
>      BT
>      9 0 0 9 145.7323 555.6879 Tm
>      (Some more text)Tj
>      ET
>      EMC
>
>
> Best Wishes,
> Malcolm

Re: Adobe InDesign PDF

Posted by Malcolm Vincent <ma...@gmail.com>.

Hi Tilman,

Thanks for replying. I'll see if I can get permission from the client
to upload the file.

The same behaviour occurs in PDFBox 2.0.x - I am on 2.0.8 currently.

I'm pretty definite now about what's happening.

The "issue" (if it is an issue) is that I was treating the streams the
same way that PDFReader does, and loading them one at a time.

It appears that this is not a safe thing to do because the streams are
not fully complete parseable entities in their own right and some
higher level token constructs - like COSDictionary for example - can
be split across stream boundaries by the Adobe PDF generators. So
although atomic tokens like int may generally be ok, more complex
things are not.

This is why my code that uses PDFBox is throwing the warnings and also
why it happens with the PDFReader / debug function in the app. Every
time I click on a stream with a dictionary that is partly in one
stream and partly in another the parser throws a warning on the
console.

I am unclear exactly how this fits with the specification - a quick
"find" has not cleared it up - but I suppose in theory since PDF is a
binary format the stream could break at any byte and any token could
be split right in the middle.

Following on from that analysis this appears to be the way to get the
tokens on a page and process them ... at least it has resolved my
problem on the PDF files I am currently processing ...

    PDPage page = my_pdf.getPage(i);
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    page.setContents(processTokens(parser.getTokens()));

where processTokens() is my worker function.

Of course this assumes that the generator has not broken atomic tokens
in the middle of the content since the PDFBox doc says streams parsed
this way are concatenated with a whitespace character between them.
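
The caveat above can be illustrated without PDFBox. The sketch below simply
joins stream fragments with a single space, on the assumption (taken from
the description above, not verified against the PDFBox source) that this is
how the pieces are combined. It shows that a dictionary split between tokens
survives the join, while a number split in the middle would become two
tokens. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class StreamJoinSketch {
    /** Joins stream fragments with one space between them. */
    public static String join(List<String> streams) {
        return String.join(" ", streams);
    }

    /** Counts whitespace-separated pieces; a crude stand-in for real tokenizing. */
    public static int countPieces(String content) {
        String trimmed = content.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Split between tokens: the dictionary is whole again after joining.
        System.out.println(join(Arrays.asList("/Span <</Lang", "(en-GB)/MCID 9 >>BDC")));

        // Hypothetical split inside a number: the join inserts a space that
        // turns one operand into two.
        String intact = "9 0 0 9 99.3376 555.6879 Tm";
        String broken = join(Arrays.asList("9 0 0 9 99.33", "76 555.6879 Tm"));
        System.out.println(countPieces(intact) + " pieces vs " + countPieces(broken));
    }
}
```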

For completeness here is a fragment of one of my PDFs which shows the
dictionary split across the end of one stream and the start of the
next ...

    /Span <</Lang (en-GB)/MCID 8 >>BDC
    BT
    9 0 0 9 99.3376 555.6879 Tm
    (Some text)Tj
    ET
    EMC
    /Span <</Lang
    endstream
    endobj
    19 0 obj
    <<
    /Length 2852
    >>
    stream
    (en-GB)/MCID 9 >>BDC
    BT
    9 0 0 9 145.7323 555.6879 Tm
    (Some more text)Tj
    ET
    EMC

Best Wishes,
Malcolm

On 9 November 2017 at 17:58, Tilman Hausherr <TH...@t-online.de> wrote:
> Hi,
>
> What PDFBox version are you using and can you upload the PDF to a
> sharehoster? Splits between tokens shouldn't be a problem.
>
> Tilman
>
> PS: please don't start a new thread like you did today, this is confusing.
> Answer to yourself on the list instead.
>
>

Re: Adobe InDesign PDF

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

Which PDFBox version are you using, and can you upload the PDF to a
file-sharing site? Splits between tokens shouldn't be a problem.
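
To illustrate the point about splits between tokens: the PDF
specification allows a page's /Contents entry to be an array of streams,
which a reader treats as one stream, and the division may fall only
between lexical tokens. A fragment ending in "/Span <</Lang" can
therefore look incomplete on its own yet be fine once the following
stream is appended. The sketch below is plain Java with no PDFBox calls.
The class and method names are hypothetical, the second fragment is
invented for illustration (the reported file's next stream was not
shown), and the bracket check is a deliberate simplification rather than
a real PDF tokenizer.

```java
import java.util.Arrays;
import java.util.List;

public class SplitStreamSketch {

    /** Joins content stream fragments with a newline so that tokens at a boundary stay separate. */
    public static String join(List<String> fragments) {
        return String.join("\n", fragments);
    }

    /**
     * Returns true if every "<<" has a matching ">>".
     * Simplified on purpose: it does not skip string literals or hex strings.
     */
    public static boolean dictionariesClosed(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            String pair = content.substring(i, i + 2);
            if (pair.equals("<<")) {
                depth++;
                i++; // consume both characters of the delimiter
            } else if (pair.equals(">>")) {
                depth--;
                i++;
                if (depth < 0) {
                    return false; // closing delimiter without an opening one
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        // Tail of the first stream, as quoted in the report.
        String first = "EMC\n/Span <</Lang";
        // Invented continuation: an assumption about what a following stream might hold.
        String second = "(en-GB)/MCID 9 >>BDC\nEMC";

        System.out.println(dictionariesClosed(first));                               // false
        System.out.println(dictionariesClosed(join(Arrays.asList(first, second)))); // true
    }
}
```

If the page in question has several content streams, parsing them one at
a time would trip over the open dictionary, whereas parsing the
concatenation would not.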

Tilman

PS: Please don't start a new thread like you did today; it is
confusing. Reply to your own message on the list instead.
