Posted to dev@pdfbox.apache.org by "Thomas Chojecki (JIRA)" <ji...@apache.org> on 2011/06/20 18:16:47 UTC

[jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Wrong XRefStream order while parsing incremental updated PDF with XRefStreams
-----------------------------------------------------------------------------

                 Key: PDFBOX-1042
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1042
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.5.0
            Reporter: Thomas Chojecki
            Priority: Critical


A PDF can contain two types of XRef entries.
Most files use XRefTables for object references.

Web-optimized (linearized) PDF documents often use XRefStreams instead. An XRefStream is a compressed XRefTable stored as a stream object. The PDFParser parses these objects the same way as other objects and puts them into an object pool (HashMap). If the document was incrementally updated, it contains several XRefStreams, and all of them are put into the object pool.

The XRefStreamParser then parses the XRefStreams by fetching all XRefStream objects from that pool. The objects returned from the pool are not in the order in which they were read. As a result, in some cases an older object overwrites a newer one, so PDFBox can't find the right objects and uses the older ones instead.

If a user parses such a document, they get an indeterminate state: older and newer objects are mixed.

In my case, the document catalog was overwritten by an old one, and I can't see the changes that were made with the incremental update.

A patch and a sample pdf will come soon.
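
The order dependence described in this report can be reproduced in a few lines. The following is only a minimal sketch, not PDFBox code: the class XrefOrderSketch and its methods are hypothetical, each xref section is reduced to a map from object number to byte offset, and the offsets 1234 and 2345 are illustrative values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the reported symptom; not PDFBox code. */
final class XrefOrderSketch {

    /** Applies xref sections in the given order; later sections overwrite earlier ones. */
    static Map<Integer, Long> merge(List<Map<Integer, Long>> sections) {
        Map<Integer, Long> offsets = new HashMap<>();
        for (Map<Integer, Long> section : sections) {
            offsets.putAll(section);
        }
        return offsets;
    }

    /** Builds a one-entry section: object number -> byte offset. */
    static Map<Integer, Long> section(int objectNumber, long offset) {
        Map<Integer, Long> s = new HashMap<>();
        s.put(objectNumber, offset);
        return s;
    }

    public static void main(String[] args) {
        Map<Integer, Long> original = section(1, 1234L); // catalog in the original file
        Map<Integer, Long> update = section(1, 2345L);   // catalog rewritten by the update

        // Oldest first: the update wins, as intended. Prints 2345.
        System.out.println(merge(Arrays.asList(original, update)).get(1));
        // Newest first: the stale offset wins, which is the reported symptom. Prints 1234.
        System.out.println(merge(Arrays.asList(update, original)).get(1));
    }
}
```

The merge itself is the same in both calls; only the order of the input list decides which catalog ends up visible. A pool that hands the sections back in arbitrary order can therefore produce either result.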

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by Timo Boehme <ti...@ontochem.com>.
In PDFBOX-1016 I've submitted a patch which implements the procedure you 
described and which makes it possible to parse PDF documents with 
updates (not parsable by current trunk because of redefined objects or 
replaced xrefs). It first collects all XREFs (tables and objects) with their 
byte positions and then starts from the one referred to by the last startxref, 
following the prev pointers.
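
As a rough illustration of that idea (this is not the PDFBOX-1016 patch itself), the sketch below keeps every xref section found in the file keyed by its byte position and then walks the chain from the last startxref through the prev pointers. The names XrefChainSketch, Section and resolve are hypothetical, and all offsets are invented for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of chain resolution; not the actual patch. */
final class XrefChainSketch {

    /** One cross-reference section (table or stream). */
    static final class Section {
        final long prev; // byte position of the previous section, or -1 if none
        final Map<Integer, Long> entries = new HashMap<>(); // object number -> byte offset

        Section(long prev) {
            this.prev = prev;
        }

        Section entry(int objectNumber, long offset) {
            entries.put(objectNumber, offset);
            return this;
        }
    }

    /**
     * Walks from the section named by the last startxref back through prev,
     * keeping the first (newest) offset seen for each object number.
     */
    static Map<Integer, Long> resolve(Map<Long, Section> sectionsByOffset, long startXref) {
        Map<Integer, Long> active = new HashMap<>();
        Set<Long> visited = new HashSet<>();
        long pos = startXref;
        while (pos >= 0 && visited.add(pos)) { // visited guards against prev loops
            Section s = sectionsByOffset.get(pos);
            if (s == null) {
                break; // dangling prev pointer: stop here
            }
            for (Map.Entry<Integer, Long> e : s.entries.entrySet()) {
                active.putIfAbsent(e.getKey(), e.getValue());
            }
            pos = s.prev;
        }
        return active;
    }

    public static void main(String[] args) {
        Map<Long, Section> found = new HashMap<>();
        found.put(500L, new Section(-1).entry(1, 15L).entry(2, 120L)); // original section
        found.put(900L, new Section(500L).entry(1, 610L));             // incremental update
        found.put(700L, new Section(-1).entry(2, 650L));               // not on the chain
        System.out.println(resolve(found, 900L));
    }
}
```

Because the walk goes from newest to oldest, putIfAbsent keeps the newest offset for each object. Sections that are never reached from startxref, such as the one at position 700 here, are ignored regardless of where they sit in the file.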


Thanks
Timo

On 21.06.2011 03:13, Adam@swmc.com wrote:


-- 

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780472
  F: +49 345 4780471
  timo.boehme@ontochem.com

_____________________________________________________________________

  OntoChem GmbH
  Geschäftsführer: Dr. Lutz Weber
  Sitz: Halle / Saale
  Registergericht: Stendal
  Registernummer: HRB 215461
_____________________________________________________________________


Re: [jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by Ad...@swmc.com.
I was almost sure that there cannot be multiple objects with the exact 
same object and revision number, for if this were allowed, how would you 
know which one to use?  But sure enough, section 7.5.6 clearly states that 
it's allowed (not sure how I missed that before or where I thought I saw 
that they had to be unique).

I understand how marking them as deleted works, as well as not marking them 
as deleted but just abandoning the object in favor of a new object. 
However, I want to make sure I understand why the original object would not 
be visible.  First, let me quote the spec, then explain how I imagine a 
conforming reader would work, just to make sure we're all on the same 
page.

Section 7.5.6
"The update’s cross-reference section shall include a byte offset to this 
new copy of the object, overriding the old byte offset contained in the 
original cross-reference section. When a conforming reader reads the file, 
it shall build its cross-reference information in such a way that the most 
recent copy of each object shall be the one accessed from the file."

So we start at the end of the file and get the trailer.  If it has a link 
to the previous trailer (/Prev), we go there (recurse until we're at the 
first trailer section).  Parse the xref section which the first trailer 
references, then return and parse the next one and so forth.  If the xref 
table contains a byte offset for an object/generation we already saw, we 
overwrite the old byte offset with the new one.  For example, if the 
original xref says object 1 0 was at offset 1234, and the next xref says 
it's at 2345 we overwrite the 1234 with 2345.
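
The procedure in the previous paragraph could be sketched as follows. This is a hedged illustration only: RecursiveXrefSketch and Revision are hypothetical stand-ins for a parsed trailer plus its xref section, and the entry for object 2 0 is added just to show that untouched objects survive.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "recurse to the oldest trailer, then overwrite". */
final class RecursiveXrefSketch {

    /** Stand-in for one trailer and the xref section it references. */
    static final class Revision {
        final Revision prev; // revision reached through /Prev, or null for the first one
        final Map<String, Long> offsets = new HashMap<>(); // "object generation" -> offset

        Revision(Revision prev) {
            this.prev = prev;
        }
    }

    /** Recurses to the oldest revision first, then overwrites on the way back. */
    static void load(Revision revision, Map<String, Long> table) {
        if (revision == null) {
            return;
        }
        load(revision.prev, table);     // older revisions are applied first
        table.putAll(revision.offsets); // newer offsets replace older ones
    }

    public static void main(String[] args) {
        Revision original = new Revision(null);
        original.offsets.put("1 0", 1234L);
        original.offsets.put("2 0", 1500L);

        Revision update = new Revision(original);
        update.offsets.put("1 0", 2345L);

        Map<String, Long> table = new HashMap<>();
        load(update, table); // start from the trailer at the end of the file
        System.out.println(table.get("1 0")); // 2345
        System.out.println(table.get("2 0")); // 1500
    }
}
```

A real parser would probably also want to guard against /Prev chains that loop back on themselves; the sketch leaves that out to keep the recursion readable.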

Sorry if I'm a little slow, but I just want to make sure I have a firm 
grasp on how it is supposed to work (per the spec).  This will help me 
work on PDFBOX-1000 to make sure it's implemented properly.  My goal is to 
be able to have the conforming parser in such good shape that people can 
use it to determine if a PDF conforms to the spec or not, and if not 
exactly why it doesn't, what part of the spec is violated, etc.  This 
would be useful for us, as well as developers of other PDF libraries. It'd 
be awesome to have other PDF libs run automated tests that pass their 
output through our conforming parser to make sure that they're doing 
everything properly, look for regression bugs, etc.  Yes, I realize 
that the PDF spec is 756 pages, which makes this a lofty goal, but it's my 
goal nonetheless.  Now if only I had more time to work on this...

---- 
Thanks,
Adam





From:
Thomas Chojecki <in...@rayman2200.de>
To:
dev@pdfbox.apache.org
Date:
06/20/2011 16:09
Subject:
Re: [jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing 
incremental updated PDF with XRefStreams




Re: [jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by Thomas Chojecki <in...@rayman2200.de>.
On 20.06.2011 19:00, Adam@swmc.com wrote:
> What is the proposed solution for this?  According to the PDF spec, there
> will never be two objects with the same object number and revision.
> However, this is the real world, not a world of conforming PDF documents,
> so I completely understand that this does occur.  My question is mainly:
> how do you plan on telling which object is the "right" one, and which one
> should be overwritten?

First of all, with incremental updates you can overwrite existing 
objects. The revision only increases if the previous object was marked as 
deleted. This is a little strange. In the normal case the revision will never 
be incremented. A good parser can mark older, unused objects as garbage.

Example:
You have a page with some pictures. Now you want to remove some of the 
images and add some text to the rest of it. To update 
this page incrementally, you need to get the page object and change it 
(remove some images and add some text). Once you write the new page object 
and update the xref table, the old page object isn't visible any more. 
The images are still in the document and keep their object numbers reserved.

The next PDF writer doesn't know that these objects aren't used any more. 
If you add another page via an incremental update, you use new objects with 
new object numbers. Each PDF has a limited number of object numbers (65k), 
and you can update existing objects but not replace them with others.

So if you want to save some object numbers, you can mark unused 
objects as deleted. A good PDF writer then sees these objects and can reuse the 
same number with an incremented revision.

This is the theory; in the real world no one does this :) 65k objects are 
enough, and if someone plans to use up that many objects, they can use xref 
streams instead. ;)
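
To make the "deleted" marker and the revision (generation) number concrete, here is a small sketch that reads one entry of a classic xref table. In ISO 32000-1 such an entry consists of a 10-digit number, a 5-digit generation number and the keyword n (in use) or f (free); the spec states the 65,535 ceiling for the generation number field. The class XrefEntrySketch is hypothetical and not part of PDFBox.

```java
/** Hypothetical sketch of a classic xref table entry; not PDFBox code. */
final class XrefEntrySketch {

    final long offset;    // byte offset if in use, next free object number if free
    final int generation; // generation ("revision") number
    final boolean free;   // true for entries marked with the keyword f

    XrefEntrySketch(long offset, int generation, boolean free) {
        this.offset = offset;
        this.generation = generation;
        this.free = free;
    }

    /** Parses one entry such as "0000001234 00000 n". */
    static XrefEntrySketch parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3 || !(parts[2].equals("n") || parts[2].equals("f"))) {
            throw new IllegalArgumentException("not an xref entry: " + line);
        }
        return new XrefEntrySketch(
                Long.parseLong(parts[0]),
                Integer.parseInt(parts[1]),
                parts[2].equals("f"));
    }

    public static void main(String[] args) {
        XrefEntrySketch inUse = parse("0000001234 00000 n");
        XrefEntrySketch freed = parse("0000000000 00001 f");

        // In use: offset 1234, generation 0.
        System.out.println(inUse.offset + " " + inUse.generation + " " + inUse.free);
        // Free: a writer reusing this object number would use generation 1.
        System.out.println(freed.offset + " " + freed.generation + " " + freed.free);
    }
}
```

An object that is simply overwritten by an incremental update keeps its generation number; only an entry that was marked free carries the incremented generation for the next reuse.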


-------

The normal way to overwrite an object is to use the same object number and the 
same revision. To point to the new object, only the xref table needs to be 
updated. So it's very important to parse the tables from the oldest to the newest one.

The new one should set the Prev entry to the offset of the older one. 
So a parser reads the newest xref table, checks the Prev value and jumps 
to that one. It's a kind of recursive parsing of the tables. The first is 
the newest, and the last table in the chain is the oldest. A conforming 
parser should apply the oldest first.
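
Reading "the newest xref table" first means finding the last startxref keyword near the end of the file, since the number that follows it is the byte offset of the newest xref section. A minimal sketch, assuming the relevant part of the file is already in memory (a real parser would more likely read only the tail of the file); StartXrefSketch and findStartXref are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch for locating the newest xref section; not PDFBox code. */
final class StartXrefSketch {

    /** Returns the offset that follows the last "startxref" keyword, or -1 if none is found. */
    static long findStartXref(byte[] file) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes stay byte-accurate.
        String text = new String(file, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++; // skip the end-of-line after the keyword
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++; // consume the decimal offset
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    public static void main(String[] args) {
        byte[] tail = "trailer\n<< /Size 3 /Prev 500 >>\nstartxref\n900\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartXref(tail)); // 900
    }
}
```

From that offset the parser can read the section, look up its Prev value and continue backwards until no Prev entry is left.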

But if you parse the xref tables from the beginning of the file to the 
end of the file, you also read the tables in the right order.

If you need to handle xref streams, the easy way, from the beginning to the 
end of the file, doesn't work for most documents. Documents that use 
xref streams are mostly called "weboptimized" or "linearized".

Weboptimized files are optimized for reading the document from beginning 
to end, which matters for big documents. If you try to load a non-optimized 
PDF, the application needs to read the whole stream before it can handle 
the file. So even if you only need the first page of the PDF, you need to 
parse the whole PDF. For the slow web this isn't a good way.

Weboptimized documents contain all needed information right at the front of 
the file, so the reader can read a small amount of data and know how 
many pages the document has and where the first page is, without loading the 
rest of the file. Try to load the PDF specification from the web: you 
will see the first page and a progress bar (depending on the speed of 
your connection) while the rest of the document loads.

For example:
A weboptimized document can contain several xref streams. The first one 
contains only a few objects, the last one the rest of them.


The patch I tried yesterday sorts the xref streams from the smallest prev 
offset to the biggest. But in my example file the xref streams are mixed, 
so the prev offset doesn't help. The best way is to jump directly 
to the offsets and read the streams recursively. Maybe I've got an idea.

Sorry for the long text; I tried to explain it for every user who has a 
basic knowledge of the PDF structure.


Best regards
Thomas

Re: [jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by Ad...@swmc.com.
What is the proposed solution for this?  According to the PDF spec, there 
will never be two objects with the same object number and revision. 
However, this is the real world, not a world of conforming PDF documents, 
so I completely understand that this does occur.  My question is mainly: 
how do you plan on telling which object is the "right" one, and which one 
should be overwritten?

---- 
Thanks,
Adam





From:
"Thomas Chojecki (JIRA)" <ji...@apache.org>
To:
dev@pdfbox.apache.org
Date:
06/20/2011 09:17
Subject:
[jira] [Created] (PDFBOX-1042) Wrong XRefStream order while parsing 
incremental updated PDF with XRefStreams




[jira] [Commented] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by "Timo Boehme (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052370#comment-13052370 ] 

Timo Boehme commented on PDFBOX-1042:
-------------------------------------

I've already submitted a patch for this in PDFBOX-1016. It collects all XREFs (tables and objects) and resolves the active ones by using the startxref and prev information. It only has to be applied to current trunk. Andreas Lehmkühler has already started to take a look at it.


       

[jira] [Commented] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052294#comment-13052294 ] 

Adam Nichols commented on PDFBOX-1042:
--------------------------------------

For reference: section 7.5.6 is what specifies that "a file may have several copies of an object with the same object identifier (object number and generation number)."  This section also explains how to deal with this situation.


        

[jira] [Closed] (PDFBOX-1042) Wrong XRefStream order while parsing incremental updated PDF with XRefStreams

Posted by "Thomas Chojecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Chojecki closed PDFBOX-1042.
-----------------------------------

    Resolution: Duplicate

Already fixed in PDFBOX-1016
