Posted to fop-dev@xmlgraphics.apache.org by Michael Rubin <mr...@thunderhead.com> on 2011/05/25 09:45:40 UTC

Retrieving Objects question

Hello there. In the PDFPages class the kids are stored as reference
strings (e.g. "23 0 R"). Each of these references points to a PDFPage
object. Do you know if there is a method somewhere that lets me retrieve
the PDF Java object from its reference string?

(I am aiming to add support for some of those kids being other PDFPages
nodes to create a more balanced page tree.)

Thanks.

-Mike





Michael Rubin
Developer

T: +44 20 8238 7400
F: +44 20 8238 7401

mrubin@Thunderhead.com

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us 
immediately and then destroy it.

Re: Retrieving Objects question

Posted by "Andreas L. Delmelle" <an...@telenet.be>.

On 08 Jun 2011, at 21:14, Andreas L. Delmelle wrote:

> <snip />
>> <snip />
>> My current questions are:
>> 
>> -Why are the page objects flushed straight away? (Memory constraints?)
> 
> Very likely to save memory indeed. More with the intention of just flushing "as soon as possible", to support full streaming processing if the document structure allows it. Theoretically, in a document consisting of single-page fo:page-sequences, without any cross-references, you should see relatively low memory usage even if the document is 10000+ pages, precisely because the pages are all written to the output immediately, long before the root page tree, which only retains their object references.
Just felt this needed clarification: the "object references" at the end of that paragraph are *PDF* object references (which, in Java, are merely Strings, not references to the PDFPage objects).

Re: Retrieving Objects question

Posted by "Andreas L. Delmelle" <an...@telenet.be>.

On 09 Jun 2011, at 09:49, Michael Rubin wrote:

Hi Mike

> Thanks a lot for your reply Andreas. Yes if all I had to do was move references around then my work would already be complete and submitted for review. However, the catch is that the Page objects also have Parent references which also need to be updated when they get moved from one page tree node to another. But since they have been written out already this cannot be done. So the pages effectively become immovable (or else the parent references will not match the kids references as they will be out of date - which was why acroread could not open the pages).

Aah, OK, now I understand that catch better. Thanks for clarifying! 
In the meantime, I'll give it some more thought too, and if I find anything useful to add, I'll follow up here.


Regards

Andreas
---

Re: Retrieving Objects question

Posted by Michael Rubin <mr...@thunderhead.com>.

Thanks a lot for your reply, Andreas. Yes, if all I had to do was move references around, then my work would already be complete and submitted for review. However, the catch is that the Page objects also have Parent references, which need to be updated when a page is moved from one page tree node to another. Since the pages have already been written out, this cannot be done. So the pages effectively become immovable (otherwise the Parent references would be out of date and no longer match the Kids references, which is why acroread could not open the pages).

Delaying writing the page objects would mean the parent references can be updated correctly, and the problem would be solved. But, that has a potential memory usage toll.
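
To illustrate the constraint, here is a minimal sketch of a page whose dictionary is written exactly once. It is not FOP code: `StreamedPage`, `writeTo` and `setParent` are hypothetical names, and the dictionary output is heavily simplified. It only shows that a /Parent entry can be changed before the page is flushed but not afterwards.

```java
/** Hypothetical model of a page object whose dictionary is written to the output exactly once. */
final class StreamedPage {
    private final String ref;
    private String parentRef;
    private boolean written;

    StreamedPage(String ref, String parentRef) {
        this.ref = ref;
        this.parentRef = parentRef;
    }

    /** Serialises a simplified page dictionary; after this the /Parent entry is fixed in the output. */
    void writeTo(StringBuilder out) {
        out.append("<< /Type /Page /Parent ").append(parentRef).append(" >>\n");
        written = true;
    }

    /** Moving the page to another node is only possible before it has been written. */
    void setParent(String newParentRef) {
        if (written) {
            throw new IllegalStateException(ref + " already written with /Parent " + parentRef);
        }
        parentRef = newParentRef;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        StreamedPage page = new StreamedPage("23 0 R", "1 0 R");
        page.writeTo(out);           // flushed as soon as the page is finished
        try {
            page.setParent("5 0 R"); // attempted re-parenting while balancing the tree
        } catch (IllegalStateException e) {
            System.out.println("cannot move page: " + e.getMessage());
        }
    }
}
```

Delaying the `writeTo` call until the tree is final would avoid the exception, at the cost of keeping every page in memory until then.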

Today I will continue with my attempt to link every page to a node of its own (stored in a flat list), then re-order the nodes according to the page index of the page inside. Then build up the balanced page tree from those nodes up. That's the plan anyway... (I'll also be interested time permitting in looking more closely at what happened when the 2 page sequences ended up with mixed up pages...)
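
The plan above could be sketched roughly as follows: wrap each page in a node of its own, sort the flat list by page index, then group ten nodes at a time, level by level, until a single root remains. All names here (`BalancedPageTree`, `Node`, `build`, `MAX_KIDS`) are hypothetical, and the sketch ignores PDF object writing entirely; it only shows the shape of the resulting tree.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BalancedPageTree {
    static final int MAX_KIDS = 10; // hard-coded fan-out discussed in this thread

    /** A tree node: either a leaf wrapping one page, or an inner node with up to MAX_KIDS kids. */
    static final class Node {
        final int firstPageIndex;          // lowest page index below this node
        final int count;                   // number of pages below this node
        final List<Node> kids = new ArrayList<>();

        Node(int pageIndex) {              // leaf node for a single page
            this.firstPageIndex = pageIndex;
            this.count = 1;
        }

        Node(List<Node> children) {        // inner node over consecutive children
            this.kids.addAll(children);
            this.firstPageIndex = children.get(0).firstPageIndex;
            int total = 0;
            for (Node child : children) total += child.count;
            this.count = total;
        }
    }

    /** Builds the tree bottom-up from page indexes that may arrive out of order. */
    static Node build(List<Integer> pageIndexes) {
        if (pageIndexes.isEmpty()) throw new IllegalArgumentException("no pages");
        List<Node> level = new ArrayList<>();
        for (int index : pageIndexes) level.add(new Node(index));
        // Restore the natural page order before grouping.
        level.sort(Comparator.comparingInt((Node n) -> n.firstPageIndex));
        do {
            List<Node> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += MAX_KIDS) {
                int end = Math.min(i + MAX_KIDS, level.size());
                parents.add(new Node(level.subList(i, end)));
            }
            level = parents;
        } while (level.size() > 1);
        return level.get(0);
    }

    public static void main(String[] args) {
        List<Integer> arrival = new ArrayList<>();
        for (int i = 0; i < 101; i++) arrival.add(i);
        // Simulate one page arriving early, similar to the mixed-up ordering mentioned above.
        arrival.remove(Integer.valueOf(100));
        arrival.add(9, 100);
        Node root = build(arrival);
        System.out.println(root.count + " pages, " + root.kids.size() + " kids under the root");
    }
}
```

Because every page gets its own leaf node, this produces more nodes than strictly needed, which is the trade-off of this approach.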

Thanks!

-Mike


On 08/06/11 20:14, Andreas L. Delmelle wrote:
> On 08 Jun 2011, at 17:15, Michael Rubin wrote:
>
> Hi Mike
>
>> Hello there. Thought I'd post an update. Admittedly I feel like I've found a bit of a catch 22 situation. I successfully completed my code to generate the balanced page tree on the fly and it works fine with a single page sequence. However, this morning I discovered that this code does not appear to work for multiple page sequences in a flow. (2x 101 page sequences, I got pages 1-9, 102, 10-101 then 103-end in that order...) I guess this is where pages can come in in a different order anyway then, and why the current indexing / nulls system is there.
> Ouch! I had not considered that to be the purpose. Without looking closer, I would say something like: page 10 contains a forward reference to page 102, and all pages in between are only flushed after the reference can be resolved (?)
>
>> (And shows that I am still learning the ropes as I go along...)
> Yep, and also shows that I am not intimately familiar with *all* of the codebase myself. ;-)
>
>> So I re-examined trying to generate the page tree after the pages have been added into one big flat list. I can do this by, in PDFDocument.outputTrailer(), calling a method to balance the page tree before all the remaining objects are written out. This way pages can be attached to nodes, and the tree hierarchy built up to the root node. This is on paper a more elegant, efficient and easier solution to doing it on the fly. But I ran into the same problem again - the page objects are already written out.
> OK, here may be a gap in my understanding of it so far, but...
> Do you really _need_ the PDFPage object for some reason, or does its PDF reference suffice to build the page tree?
>  From what I know of PDF, that page tree would only contain the references to the actual page objects, no? As long as the PDFPages object is not written to the stream, you should be able to shuffle and play with the references all you want. All you need to keep track of, is to retain the natural order (= the page's index), as the object numbers will not necessarily reflect that.
> Unless I am mistaken about this, I do not see a compelling reason *not* to write the PDFPage object to the stream as soon as it's finished. We keep a mapping of reference-to-index alive in the 'main' (temporary?) PDFPages object.
> Note that notifyKidRegistered() only stores the reference; the natural index is translated into the position of the reference in the list. If you want to re-shape that into a structured tree/map, then by all means...
>
> Perhaps there is still a catch --sounds too simple somehow... :-/
>
>> <snip />
>> My current questions are:
>>
>> -Why are the page objects flushed straight away? (Memory constraints?)
> Very likely to save memory indeed. More with the intention of just flushing "as soon as possible", to support full streaming processing if the document structure allows it. Theoretically, in a document consisting of single-page fo:page-sequences, without any cross-references, you should see relatively low memory usage even if the document is 10000+ pages, precisely because the pages are all written to the output immediately, long before the root page tree, which only retains their object references.
>
>> -Is it safe and wise to delay flushing the page objects until the end?
> Safe? No issue here.
> Wise? That would obviously depend on the context.
> In documents with 1000s of pages, I can imagine we do not want to keep all of those pages in memory any longer than strictly necessary... I wouldn't mind too much if it were an option that users could switch on/off. However, if the process is hard coded as the *only* way FOP will render PDFs, such that it would affect *all* users, I am not so sure it is wise to do this.
>
> <snip />
>
>
> Regards
>
> Andreas
> ---
>






Re: Retrieving Objects question

Posted by "Andreas L. Delmelle" <an...@telenet.be>.

On 08 Jun 2011, at 17:15, Michael Rubin wrote:

Hi Mike

> Hello there. Thought I'd post an update. Admittedly I feel like I've found a bit of a catch 22 situation. I successfully completed my code to generate the balanced page tree on the fly and it works fine with a single page sequence. However, this morning I discovered that this code does not appear to work for multiple page sequences in a flow. (2x 101 page sequences, I got pages 1-9, 102, 10-101 then 103-end in that order...) I guess this is where pages can come in in a different order anyway then, and why the current indexing / nulls system is there.

Ouch! I had not considered that to be the purpose. Without looking closer, I would say something like: page 10 contains a forward reference to page 102, and all pages in between are only flushed after the reference can be resolved (?)

> (And shows that I am still learning the ropes as I go along...)

Yep, and also shows that I am not intimately familiar with *all* of the codebase myself. ;-)

> So I re-examined trying to generate the page tree after the pages have been added into one big flat list. I can do this by, in PDFDocument.outputTrailer(), calling a method to balance the page tree before all the remaining objects are written out. This way pages can be attached to nodes, and the tree hierarchy built up to the root node. This is on paper a more elegant, efficient and easier solution to doing it on the fly. But I ran into the same problem again - the page objects are already written out.

OK, here may be a gap in my understanding of it so far, but...
Do you really _need_ the PDFPage object for some reason, or does its PDF reference suffice to build the page tree?
From what I know of PDF, that page tree would only contain the references to the actual page objects, no? As long as the PDFPages object is not written to the stream, you should be able to shuffle and play with the references all you want. All you need to keep track of, is to retain the natural order (= the page's index), as the object numbers will not necessarily reflect that.
Unless I am mistaken about this, I do not see a compelling reason *not* to write the PDFPage object to the stream as soon as it's finished. We keep a mapping of reference-to-index alive in the 'main' (temporary?) PDFPages object.
Note that notifyKidRegistered() only stores the reference; the natural index is translated into the position of the reference in the list. If you want to re-shape that into a structured tree/map, then by all means...

Perhaps there is still a catch --sounds too simple somehow... :-/

> <snip />
> My current questions are:
> 
> -Why are the page objects flushed straight away? (Memory constraints?)

Very likely to save memory indeed. More with the intention of just flushing "as soon as possible", to support full streaming processing if the document structure allows it. Theoretically, in a document consisting of single-page fo:page-sequences, without any cross-references, you should see relatively low memory usage even if the document is 10000+ pages, precisely because the pages are all written to the output immediately, long before the root page tree, which only retains their object references. 

> -Is it safe and wise to delay flushing the page objects until the end?

Safe? No issue here.
Wise? That would obviously depend on the context. 
In documents with 1000s of pages, I can imagine we do not want to keep all of those pages in memory any longer than strictly necessary... I wouldn't mind too much if it were an option that users could switch on/off. However, if the process is hard coded as the *only* way FOP will render PDFs, such that it would affect *all* users, I am not so sure it is wise to do this.

<snip />

Regards

Andreas
---

Re: Retrieving Objects question

Posted by Michael Rubin <mr...@thunderhead.com>.

Hello there. Thought I'd post an update. Admittedly I feel like I've 
found a bit of a catch 22 situation. I successfully completed my code to 
generate the balanced page tree on the fly and it works fine with a 
single page sequence. However, this morning I discovered that this code 
does not appear to work for multiple page sequences in a flow. (2x 101 
page sequences, I got pages 1-9, 102, 10-101 then 103-end in that 
order...) I guess this is where pages can come in in a different order 
anyway then, and why the current indexing / nulls system is there. (And 
shows that I am still learning the ropes as I go along...)

So I re-examined trying to generate the page tree after the pages have 
been added into one big flat list. I can do this by, in 
PDFDocument.outputTrailer(), calling a method to balance the page tree 
before all the remaining objects are written out. This way pages can be 
attached to nodes, and the tree hierarchy built up to the root node. 
This is on paper a more elegant, efficient and easier solution to doing 
it on the fly. But I ran into the same problem again - the page objects 
are already written out.

Looking at the code I see the pages get written out / flushed as soon as 
they are created. One page gets written out before the next page is 
started. So moving pages from one node to another is impossible without 
breaking the PDF. The only way round this currently is to assign pages 
to nodes as they get created, but then this breaks the ordering system 
in the notifyKidRegistered() method which needs a flat list. Hence the
catch 22.

My current questions are:

-Why are the page objects flushed straight away? (Memory constraints?)
-Is it safe and wise to delay flushing the page objects until the end?
-If so then how do (or should) I do this? (Can I flush the page contents 
but not the page object itself to minimise memory usage?)
-If not then how can I fix pages into individual nodes at creation time 
without breaking it for multiple page sequences?

PDFDocumentHandler.endPage() is where 'flushPDFDoc()' is called as part 
of the page generation process. The next page isn't added until after 
this point.

The only workaround I can think of at the moment, having spoken to my 
colleagues, is to add pages to their own individual page tree nodes, 
then sort and arrange the nodes into a balanced tree. However this is 
less than ideal with twice as many nodes as needed. (Although my manager 
seems happy to go with this.) I haven't yet finished testing this 
permutation (still debugging) but happy to ditch it if I can work out 
how to delay writing out the page objects until I have re-arranged them 
as in the 2nd paragraph. (It would be nice to maintain potential support 
for out of order pages after all...)

Thanks a lot for your time!

-Mike

On 06/06/11 19:48, Andreas L. Delmelle wrote:
> On 06 Jun 2011, at 10:59, Michael Rubin wrote:
>
> Hi Mike
>
>> Thanks for your reply Andreas.
>>
>> Currently it is hardcoded to 10 nodes or leaves, but adding an xconf setting perhaps should be pretty easy and quick to do. However, having spoken to my manager, there isn't the business requirement currently to make it configurable, and given the current large array of options already available, the preference is to just keep it hardcoded for now. At the very least I'll make sure the maximum leaves / subnodes value is stored in a constant so if it is made configurable then only the constant needs to be paid attention to rather than multiple locations in the class.
> OK, sounds good. I must admit, I was playing devil's advocate here, and did not see any immediate reason to be able to change it either, but you can probably bet your life that _someone_ is going to come up with this requirement as soon as the feature is discovered... :-)
>
> <snip />
>> ... So for a 10,000 page doc there are going to be a lot of nulls in the page tree. For now setting the toPDFString() to ignore the nulls rather than throw an exception gets round this and allows the document to be correctly generated. In my tests all the pages are produced in the correct order. I was wondering though if there are any cases where the pages might not be passed in in the correct order (and hence might possibly explain why the notifyKidRegistered() method was written in the way it is), and if so if that has any implications on the way I have written the balanced page tree code updates.
> I think the original idea was that PDF would, in the long run, also be able to do out-of-order rendering (i.e. if page N in a document would be completely resolved, and thus could be rendered, before page N-1 --in that case, the null reference would be needed as a placeholder for the not-yet-finished page).
> At any rate, AFAIR, this was never actually implemented for PDF, so that explains why you see all pages in the correct order every time.
>
> If it is cleaner to alter notifyKidRegistered() and avoid those nulls from being inserted in the first place, I would prefer that over just skipping them in toPDFString(). Not a must, though...
>
>
>
> Regards
>
> Andreas
> ---


Re: Retrieving Objects question

Posted by "Andreas L. Delmelle" <an...@telenet.be>.

On 06 Jun 2011, at 10:59, Michael Rubin wrote:

Hi Mike

> Thanks for your reply Andreas.
> 
> Currently it is hardcoded to 10 nodes or leaves, but adding an xconf setting perhaps should be pretty easy and quick to do. However, having spoken to my manager, there isn't the business requirement currently to make it configurable, and given the current large array of options already available, the preference is to just keep it hardcoded for now. At the very least I'll make sure the maximum leaves / subnodes value is stored in a constant so if it is made configurable then only the constant needs to be paid attention to rather than multiple locations in the class.

OK, sounds good. I must admit, I was playing devil's advocate here, and did not see any immediate reason to be able to change it either, but you can probably bet your life that _someone_ is going to come up with this requirement as soon as the feature is discovered... :-)

<snip />
> ... So for a 10,000 page doc there are going to be a lot of nulls in the page tree. For now setting the toPDFString() to ignore the nulls rather than throw an exception gets round this and allows the document to be correctly generated. In my tests all the pages are produced in the correct order. I was wondering though if there are any cases where the pages might not be passed in in the correct order (and hence might possibly explain why the notifyKidRegistered() method was written in the way it is), and if so if that has any implications on the way I have written the balanced page tree code updates.

I think the original idea was that PDF would, in the long run, also be able to do out-of-order rendering (i.e. if page N in a document would be completely resolved, and thus could be rendered, before page N-1 --in that case, the null reference would be needed as a placeholder for the not-yet-finished page).
At any rate, AFAIR, this was never actually implemented for PDF, so that explains why you see all pages in the correct order every time.

If it is cleaner to alter notifyKidRegistered() and avoid those nulls from being inserted in the first place, I would prefer that over just skipping them in toPDFString(). Not a must, though...

Regards

Andreas
---

Re: Retrieving Objects question

Posted by Michael Rubin <mr...@thunderhead.com>.

Thanks for your reply Andreas.

Currently it is hardcoded to 10 nodes or leaves, but adding an xconf 
setting perhaps should be pretty easy and quick to do. However, having 
spoken to my manager, there isn't the business requirement currently to 
make it configurable, and given the current large array of options 
already available, the preference is to just keep it hardcoded for now. 
At the very least I'll make sure the maximum leaves / subnodes value is 
stored in a constant so if it is made configurable then only the 
constant needs to be paid attention to rather than multiple locations in 
the class.

As far as I can tell the page objects are kept alive anyway by the 
references in the document object itself (at least until the trailer is 
written). So my keeping references in the page tree object should not 
extend their life in any way.

Currently, if I take a 20 page document, then there are two sets of 10 
pages, one in each node, each node being children of the root node. For 
the first 10 pages the kids list is something like {1 0 R, 2 0 R, 3 0 R, 
4 0 R, 5 0 R, 6 0 R, 7 0 R, 8 0 R, 9 0 R, 10 0 R} (object numbers not 
intended to be realistic for this example). But for the second 10 pages 
the kids list is {null, null, null, null, null, null, null, null, null, 
null, 11 0 R, 12 0 R, 13 0 R, 14 0 R, 15 0 R, 16 0 R, 17 0 R, 18 0 R, 19 
0 R, 20 0 R} since the page index (which is zero based) makes the page 
get placed in that index position on the tree, any previous unused 
indexes being filled with null. So for a 10,000 page doc there are going 
to be a lot of nulls in the page tree. For now setting the toPDFString() 
to ignore the nulls rather than throw an exception gets round this and 
allows the document to be correctly generated. In my tests all the pages 
are produced in the correct order. I was wondering though if there are 
any cases where the pages might not be passed in in the correct order 
(and hence might possibly explain why the notifyKidRegistered() method 
was written in the way it is), and if so if that has any implications on 
the way I have written the balanced page tree code updates.
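
To make the null padding concrete, here is a small sketch of index-based registration as described above. It is not the FOP source: `KidsList`, `register` and `toKidsString` are made-up names, and the real PDFPages code may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a kids list that places each reference at its page index. */
final class KidsList {
    private final List<String> kids = new ArrayList<>();

    /** Stores ref at the zero-based page index, padding any unused earlier slots with null. */
    void register(int pageIndex, String ref) {
        while (kids.size() <= pageIndex) {
            kids.add(null);
        }
        kids.set(pageIndex, ref);
    }

    /** Number of slots in the list, including null placeholders. */
    int slots() {
        return kids.size();
    }

    /** Number of real (non-null) entries. */
    int count() {
        int n = 0;
        for (String kid : kids) {
            if (kid != null) n++;
        }
        return n;
    }

    /** Writes the kids array contents, skipping null placeholders instead of failing on them. */
    String toKidsString() {
        StringBuilder sb = new StringBuilder("[");
        for (String kid : kids) {
            if (kid == null) continue;
            if (sb.length() > 1) sb.append(' ');
            sb.append(kid);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        KidsList secondNode = new KidsList();
        // Pages 11 and 12 of a document have zero-based indexes 10 and 11.
        secondNode.register(10, "11 0 R");
        secondNode.register(11, "12 0 R");
        System.out.println(secondNode.slots());        // 12
        System.out.println(secondNode.toKidsString()); // [11 0 R 12 0 R]
    }
}
```

Subtracting the index of the node's first page before calling `register` would be one way to avoid the padding altogether, assuming each node knows its own starting index.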

Thanks.

-Mike

On 03/06/11 22:38, Andreas L. Delmelle wrote:
> On 03 Jun 2011, at 10:54, Michael Rubin wrote:
>
> Hi Mike
>
>> Thanks a lot for your reply last week Andreas. Sorry for the delay. Been away and offline... FYI to follow up on the work I was doing:
> <snip />
>> So for example a 101 page document will have a root PDFPages node with two sub-nodes underneath. The first will contain a count of 100, and have 10 sub-nodes, each containing 10 pages. The second will simply contain 1 page. More new pages will get added to the second sub-node (moving pages down to new sub-nodes to avoid more than 10 pages per node) until its count reaches 100 too, then another node created. Once 10 nodes under the root exist (at 1000 pages) they will get moved down below a new root level sub-node with a count of 1000, and a new root level sub-node created, and so on.
> Cool! Impressive work. Will the number of pages per node be configurable?
>
>> Next task is to write a JUnit test since one appears not to exist... I guess remaining thoughts currently are:
>>
>> - Wondering if keeping references to a page tree object's sub-nodes or leaves is the best way or can I improve it further? (Bearing in mind memory usage and performance.)
> It depends a bit on whether you are thereby keeping PDFPage objects alive longer than necessary. The current design only stores the pages' referencePDF, so that seems safe.
>
>> - Was wondering if the trailer objects list is the right place to write the new sub-node PDFPages objects. (But if writing an object to the objects list - addObject() instead of addTrailerObject() - it gets written out too soon before I have added all the pages.) But given how it writes the objects out before writing the xref and trailer it seems OK and parses and shows fine in PDFBox/PDFDebugger and the evince PDF Reader in ubuntu.
> I would think that that is the correct place, although I must admit, I would have to check the PDF Spec to be certain.
>
>> - When registering the pages themselves via notifyKidRegistered() method it extracts the page index number and puts the reference at that index in the kids list, filling empty spaces ahead of it with nulls. So when counting kids and writing out the pdf code text I had to ignore nulls and 'gaps' in the kids list since not all the kids are in the same list any more (spread across multiple page tree nodes). I was wondering why this method was written like this, and doesn't simply append new pages to the end of the list all the time.
> AFAICT, what it is designed to do is make sure that the page is entered at the correct index in the list of kids. It would only create null entries if the list is not yet large enough. I have a feeling this is just by design, taking into account a single page tree node only (see the javadoc of the PDFPages class...)
>
>
> Regards
>
> Andreas
> ---


Re: Retrieving Objects question

Posted by "Andreas L. Delmelle" <an...@telenet.be>.

On 03 Jun 2011, at 10:54, Michael Rubin wrote:

Hi Mike

> Thanks a lot for your reply last week Andreas. Sorry for the delay. Been away and offline... FYI to follow up on the work I was doing:

<snip />
> So for example a 101 page document will have a root PDFPages node with two sub-nodes underneath. The first will contain a count of 100, and have 10 sub-nodes, each containing 10 pages. The second will simply contain 1 page. More new pages will get added to the second sub-node (moving pages down to new sub-nodes to avoid more than 10 pages per node) until its count reaches 100 too, then another node created. Once 10 nodes under the root exist (at 1000 pages) they will get moved down below a new root level sub-node with a count of 1000, and a new root level sub-node created, and so on.

Cool! Impressive work. Will the number of pages per node be configurable?

> Next task is to write a JUnit test since one appears not to exist... I guess remaining thoughts currently are:
> 
> - Wondering if keeping references to a page tree object's sub-nodes or leaves is the best way or can I improve it further? (Bearing in mind memory usage and performance.)

It depends a bit on whether you are thereby keeping PDFPage objects alive longer than necessary. The current design only stores the pages' referencePDF, so that seems safe.

> - Was wondering if the trailer objects list is the right place to write the new sub-node PDFPages objects. (But if writing an object to the objects list - addObject() instead of addTrailerObject() - it gets written out too soon before I have added all the pages.) But given how it writes the objects out before writing the xref and trailer it seems OK and parses and shows fine in PDFBox/PDFDebugger and the evince PDF Reader in ubuntu.

I would think that that is the correct place, although I must admit, I would have to check the PDF Spec to be certain.

> - When registering the pages themselves via notifyKidRegistered() method it extracts the page index number and puts the reference at that index in the kids list, filling empty spaces ahead of it with nulls. So when counting kids and writing out the pdf code text I had to ignore nulls and 'gaps' in the kids list since not all the kids are in the same list any more (spread across multiple page tree nodes). I was wondering why this method was written like this, and doesn't simply append new pages to the end of the list all the time.

AFAICT, what it is designed to do is make sure that the page is entered at the correct index in the list of kids. It would only create null entries if the list is not yet large enough. I have a feeling this is just by design, taking into account a single page tree node only (see the javadoc of the PDFPages class...)


Regards

Andreas
---

Re: Retrieving Objects question

Posted by Michael Rubin <mr...@thunderhead.com>.

Thanks a lot for your reply last week Andreas. Sorry for the delay. Been 
away and offline... FYI to follow up on the work I was doing:

In the end I saw that references are indeed kept by the PDFDocument. So 
I decided it wouldn't do any harm (or take up any significant extra 
memory) to keep references to the objects themselves when I am 
constructing the balanced page tree. I have since modified PDFPages (and 
a small change in PDFPage) and the first working draft completed late 
yesterday keeps a list of sub-nodes (PDFPages, managed internally via a 
recursive algorithm - external methods work as before to avoid 
regressions) or leaves (PDFPage) as well as the original kids (may be a 
PDFPage or a sub PDFPages object) with PDF references to all children. 
This eliminates an overhead of looking up each object (potentially many 
times). I have successfully run it with test .fo files up to 10001 pages 
(each just showing 'Page x/y' where x is current page and y is total 
page count, takes a while with that many pages but not surprised) 
verifying that a balanced tree gets produced (and not a flat tree of one 
page tree object containing 10001 pages!). When each subnode is created 
the PDFFactory.makePages() method stores it in the trailer. That way the 
objects are all written out at the end after I have added all the pages 
to the right places, just before the cross reference table and trailer 
themselves are written. So now there are never more than 10 pages or 10 
PDFPages (sub-nodes) per PDFPages object (I never mix sub-nodes and 
leaves on the same node). A similar structure to the page tree of the 
PDF 1.4 Reference document. Automatically generated on the fly.

So for example a 101 page document will have a root PDFPages node with 
two sub-nodes underneath. The first will contain a count of 100, and 
have 10 sub-nodes, each containing 10 pages. The second will simply 
contain 1 page. More new pages will get added to the second sub-node 
(moving pages down to new sub-nodes to avoid more than 10 pages per 
node) until its count reaches 100 too, then another node created. Once 
10 nodes under the root exist (at 1000 pages) they will get moved down 
below a new root level sub-node with a count of 1000, and a new root 
level sub-node created, and so on.
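
As a quick sanity check on the numbers above, the maximum depth of such a tree can be estimated with a few lines of arithmetic. The sketch assumes a fixed fan-out of 10 and that a new level is only added once the next page no longer fits, which is one reading of the description; `levelsFor` is an illustrative helper, not part of FOP.

```java
final class PageTreeDepth {
    static final int MAX_KIDS = 10; // fan-out mentioned in this thread

    /** Smallest number of page tree levels h (h >= 1) such that MAX_KIDS^h >= pageCount. */
    static int levelsFor(int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        int levels = 1;
        long capacity = MAX_KIDS;
        while (capacity < pageCount) {
            capacity *= MAX_KIDS;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        for (int pages : new int[] {10, 100, 101, 1000, 1001, 10001}) {
            System.out.println(pages + " pages -> " + levelsFor(pages) + " level(s) of page tree nodes");
        }
    }
}
```

For the 101 page example this gives three levels: the root, the sub-node with a count of 100, and its ten sub-nodes of ten pages each.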

Next task is to write a JUnit test since one appears not to exist... I 
guess remaining thoughts currently are:

- Wondering if keeping references to a page tree object's sub-nodes or 
leaves is the best way or can I improve it further? (Bearing in mind 
memory usage and performance.)
- Was wondering if the trailer objects list is the right place to write 
the new sub-node PDFPages objects. (But if writing an object to the 
objects list - addObject() instead of addTrailerObject() - it gets 
written out too soon before I have added all the pages.) But given how 
it writes the objects out before writing the xref and trailer it seems 
OK and parses and shows fine in PDFBox/PDFDebugger and the evince PDF 
Reader in ubuntu.
- When registering the pages themselves via notifyKidRegistered() 
method it extracts the page index number and puts the reference at that 
index in the kids list, filling empty spaces ahead of it with nulls. So 
when counting kids and writing out the pdf code text I had to ignore 
nulls and 'gaps' in the kids list since not all the kids are in the same 
list any more (spread across multiple page tree nodes). I was wondering 
why this method was written like this, and doesn't simply append new 
pages to the end of the list all the time.

Once testing is complete I'll submit the code internally for the in-team 
committers to review as I did with the 128 bit encryption work last month...

Thanks!

-Mike

On 25/05/11 21:57, Andreas L. Delmelle wrote:
> On 25 May 2011, at 09:45, Michael Rubin wrote:
>
> Hi Mike
>
>> Hello there. In the PDFPages class the kids are stored as reference
>> strings (e.g. "23 0 R"). Each of these objects are PDFPage objects. Do
>> you know if there is a method somewhere that I can retrieve the PDF java
>> object based on the reference string?
> Not really, AFAIK. What you do have is various Collections of different subtypes of PDFObject, available by means of accessors on PDFDocument.
> I guess the closest you would get without too much effort is to obtain the one you're interested in, then iterate over its elements and check PDFObject.referencePDF() against the lookup string. You do have to know the type(s) of object you need in advance, though...
>
>> (I am aiming to add support for some of those kids being other PDFPages
>> nodes to create a more balanced page tree.)
> Interesting. Looking forward to seeing more.
>
>
> Regards
>
> Andreas
> ---


Re: Retrieving Objects question

Posted by "Andreas L. Delmelle" <an...@telenet.be>.

On 25 May 2011, at 09:45, Michael Rubin wrote:

Hi Mike

> Hello there. In the PDFPages class the kids are stored as reference
> strings (e.g. "23 0 R"). Each of these objects are PDFPage objects. Do
> you know if there is a method somewhere that I can retrieve the PDF java
> object based on the reference string?

Not really, AFAIK. What you do have is various Collections of different subtypes of PDFObject, available by means of accessors on PDFDocument. 
I guess the closest you would get without too much effort is to obtain the one you're interested in, then iterate over its elements and check PDFObject.referencePDF() against the lookup string. You do have to know the type(s) of object you need in advance, though...
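
For illustration, the linear scan described above could look roughly like this. `Referenceable` is a hypothetical stand-in for FOP's PDFObject so that the sketch compiles on its own; only the referencePDF() name is taken from this thread, and the collection passed in would come from whichever PDFDocument accessor holds the type you need.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for FOP's PDFObject; only referencePDF() mirrors the name used in this thread. */
interface Referenceable {
    String referencePDF();
}

final class ReferenceLookup {

    /** Linear scan: returns the first object whose PDF reference string equals the lookup string. */
    static <T extends Referenceable> Optional<T> findByReference(Collection<T> objects, String ref) {
        for (T obj : objects) {
            if (ref.equals(obj.referencePDF())) {
                return Optional.of(obj);
            }
        }
        return Optional.empty();
    }

    /** Minimal page type used only for this demo. */
    static final class DemoPage implements Referenceable {
        private final int objectNumber;

        DemoPage(int objectNumber) {
            this.objectNumber = objectNumber;
        }

        @Override
        public String referencePDF() {
            return objectNumber + " 0 R";
        }
    }

    public static void main(String[] args) {
        List<DemoPage> pages = Arrays.asList(new DemoPage(21), new DemoPage(23));
        System.out.println(findByReference(pages, "23 0 R").isPresent()); // true
        System.out.println(findByReference(pages, "99 0 R").isPresent()); // false
    }
}
```

Each lookup is a full scan, so if many references need resolving it may be cheaper to build a Map from reference string to object once and query that instead.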

> 
> (I am aiming to add support for some of those kids being other PDFPages
> nodes to create a more balanced page tree.)

Interesting. Looking forward to seeing more.

Regards

Andreas
---