Posted to users@pdfbox.apache.org by Tim Reynolds <ti...@yahoo.com> on 2010/02/24 04:08:52 UTC

How to get Page Number from Bookmark conflicting results

Hi,

First, sorry if this is not the correct way to post; I was unable to locate the PDFBox forum
other than through MarkMail.

Second, I know there are two posts regarding this topic; I've read both of them.

Getting page number for bookmarks    (BM-Thread)
How to get Page Number from a PDPage (PDP-Thread)

I'm using Adobe Acrobat 8 Pro.
I am using PDFBox on the Hibernate Search in Action PDF, but I think this problem
would be the same on other book PDFs.

When I open a PDF document in Acrobat there are two indications of which page I am on,
[pageN] and (page# of total#), and these two numbers DON'T always match.
The [pageN] value matches the number you would see on the printed page, while
(page# of total#) counts the sheets of paper in the book,
i.e. including all the pages before page 1: TOC, copyright, ...
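
To make the difference between the two numbers concrete, here is a minimal sketch of how a printed label could be derived from a sheet index. It assumes the document numbers its pages in ranges that each restart the count (PDF calls these page labels). The class PageLabelSketch, its methods, and the range boundaries in main are made up for illustration; they are not PDFBox API and not the actual layout of this book.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of page-label ranges; not a PDFBox class.
class PageLabelSketch {
    // One numbering range: its style and the number given to its first sheet.
    static final class Range {
        final boolean roman;
        final int firstNumber;
        Range(boolean roman, int firstNumber) {
            this.roman = roman;
            this.firstNumber = firstNumber;
        }
    }

    // Ranges keyed by the zero-based sheet index where they start.
    private final TreeMap<Integer, Range> ranges = new TreeMap<>();

    void addRange(int startSheetIndex, boolean roman, int firstNumber) {
        ranges.put(startSheetIndex, new Range(roman, firstNumber));
    }

    // Returns the printed label for a zero-based sheet index.
    String labelFor(int sheetIndex) {
        Map.Entry<Integer, Range> e = ranges.floorEntry(sheetIndex);
        if (e == null) {
            return String.valueOf(sheetIndex + 1); // no range: fall back to the sheet number
        }
        int n = e.getValue().firstNumber + (sheetIndex - e.getKey());
        return e.getValue().roman ? toLowerRoman(n) : String.valueOf(n);
    }

    static String toLowerRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PageLabelSketch labels = new PageLabelSketch();
        labels.addRange(0, true, 1);   // assumed: front matter labelled i, ii, iii, ...
        labels.addRange(24, false, 1); // assumed: body restarts at 1 on the 25th sheet
        System.out.println(labels.labelFor(7));  // label of the 8th sheet
        System.out.println(labels.labelFor(27)); // label of the 28th sheet
    }
}
```

With these assumed ranges the 8th sheet gets a roman label and the 28th sheet gets a small arabic one, which is the kind of mismatch Acrobat shows between [pageN] and (page# of total#).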

It seems that when I use the code in the BM-Thread, references to the first 21 sheets
all indicate page 1, and thereafter this approach provides the [pageN] value, i.e. the number
that you would see on the printed page of a book.

With BM-Thread I see the following problems:
1. Bookmarks have PDAction = null
2. The book is divided into Parts/Chapters/SubChapters/... There are bookmarks
    for each. However, only the Parts have a PDAction, and each Part shows the same
    page (1), as in page = 1.
3. The Index bookmarks don't show the [pageN] value; they show the sheet number.

When I use the code from the PDP-Thread, I get a number which is basically the
sheet count, i.e. the page# from (page# of total#) above, except that it doesn't work
once you get to the book index.

It seems like it would have been easier if I could get a PDPageDestination from the
PDOutlineItem and then use pdpageDest.getPageNumber(), but the type coming
back from PDOutlineItem.getDestination() is PDNamedDestination.
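
The extra step a named destination needs can be sketched as follows. This is only an illustration under assumptions: the types Destination, PageDestination and NamedDestination are hypothetical stand-ins rather than the PDFBox classes, and the name table is built by hand here instead of being read from the document. The single entry reuses the name G1.558638 and sheet number 8 from the sample output below.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for destination types; not PDFBox classes.
class DestinationSketch {
    interface Destination {}

    // A destination that already points at a sheet (zero-based index).
    static final class PageDestination implements Destination {
        final int pageIndex;
        PageDestination(int pageIndex) { this.pageIndex = pageIndex; }
    }

    // A destination that only carries a name and must be looked up first.
    static final class NamedDestination implements Destination {
        final String name;
        NamedDestination(String name) { this.name = name; }
    }

    // Returns a one-based sheet number, or -1 when nothing can be resolved.
    static int sheetNumber(Destination dest, Map<String, PageDestination> names) {
        if (dest instanceof NamedDestination) {
            // One level of indirection: name -> page destination (null if unknown).
            dest = names.get(((NamedDestination) dest).name);
        }
        if (dest instanceof PageDestination) {
            return ((PageDestination) dest).pageIndex + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, PageDestination> names = new HashMap<>();
        names.put("G1.558638", new PageDestination(7)); // hand-built table entry
        System.out.println(sheetNumber(new NamedDestination("G1.558638"), names));
        System.out.println(sheetNumber(new NamedDestination("unknown"), names));
    }
}
```

The point of the sketch is only that a named destination is one lookup away from a page destination, so code that expects a page destination directly has to handle both cases.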

Basically, I just modified PrintBookmarks.java to get the results:

        while (current != null) {
            dest = current.getDestination();
            pdAction = current.getAction();
            if (pdAction != null) {
                // From BM-Thread
                COSObject targetPageRef = (COSObject) ((COSArray) current
                        .getAction().getCOSDictionary()
                        .getDictionaryObject("D")).get(0);
                String objStr = String.valueOf(targetPageRef.getObjectNumber()
                        .intValue());
                String genStr = String.valueOf(targetPageRef
                        .getGenerationNumber().intValue());
                szKey = objStr + "," + genStr;
                pageNumber = (Integer) getPageMap().get(objStr + "," + genStr);
            } else if (dest != null) {
                // From PDP-Thread
                PDPage pdp = current.findDestinationPage(document);
                document.getDocumentCatalog().getPages();
                List allpages = new ArrayList();
                document.getDocumentCatalog().getPages().getAllKids(allpages);
                pageNum = allpages.indexOf(pdp) + 1;

                if (dest instanceof PDNamedDestination) {
                    szDest = ((PDNamedDestination)dest).getNamedDestination();
                }

            }
            System.out.println(indentation + current.getTitle() + " ... page: "
                    + pageNumber + " key: " + szKey + " dest: " + szDest
                    + " pageNum: " + pageNum);
            printBookmark(current, indentation + "    ");
            current = current.getNextSibling();
        }

Sample output:

Hibernate Search ... page: 1 key: 30269,0 dest: null pageNum: null
contents ... page: 1 key: 30269,0 dest: G1.558638 pageNum: 8
preface ... page: 1 key: 30269,0 dest: G2.552675 pageNum: 16
acknowledgments ... page: 1 key: 30269,0 dest: G2.557271 pageNum: 18
about this book ... page: 1 key: 30269,0 dest: G2.557499 pageNum: 20
Part 1  Understanding Search Technology ... page: 1 key: 30269,0 dest: G3.1005308 pageNum: 26
    Chapter 1  State of the art ... page: null key: null dest: G3.998410 pageNum: 28
        1.1 What is search? ... page: null key: null dest: G3.998485 pageNum: 29
 
...
Part 5  Native Lucene, scoring, and the wheel ... page: 1 key: 30269,0 dest: G14.1040306 pageNum: 376
    Chapter 12  Document ranking ... page: null key: null dest: G14.1023742 pageNum: 378
...
        13.4 Summary ... page: null key: null dest: G15.1021300 pageNum: 465
appendix: Quick reference ... page: 1 key: 30269,0 dest: G16.998406 pageNum: 466
    Hibernate Search mapping annotations ... page: null key: null dest: G16.998426 pageNum: 466
    Hibernate Search APIs ... page: null key: null dest: G16.999345 pageNum: 468
    Lucene queries ... page: null key: null dest: G16.1001842 pageNum: 473
index ... page: 1 key: 30269,0 dest: G17.174043 pageNum: 476
    Symbols ... page: 476 key: 3672,0 dest: null pageNum: null
    Numerics ... page: 476 key: 3672,0 dest: null pageNum: null
    A ... page: 476 key: 3672,0 dest: null pageNum: null
    B ... page: 476 key: 3672,0 dest: null pageNum: null

...



Tim Reynolds

(timr_317@yahoo.com)


      

Re: How to get Page Number from Bookmark conflicting results

Posted by Ad...@swmc.com.
Here's the core code:
        PDDocument doc = null;
        try {
            doc = PDDocument.load("somefile.pdf");
            PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
            if(root != null) { // if there's no outline, there are certainly no bookmarks!
                PDOutlineItem item = root.getFirstChild();
                processNodeAndChildren(item, doc);
            }
        } finally {
            try {
                if(doc != null)
                    doc.close();
            } catch(Exception ex) {
                // not much we can do about this...
            }
        }

Bookmarks are stored recursively in PDOutlineItem objects, so 
processNodeAndChildren() will process them this way.  I can't share all of 
the code for that function, but here are the key points (note this is not 
intended to be a copy/paste solution, it's just to give you an idea of how 
it works):
        while(item != null) {
            COSObject targetPageRef = null;
            if(item.getTitle() != null) {
                try {
                    // first try the bookmark's action; its "D" entry holds the destination
                    targetPageRef = (COSObject)((COSArray)item.getAction()
                            .getCOSDictionary().getDictionaryObject("D")).get(0);
                } catch(Exception ex) {
                    // no action, or "D" is not an array
                }
                // and if that doesn't work
                if(targetPageRef == null) {
                    PDDestination dest = item.getDestination();
                    // note: this cast fails if the destination is not an array
                    if(dest != null)
                        targetPageRef = (COSObject)((COSArray)dest.getCOSObject()).get(0);
                }
                if(targetPageRef != null) {
                    String objStr = String.valueOf(targetPageRef.getObjectNumber().intValue());
                    String genStr = String.valueOf(targetPageRef.getGenerationNumber().intValue());
                    Integer pageNumber = (Integer)doc.getPageMap().get(objStr+","+genStr);
                }
            }
            // recurse into the children, then always move on to the next sibling
            processNodeAndChildren(item.getFirstChild(), doc);
            item = item.getNextSibling();
        }

You'll also want to read the PDF specification on how bookmarks are 
stored.  Not all of them point to a page number!  I'm not currently aware 
of any method to determine a page number for a bookmark which doesn't 
point to a page (however, they seem to be rare).

Also, take a look at doc.getPageMap(), I think that'll help you with your 
second issue as well.
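
For what it's worth, the traversal and page-map lookup described above can be tried out without PDFBox at all. The following is a minimal sketch under assumptions: Item is a hypothetical stand-in for an outline item, the page map is a plain Map keyed by "objectNumber,generation" strings, and the keys and page numbers in main are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in types for illustration only; these are not PDFBox classes.
class OutlineWalkSketch {
    static final class Item {
        final String title;
        final String pageKey; // "objectNumber,generation" of the target page, or null
        Item firstChild;
        Item nextSibling;
        Item(String title, String pageKey) {
            this.title = title;
            this.pageKey = pageKey;
        }
    }

    // Loops over siblings and recurses into children, collecting one line per bookmark.
    static void walk(Item item, Map<String, Integer> pageMap, String indent, List<String> out) {
        while (item != null) {
            Integer page = item.pageKey == null ? null : pageMap.get(item.pageKey);
            out.add(indent + item.title + " -> " + (page == null ? "no page" : "page " + page));
            walk(item.firstChild, pageMap, indent + "    ", out);
            item = item.nextSibling; // always advance, even when no page was found
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> pageMap = new HashMap<>();
        pageMap.put("12,0", 1); // made-up keys and page numbers
        pageMap.put("57,0", 5);

        Item part = new Item("Part 1", "12,0");
        Item chapter = new Item("Chapter 1", "57,0");
        Item index = new Item("index", null); // a bookmark that points to no page
        part.firstChild = chapter;
        part.nextSibling = index;

        List<String> lines = new ArrayList<>();
        walk(part, pageMap, "", lines);
        lines.forEach(System.out::println);
    }
}
```

Keeping the sibling step outside any page check means a bookmark without a page is reported and skipped rather than stalling the loop.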

--Adam


