Posted to users@pdfbox.apache.org by Tim Reynolds <ti...@yahoo.com> on 2010/02/24 04:08:52 UTC
How to get Page Number from Bookmark conflicting results
Hi,
First, sorry if this is not the correct way to post; I was unable to locate the PDFBox forum
other than through MarkMail.
Second, I know there are two posts regarding this topic, and I've read both of them.
Getting page number for bookmarks (BM-Thread)
How to get Page Number from a PDPage (PDP-Thread)
I'm using Adobe Acrobat 8 Pro.
I am using PDFBox on the Hibernate Search in Action PDF, but I think this problem
would be the same with other book PDFs.
When I open a PDF document in Acrobat there are two indications of which page I am on,
[pageN] and (page# of total#), and in Acrobat these two numbers don't always match.
In fact, [pageN] matches the number you would see on the printed page, while
(page# of total#) counts the physical sheets of paper in the book,
i.e. including all the pages before page 1: TOC, copyright, ...
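In the PDF format, the printed-page number that a viewer shows next to the sheet count comes from page labels (the /PageLabels entry of the document catalog), which assign a numbering style and a start value to ranges of sheets. The following is a minimal sketch of that mapping in plain Java, not PDFBox code: the class PageLabelSketch, its LabelRange type, and the layout of 10 roman-numbered front-matter sheets are all assumptions made for illustration and are not taken from the book.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of PDF page-label ranges; none of these names are PDFBox APIs.
class PageLabelSketch {

    /** One label range: the zero-based sheet index where it starts, its style, and its first value. */
    static final class LabelRange {
        final int startIndex;
        final char style;      // 'r' = lowercase roman, anything else = decimal
        final int firstValue;

        LabelRange(int startIndex, char style, int firstValue) {
            this.startIndex = startIndex;
            this.style = style;
            this.firstValue = firstValue;
        }
    }

    /** Lowercase roman numeral for n between 1 and 3999. */
    static String toRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /** Printed label for a zero-based sheet index; ranges must be sorted by startIndex. */
    static String labelFor(List<LabelRange> ranges, int sheetIndex) {
        LabelRange active = null;
        for (LabelRange r : ranges) {
            if (r.startIndex <= sheetIndex) {
                active = r; // the last range starting at or before this sheet applies
            }
        }
        if (active == null) {
            return String.valueOf(sheetIndex + 1); // no labels: fall back to the sheet number
        }
        int value = active.firstValue + (sheetIndex - active.startIndex);
        return active.style == 'r' ? toRoman(value) : String.valueOf(value);
    }

    public static void main(String[] args) {
        // Assumed layout: 10 sheets of front matter, then the body restarts at 1.
        List<LabelRange> ranges = new ArrayList<>();
        ranges.add(new LabelRange(0, 'r', 1));
        ranges.add(new LabelRange(10, 'D', 1));
        for (int sheet : new int[] {0, 7, 9, 10, 13}) {
            System.out.println("sheet " + (sheet + 1) + " -> [" + labelFor(ranges, sheet) + "]");
        }
    }
}
```

Under that assumed layout, sheet 11 carries the printed label 1, which is the kind of offset between [pageN] and (page# of total#) described above.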
It seems that when I use the code in the BM-Thread, references to the first 21 sheets
all indicate page 1, and thereafter it provides the [pageN] value, i.e. the number
that you would see on the printed page of a book.
With BM-Thread I see the following problems:
1. Bookmarks have PDAction = null
2. The book is divided into Parts/Chapters/SubChapters/... There are bookmarks
for each. However, only the Parts have a PDAction, and each Part shows the same
page (page = 1).
3. The Index bookmarks don't show the [page#]; they show the sheet#.
When I use the code from the PDP-Thread, I get a number which is basically the
sheet count, i.e. the page# from the above (page# of total#), except that it doesn't work
once you get to the book's Index.
It seems like it would have been easier if I could get a PDPageDestination from
PDOutlineItem and then use pdpageDest.getPageNumber(), but the type coming
back from PDOutlineItem.getDestination() is PDNamedDestination.
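A named destination is only a name; in the PDF format it has to be looked up in the document's destination table before it yields a page. The sketch below models that two-step resolution in plain Java. It is not PDFBox code: the Destination, PageDestination and NamedDestination types and the nameTable map are hypothetical stand-ins, and the single table entry reuses the name and sheet number printed for "Chapter 1" in the sample output.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for explicit and named destinations; not the PDFBox classes.
class NamedDestinationSketch {

    interface Destination {}

    /** An explicit destination that already knows its 1-based sheet number. */
    static final class PageDestination implements Destination {
        final int sheetNumber;
        PageDestination(int sheetNumber) { this.sheetNumber = sheetNumber; }
    }

    /** A destination that carries only a name, such as "G3.998410". */
    static final class NamedDestination implements Destination {
        final String name;
        NamedDestination(String name) { this.name = name; }
    }

    /** Returns the 1-based sheet number, or -1 if the destination cannot be resolved. */
    static int resolve(Destination dest, Map<String, PageDestination> nameTable) {
        if (dest instanceof NamedDestination) {
            // Step 1: swap the name for the explicit destination it stands for (may be null).
            dest = nameTable.get(((NamedDestination) dest).name);
        }
        if (dest instanceof PageDestination) {
            // Step 2: an explicit destination gives the sheet directly.
            return ((PageDestination) dest).sheetNumber;
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, PageDestination> nameTable = new HashMap<>();
        nameTable.put("G3.998410", new PageDestination(28));

        System.out.println(resolve(new NamedDestination("G3.998410"), nameTable));
        System.out.println(resolve(new NamedDestination("no-such-name"), nameTable));
        System.out.println(resolve(new PageDestination(5), nameTable));
    }
}
```

Returning -1 for an unknown name keeps the caller from treating an unresolved bookmark as page 0 or page 1.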
Basically, I just modified the PrintBookmarks.java to get the results:
while (current != null) {
    dest = current.getDestination();
    pdAction = current.getAction();
    if (pdAction != null) {
        // From BM-Thread
        COSObject targetPageRef = (COSObject) ((COSArray) current
                .getAction().getCOSDictionary()
                .getDictionaryObject("D")).get(0);
        String objStr = String.valueOf(targetPageRef.getObjectNumber()
                .intValue());
        String genStr = String.valueOf(targetPageRef
                .getGenerationNumber().intValue());
        szKey = objStr + "," + genStr;
        pageNumber = (Integer) getPageMap().get(objStr + "," + genStr);
    } else if (dest != null) {
        // From PDP-Thread
        PDPage pdp = current.findDestinationPage(document);
        document.getDocumentCatalog().getPages();
        List allpages = new ArrayList();
        document.getDocumentCatalog().getPages().getAllKids(allpages);
        pageNum = allpages.indexOf(pdp) + 1;
        if (dest instanceof PDNamedDestination) {
            szDest = ((PDNamedDestination)dest).getNamedDestination();
        }
    }
    System.out.println(indentation + current.getTitle() + " ... page: "
            + pageNumber + " key: " + szKey + " dest: " + szDest
            + " pageNum: " + pageNum);
    printBookmark(current, indentation + " ");
    current = current.getNextSibling();
}
Sample output:
Hibernate Search ... page: 1 key: 30269,0 dest: null pageNum: null
contents ... page: 1 key: 30269,0 dest: G1.558638 pageNum: 8
preface ... page: 1 key: 30269,0 dest: G2.552675 pageNum: 16
acknowledgments ... page: 1 key: 30269,0 dest: G2.557271 pageNum: 18
about this book ... page: 1 key: 30269,0 dest: G2.557499 pageNum: 20
Part 1 Understanding Search Technology ... page: 1 key: 30269,0 dest: G3.1005308 pageNum: 26
Chapter 1 State of the art ... page: null key: null dest: G3.998410 pageNum: 28
1.1 What is search? ... page: null key: null dest: G3.998485 pageNum: 29
...
Part 5 Native Lucene, scoring, and the wheel ... page: 1 key: 30269,0 dest: G14.1040306 pageNum: 376
Chapter 12 Document ranking ... page: null key: null dest: G14.1023742 pageNum: 378
...
13.4 Summary ... page: null key: null dest: G15.1021300 pageNum: 465
appendix: Quick reference ... page: 1 key: 30269,0 dest: G16.998406 pageNum: 466
Hibernate Search mapping annotations ... page: null key: null dest: G16.998426 pageNum: 466
Hibernate Search APIs ... page: null key: null dest: G16.999345 pageNum: 468
Lucene queries ... page: null key: null dest: G16.1001842 pageNum: 473
index ... page: 1 key: 30269,0 dest: G17.174043 pageNum: 476
Symbols ... page: 476 key: 3672,0 dest: null pageNum: null
Numerics ... page: 476 key: 3672,0 dest: null pageNum: null
A ... page: 476 key: 3672,0 dest: null pageNum: null
B ... page: 476 key: 3672,0 dest: null pageNum: null
...
Tim Reynolds
(timr_317@yahoo.com)
Re: How to get Page Number from Bookmark conflicting results
Posted by Ad...@swmc.com.
Here's the core code:
PDDocument doc = null;
try {
    doc = PDDocument.load("somefile.pdf");
    PDDocumentOutline root =
        doc.getDocumentCatalog().getDocumentOutline();
    // if there's no outline, there are certainly no bookmarks!
    if(root != null) {
        PDOutlineItem item = root.getFirstChild();
        processNodeAndChildren(item, doc);
    }
} finally {
    try {
        if(doc != null)
            doc.close();
    } catch(Exception ex) {
        // not much we can do about this...
    }
}
Bookmarks are stored recursively in PDOutlineItem objects, so
processNodeAndChildren() will process them this way. I can't share all of
the code for that function, but here are the key points (note this is not
intended to be a copy/paste solution, it's just to give you an idea of how
it works):
while(item != null) {
    COSObject targetPageRef = null;
    if(item.getTitle() != null) {
        try {
            // page reference from the action's "D" entry; may throw an exception
            targetPageRef =
                (COSObject)((COSArray)item.getAction().getCOSDictionary().getDictionaryObject("D")).get(0);
        } catch(Exception ex) {
            // and if that doesn't work, try the destination instead
            PDDestination dest = item.getDestination();
            if(dest != null)
                targetPageRef =
                    (COSObject)((COSArray)dest.getCOSObject()).get(0);
        }
    }
    if(targetPageRef != null) {
        String objStr =
            String.valueOf(targetPageRef.getObjectNumber().intValue());
        String genStr =
            String.valueOf(targetPageRef.getGenerationNumber().intValue());
        Integer pageNumber =
            (Integer)doc.getPageMap().get(objStr+","+genStr);
    }
    // always recurse and advance, even when the title is null
    processNodeAndChildren(item.getFirstChild(), doc);
    item = item.getNextSibling();
}
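The traversal pattern itself (visit an item, recurse into its first child, then move to the next sibling) can be tried without a PDF. Here is a minimal sketch in plain Java, assuming a hypothetical Item class with firstChild and nextSibling links that mimics the shape of an outline tree; it is not PDFBox code, and the titles are borrowed from the sample output only as labels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical outline model using first-child / next-sibling links; not the PDFBox classes.
class OutlineWalkSketch {

    static final class Item {
        final String title;
        Item firstChild;
        Item nextSibling;
        Item(String title) { this.title = title; }
    }

    /** Depth-first walk that records each title, indented by its nesting level. */
    static void walk(Item item, String indent, List<String> out) {
        while (item != null) {
            out.add(indent + item.title);
            walk(item.firstChild, indent + "  ", out); // children first
            item = item.nextSibling;                   // always advance, so the loop terminates
        }
    }

    public static void main(String[] args) {
        Item part1 = new Item("Part 1");
        Item chapter1 = new Item("Chapter 1");
        Item section = new Item("1.1 What is search?");
        Item part5 = new Item("Part 5");
        part1.firstChild = chapter1;
        chapter1.firstChild = section;
        part1.nextSibling = part5;

        List<String> lines = new ArrayList<>();
        walk(part1, "", lines);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

Advancing to the next sibling unconditionally at the end of the loop body is what guarantees termination; if the advance sits inside a conditional, an item that fails the condition is visited forever.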
You'll also want to read the PDF specification on how bookmarks are
stored. Not all of them point to a page! I'm not aware of any
method at present for determining a page number for a bookmark that doesn't
point to a page (however, such bookmarks seem to be rare).
Also, take a look at doc.getPageMap(); I think that'll help you with your
second issue as well.
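For illustration, the lookup used above relies on a map keyed by "objectNumber,generationNumber" whose values are page numbers. The sketch below builds such a map in plain Java from page references listed in document order. It is a model of the idea, not the PDFBox implementation: the class PageMapSketch, its methods, and the object numbers in main are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an "obj,gen" -> page number map; not the PDFBox implementation.
class PageMapSketch {

    /** Builds the lookup key in the same "objectNumber,generationNumber" form used above. */
    static String key(long objectNumber, int generationNumber) {
        return objectNumber + "," + generationNumber;
    }

    /** Maps each page key to its 1-based position in document order. */
    static Map<String, Integer> buildPageMap(List<String> keysInDocumentOrder) {
        Map<String, Integer> pageMap = new HashMap<>();
        for (int i = 0; i < keysInDocumentOrder.size(); i++) {
            pageMap.put(keysInDocumentOrder.get(i), i + 1);
        }
        return pageMap;
    }

    public static void main(String[] args) {
        // Made-up object numbers; object numbers need not follow page order.
        List<String> pages = Arrays.asList(key(41, 0), key(7, 0), key(112, 0));
        Map<String, Integer> pageMap = buildPageMap(pages);

        System.out.println(pageMap.get(key(7, 0)));
        System.out.println(pageMap.get(key(999, 0))); // unknown reference: no entry
    }
}
```

A missing entry comes back as null rather than a number, so a caller should check for null before unboxing the result.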
--Adam