Posted to users@pdfbox.apache.org by David Patterson <pa...@gmail.com> on 2017/05/15 13:20:09 UTC

More questions about page iteration

I've now got my code working to iterate through a PDDocument and process it
page by page.

Next hurdle: Is there a way to get the page number as printed? I've got
page numbers like "TOC-1", "TOC-2", "Page 1", ...

How much work is it to get the "TOC-1"?

Thanks.

Dave Patterson

Re: More questions about page iteration

Posted by David Patterson <pa...@gmail.com>.

Tilman,

You don't need to research it for me. If I ever need to use PDFBox to
create content, I'll research how to make pageLabels then. For now, I'm
able to proceed without dealing with that detail.

You have been very helpful.

Dave Patterson

On Tue, May 16, 2017 at 9:33 AM, Tilman Hausherr <TH...@t-online.de> wrote:

> On 16.05.2017 at 15:30, David Patterson wrote:
>
>> Tilman,
>>
>> Thanks. That was what I had come to realize when the PageLabels were null.
>>
>> Just out of curiosity, how do page labels get created?
>
> I don't know. If somebody really needed to create these, I would have
> to research it by looking at the API.
>
> Tilman
>
>> Dave Patterson
>>
>> On Tue, May 16, 2017 at 9:26 AM, Tilman Hausherr <TH...@t-online.de> wrote:
>>
>>> Sadly for you, that one has nothing to do with page labels. It's really
>>> just a footer on the page. And there is no concept of "footer" in PDF.
>>> It's just text at the bottom.
>>>
>>> Tilman
>>>
>>> On 16.05.2017 at 15:21, David Patterson wrote:
>>>
>>>> They show up when I print the PDF or open it to read it. I want to
>>>> extract the Table of Contents from each of more than 100 PDFs so I can
>>>> make a super-Table of Contents and allow users to search for the
>>>> document they need to read. (The file name of the desired contents is
>>>> not obvious, so with a consolidated Table of Contents, a more novice
>>>> user can find the content they want to read and open the correct
>>>> document to see the text. These are Standard Operating Procedures for a
>>>> 24x7 production facility, and the operators might need to review what
>>>> to do in case of a problem.)
>>>>
>>>> I was hoping that in the transition from Word (where the documents are
>>>> authored), through saving as a PDF and combining them into Portfolios,
>>>> some part of the process would have identified it as a page label, but
>>>> I guess that did not happen.
>>>>
>>>> I'm able to find the text of that string since it only occurs in the
>>>> footer of the page.
>>>>
>>>> Thanks.
>>>>
>>>> Dave Patterson
>>>>
>>>> On Tue, May 16, 2017 at 8:42 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>
>>>>> On 16.05.2017 at 14:35, David Patterson wrote:
>>>>>
>>>>>> Tilman,
>>>>>>
>>>>>> The code I tried is:
>>>>>>
>>>>>> byte[] bytes = // content of file as a byte array
>>>>>> PDDocument pdDocument = PDDocument.load( bytes );
>>>>>> PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
>>>>>> PDPageLabels pageLabels = cat2.getPageLabels();
>>>>>> if ( pageLabels == null ) {
>>>>>>     System.out.println( "Page labels missing " );
>>>>>> }
>>>>>>
>>>>>> I'm getting "Page labels missing" on each document.
>>>>>
>>>>> Then let's go back to the beginning. You mentioned "I've got page
>>>>> numbers like "TOC-1", "TOC-2", "Page 1"". Where did these show up?
>>>>>
>>>>> Tilman
>>>>>
>>>>>> I have no idea of, or control over, the process used to convert a
>>>>>> Word file into a PDF. I just inherited a bunch of PDFs that I'm
>>>>>> trying to interpret.
>>>>>>
>>>>>> Dave Patterson
>>>>>>
>>>>>> On Mon, May 15, 2017 at 1:57 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>
>>>>>>> On 15.05.2017 at 19:11, David Patterson wrote:
>>>>>>>
>>>>>>>> Alas, after testing with my documents, the PageLabels is null. :-(
>>>>>>>
>>>>>>> But you said it has "TOC-1". This sounds like pagelabels. You can
>>>>>>> also try with PDFDebugger, it will show the labels if there are some.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>> Thank you for the help and encouragement.
>>>>>>>>
>>>>>>>> Dave Patterson
>>>>>>>>
>>>>>>>> On Mon, May 15, 2017 at 12:34 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> On 15.05.2017 at 18:30, David Patterson wrote:
>>>>>>>>>
>>>>>>>>>> Tilman,
>>>>>>>>>>
>>>>>>>>>> Thank you very much. (I feel bad asking some of the questions,
>>>>>>>>>> but the data is stored in "out of the way" corners that are hard
>>>>>>>>>> to find.)
>>>>>>>>>
>>>>>>>>> Don't :-)
>>>>>>>>>
>>>>>>>>>> Is there any documentation that explains how the linkages work?
>>>>>>>>>> Would it help to have the PDF Standard Document?
>>>>>>>>>
>>>>>>>>> Yes. I read there all the time. The PDFBox API closely follows the
>>>>>>>>> PDF specification. So here it's linked from the document catalog,
>>>>>>>>> so the methods used are in the PDDocumentCatalog class. But asking
>>>>>>>>> was a good decision as this got you that convenience method (that
>>>>>>>>> is in PDFDebugger).
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> Dave Patterson
>>>>>>>>>>
>>>>>>>>>> On Mon, May 15, 2017 at 12:13 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 15.05.2017 at 15:20, David Patterson wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I've now got my code working to iterate through a PDDocument
>>>>>>>>>>>> and process it page by page.
>>>>>>>>>>>>
>>>>>>>>>>>> Next hurdle: Is there a way to get the page number as printed?
>>>>>>>>>>>> I've got page numbers like "TOC-1", "TOC-2", "Page 1", ...
>>>>>>>>>>>>
>>>>>>>>>>>> How much work is it to get the "TOC-1"?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Dave Patterson
>>>>>>>>>>>
>>>>>>>>>>>     /**
>>>>>>>>>>>      * Convenience method to get the page label if available.
>>>>>>>>>>>      *
>>>>>>>>>>>      * @param document
>>>>>>>>>>>      * @param pageIndex 0-based page number.
>>>>>>>>>>>      * @return a page label or null if not available.
>>>>>>>>>>>      */
>>>>>>>>>>>     public static String getPageLabel(PDDocument document, int pageIndex)
>>>>>>>>>>>     {
>>>>>>>>>>>         PDPageLabels pageLabels;
>>>>>>>>>>>         try
>>>>>>>>>>>         {
>>>>>>>>>>>             pageLabels = document.getDocumentCatalog().getPageLabels();
>>>>>>>>>>>         }
>>>>>>>>>>>         catch (IOException ex)
>>>>>>>>>>>         {
>>>>>>>>>>>             return ex.getMessage();
>>>>>>>>>>>         }
>>>>>>>>>>>         if (pageLabels != null)
>>>>>>>>>>>         {
>>>>>>>>>>>             String[] labels = pageLabels.getLabelsByPageIndices();
>>>>>>>>>>>             if (labels[pageIndex] != null)
>>>>>>>>>>>             {
>>>>>>>>>>>                 return labels[pageIndex];
>>>>>>>>>>>             }
>>>>>>>>>>>         }
>>>>>>>>>>>         return null;
>>>>>>>>>>>     }
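
The getPageLabel helper quoted above returns null when a document has no page labels, which is the situation described in this thread, and it indexes the label array without checking its length. A minimal sketch of a caller-side fallback follows, assuming the label array has already been obtained (or is null); the class and method names are hypothetical and are not part of PDFBox.

```java
/** Hypothetical helper: pick a printable page identifier even when labels are missing. */
final class PageLabelFallback {

    /**
     * @param labels    labels indexed by 0-based page number, or null if the
     *                  document defines no page labels
     * @param pageIndex 0-based page number
     * @return the label if one exists, otherwise the 1-based physical page number
     */
    static String labelOrPageNumber(String[] labels, int pageIndex) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0");
        }
        // Guard against a missing array, a short array, and a null entry.
        if (labels != null && pageIndex < labels.length && labels[pageIndex] != null) {
            return labels[pageIndex];
        }
        return String.valueOf(pageIndex + 1);
    }

    public static void main(String[] args) {
        String[] labels = {"TOC-1", "TOC-2", "Page 1"};
        System.out.println(labelOrPageNumber(labels, 1)); // label exists
        System.out.println(labelOrPageNumber(null, 1));   // no labels at all
    }
}
```

Falling back to the physical page number is only one possible choice; for documents whose printed number lives in the footer text, the fallback would not match what readers see on the page.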

Re: More questions about page iteration

Posted by Tilman Hausherr <TH...@t-online.de>.

On 16.05.2017 at 15:30, David Patterson wrote:
> Tilman,
>
> Thanks. That was what I had come to realize when the PageLabels were null.
>
> Just out of curiosity, how do page labels get created?

I don't know. If somebody really needed to create these, I would
have to research it by looking at the API.

Tilman
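
For background on the question above: in the PDF specification, page labels are stored under the document catalog as a set of ranges, each keyed by the 0-based index of its first page and carrying a numbering style, an optional prefix, and a start value. The sketch below models only that lookup in plain Java and only for decimal numbering; the names `PageLabelRanges`, `addRange`, and `labelFor` are hypothetical and are not PDFBox API. In PDFBox the corresponding model classes are `PDPageLabels` and `PDPageLabelRange`, which is where one would look to write labels into a document.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of PDF page-label ranges (decimal style only). */
final class PageLabelRanges {

    /** One range: a text prefix plus the number assigned to the range's first page. */
    private static final class Range {
        final String prefix;
        final int start;

        Range(String prefix, int start) {
            this.prefix = prefix;
            this.start = start;
        }
    }

    // Key: 0-based index of the first page that the range applies to.
    private final TreeMap<Integer, Range> ranges = new TreeMap<>();

    void addRange(int firstPageIndex, String prefix, int start) {
        ranges.put(firstPageIndex, new Range(prefix, start));
    }

    /** Returns the label for a 0-based page index, or null if no range covers it. */
    String labelFor(int pageIndex) {
        // The applicable range is the one with the greatest key <= pageIndex.
        Map.Entry<Integer, Range> entry = ranges.floorEntry(pageIndex);
        if (entry == null) {
            return null;
        }
        Range range = entry.getValue();
        int number = range.start + (pageIndex - entry.getKey());
        return range.prefix + number;
    }

    public static void main(String[] args) {
        PageLabelRanges labels = new PageLabelRanges();
        labels.addRange(0, "TOC-", 1);  // pages 0-1: TOC-1, TOC-2
        labels.addRange(2, "Page ", 1); // pages 2 onward: Page 1, Page 2, ...
        for (int i = 0; i < 4; i++) {
            System.out.println(i + " -> " + labels.labelFor(i));
        }
    }
}
```

The specification also defines roman-numeral and letter styles, which this sketch leaves out. A document carries such labels only if the producing application wrote them, which is why a footer that merely prints "TOC-1" does not imply that labels exist.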



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: More questions about page iteration

Posted by David Patterson <pa...@gmail.com>.

Tilman,

Thanks. That was what I had come to realize when the PageLabels were null.

Just out of curiosity, how do page labels get created?

Dave Patterson


Re: More questions about page iteration

Posted by Tilman Hausherr <TH...@t-online.de>.

Sadly for you, that one has nothing to do with page labels. It's really 
just a footer on the page. And there is no concept of "footer" in PDF. 
It's just text at the bottom.

Tilman
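
Since the printed number is ordinary page text, one way to recover it is to search each page's extracted text for a line that looks like the footer. The following is a minimal sketch under stated assumptions: the page text has already been extracted one page at a time (for instance with PDFBox's `PDFTextStripper`), the footer sits on its own line, and it follows the "TOC-n" or "Page n" forms mentioned in this thread. `FooterLabelFinder` and `findFooterLabel` are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that looks for a footer-style page number in page text. */
final class FooterLabelFinder {

    // Assumed footer formats, taken from the examples in the thread.
    private static final Pattern FOOTER = Pattern.compile("^(TOC-\\d+|Page \\d+)$");

    /** Scans the lines of one page from the bottom up and returns the first match. */
    static Optional<String> findFooterLabel(String pageText) {
        String[] lines = pageText.split("\\R");
        for (int i = lines.length - 1; i >= 0; i--) {
            Matcher m = FOOTER.matcher(lines[i].trim());
            if (m.matches()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "Table of Contents\n1. Startup procedure\n2. Shutdown procedure\nTOC-1\n";
        System.out.println(findFooterLabel(page).orElse("(no footer found)"));
    }
}
```

Scanning from the last line upward is a heuristic: extracted text is not guaranteed to come out in visual order, so a footer may not be the final line of the extracted string.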


Re: More questions about page iteration

Posted by David Patterson <pa...@gmail.com>.

They show up when I print the PDF or open it to read it. I want to extract
the Table of Contents from each of more than 100 PDFs so I can make a
super-Table of Contents and allow users to search for the document they need
to read. (The file name of the desired contents is not obvious, so with a
consolidated Table of Contents, a more novice user can find the content
they want to read and open the correct document to see the text. These are
Standard Operating Procedures for a 24x7 production facility, and the
operators might need to review what to do in case of a problem.)

I was hoping that in the transition from Word (where the documents are
authored), through saving as a PDF and combining them into Portfolios, some
part of the process would have identified it as a page label, but I guess
that did not happen.

I'm able to find the text of that string since it only occurs in the footer
of the page.

Thanks.

Dave Patterson
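
The consolidated Table of Contents described above could be sketched as a small in-memory index: each entry records a heading, the file it came from, and the printed page number, and a keyword search returns the matching entries. This is only an illustration of the idea, not the approach actually used; the class `ConsolidatedToc`, its methods, and the file names and headings in the example are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical in-memory index of table-of-contents entries gathered from many PDFs. */
final class ConsolidatedToc {

    /** One TOC line: the heading, the file it came from, and the printed page label. */
    static final class Entry {
        final String title;
        final String fileName;
        final String pageLabel;

        Entry(String title, String fileName, String pageLabel) {
            this.title = title;
            this.fileName = fileName;
            this.pageLabel = pageLabel;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String title, String fileName, String pageLabel) {
        entries.add(new Entry(title, fileName, pageLabel));
    }

    /** Case-insensitive substring search over entry titles. */
    List<Entry> search(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (e.title.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ConsolidatedToc toc = new ConsolidatedToc();
        // File names and headings are made up for the example.
        toc.add("Pump failure response", "sop-017.pdf", "Page 4");
        toc.add("Shift handover checklist", "sop-002.pdf", "Page 1");
        for (Entry e : toc.search("pump")) {
            System.out.println(e.title + " -> " + e.fileName + " (" + e.pageLabel + ")");
        }
    }
}
```

Keeping the file name next to each heading addresses the point that the file names themselves are not obvious: an operator searches by topic and is told which document to open.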


Re: More questions about page iteration

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 16.05.2017 um 14:35 schrieb David Patterson:
> Tilman,
>
> The code I tried is:
>
> byte[] bytes = // content of file as a byte array
> PDDocument pdDocument = PDDocument.load( bytes );
> PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
> PDPageLabels pageLabels = cat2.getPageLabels();
> if ( pageLabels == null ) {
> System.out.println( "Page labels missing " );
> }
>
>
> I'm getting "Page labels missing" on each document.

Then let's go back to the beginning. You mentioned "I've got page numbers
like "TOC-1", "TOC-2", "Page 1"". Where did these show up?

Tilman



Re: More questions about page iteration

Posted by David Patterson <pa...@gmail.com>.

Tilman,

The code I tried is:

byte[] bytes = /* content of file as a byte array */;
PDDocument pdDocument = PDDocument.load( bytes );
PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
PDPageLabels pageLabels = cat2.getPageLabels();
if ( pageLabels == null ) {
    System.out.println( "Page labels missing " );
}


I'm getting "Page labels missing" on each document.

I have no idea of, or control over, the process used to convert a Word file
into a PDF. I just inherited a bunch of PDFs that I'm trying to interpret.

Dave Patterson


Re: More questions about page iteration

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 15.05.2017 um 19:11 schrieb David Patterson:
> Alas, after testing with my documents, the PageLabels is null. :-(

But you said it has "TOC-1". That sounds like page labels. You can also
try PDFDebugger; it will show the labels if there are any.

Tilman


Re: More questions about page iteration

Posted by David Patterson <pa...@gmail.com>.

Alas, after testing with my documents, the PageLabels object is null. :-(

Thank you for the help and encouragement.

Dave Patterson


Re: More questions about page iteration

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 15.05.2017 um 18:30 schrieb David Patterson:
> Tilman,
>
> Thank you very much. (I feel bad asking some of the questions, but the data
> is stored in "out of the way" corners that are hard to find.

Don't :-)

>
> Is there any documentation that explains how the linkages work? Would it
> help to have the PDF Standard Document?


Yes. I read it all the time. The PDFBox API closely follows the PDF
specification. Here the page labels are linked from the document catalog, so
the methods used are in the PDDocumentCatalog class. But asking was a good
decision, as it got you that convenience method (which is in PDFDebugger).

Tilman
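
For readers following along with the specification: as far as I understand its page labels section, the catalog's PageLabels entry maps the first page index of each range to a small dictionary holding a numbering style, an optional prefix, and a start number (1 if omitted). The sketch below models only that idea in plain Java with no PDFBox calls. The class and method names are made up, and it handles decimal numbering only, not the roman or letter styles.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of page label ranges, decimal style only. */
final class PageLabelSketch {
    /** One labelling range: the prefix and the number given to its first page. */
    static final class Range {
        final String prefix;
        final int start;

        Range(String prefix, int start) {
            this.prefix = prefix;
            this.start = start;
        }
    }

    /**
     * Expands ranges, keyed by their first 0-based page index, into one label per page.
     * Pages before the first range stay null.
     */
    static String[] expand(TreeMap<Integer, Range> ranges, int pageCount) {
        String[] labels = new String[pageCount];
        for (int page = 0; page < pageCount; page++) {
            // The range that applies is the one with the greatest key <= page.
            Map.Entry<Integer, Range> entry = ranges.floorEntry(page);
            if (entry == null) {
                continue;
            }
            Range range = entry.getValue();
            int offset = page - entry.getKey();
            labels[page] = range.prefix + (range.start + offset);
        }
        return labels;
    }
}
```

With a "TOC-" range starting at page index 0 and a "Page " range starting at index 2, a five-page document gets the labels TOC-1, TOC-2, Page 1, Page 2, Page 3. A document whose catalog has no such entry has no labels at all, however the pages look when printed.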


Re: More questions about page iteration

Posted by David Patterson <pa...@gmail.com>.

Tilman,

Thank you very much. (I feel bad asking some of the questions, but the data
is stored in "out of the way" corners that are hard to find.)

Is there any documentation that explains how the linkages work? Would it
help to have the PDF Standard Document?

Thanks.

Dave Patterson


Re: More questions about page iteration

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 15.05.2017 um 15:20 schrieb David Patterson:
> I've now got my code working to iterate through a PDDocument and process it
> page by page.
>
> Next hurdle: Is there a way to get the page number as printed? I've got
> page numbers like "TOC-1", "TOC-2", "Page 1", ...
>
> How much work is it to get the "TOC-1"?
>
> Thanks.
>
> Dave Patterson
>

     /**
      * Convenience method to get the page label if available.
      *
      * @param document
      * @param pageIndex 0-based page number.
      * @return a page label or null if not available.
      */
     public static String getPageLabel(PDDocument document, int pageIndex)
     {
         PDPageLabels pageLabels;
         try
         {
             pageLabels = document.getDocumentCatalog().getPageLabels();
         }
         catch (IOException ex)
         {
             return ex.getMessage();
         }
         if (pageLabels != null)
         {
             String[] labels = pageLabels.getLabelsByPageIndices();
             if (labels[pageIndex] != null)
             {
                 return labels[pageIndex];
             }
         }
         return null;
     }
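
Two things to watch when reusing the method above: it indexes labels[pageIndex] without a range check, and on an IOException it returns the exception message, which a caller cannot tell apart from a real label. The lookup step on its own, with those guards added, might look like the sketch below. It is plain Java with no PDFBox types, and LabelLookup and labelAt are names invented for this example.

```java
/** Hypothetical guarded version of the array lookup used in getPageLabel. */
final class LabelLookup {
    /**
     * @param labels per-page labels, or null when the document has none
     * @param pageIndex 0-based page number
     * @return the label, or null if not available
     */
    static String labelAt(String[] labels, int pageIndex) {
        if (labels == null || pageIndex < 0 || pageIndex >= labels.length) {
            return null;
        }
        return labels[pageIndex];
    }
}
```

A caller that gets null back can then fall back to something else, such as the 1-based page number.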

