Posted to users@pdfbox.apache.org by no...@gmail.com on 2015/05/08 17:17:55 UTC

Can't resolve page number




Hello,

I’m trying to parse a PDF file that I didn’t create, using PDFBox v1.8.9.

My problem is that when I call getText(doc) to extract a certain section of the PDF, using setStartBookmark(item) and setEndBookmark(item), I get all the text rather than just the text from the specified section.

While trying to resolve this I realized that the writeText(doc, outputStream) method always calls the resetEngine() method, which resets all the parameters and deletes the bookmarks I set.

So my first question is: what is the correct way to get the text from a specified section of the PDF?

As I kept trying to resolve this, I created a new class that extends PDFTextStripper, in which I changed the getText() and writeText() methods (also renaming them) so that they don’t call resetEngine() while keeping the rest of the functionality. (I also had to delete the if (getAddMoreFormatting()) section because the parameters are private; is that a problem?)

Now when I call the method I created, I hit a second problem: while the processPages method tries to determine startBookmarkPageNumber, the getPageNumber method returns -1.

When I dug deeper I saw that in the findDestinationPage method, rawDest is of type PDNamedDestination.

The problem is that namesDict = doc.getDocumentCatalog().getNames() returns null, which means that the names dictionary doesn’t exist. What can be done?

I should point out that in Acrobat the bookmarks all work.


Noam

Re: Can't resolve page number

Posted by John Hewson <jo...@jahewson.com>.

http://pdfbox.apache.org/building.html

— John

> On 11 May 2015, at 12:00, Noam Silver <no...@gmail.com> wrote:
> 
> Hi,
> It seems to be working (I probably have some bugs I need to sort out).
> Thanks!
> Where is the official repository (Google gives multiple results)? In the
> snapshots directory I'll find the 1.8.10 snapshot .jar, right? Is the source
> code also bundled, or do I need to download the whole directory for that?
> Sorry for the nub questions...
> Thanks for all your help!
> 
> Noam

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 11.05.2015 um 21:00 schrieb Noam Silver:
> Hi,
> It seems to be working (I probably have some bugs I need to sort out).
> Thanks!
> Where is the official repository (Google gives multiple results)? In the
> snapshots directory I'll find the 1.8.10 snapshot .jar, right? Is the source
> code also bundled, or do I need to download the whole directory for that?
> Sorry for the nub questions...
> Thanks for all your help!

You can get the snapshots (without source) here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/

If you want to download the sources and build it yourself, use this for the 
unreleased 1.8.10 version:
http://svn.apache.org/repos/asf/pdfbox/branches/1.8

You need Maven to build. The jars will be in "pdfbox 
reactor/app/target". If you want to build with sources, add "source:jar" 
to the Maven goals. (If you have a good IDE, then it supports Maven and 
svn out of the box :-))

If you get the error message "JCE unlimited strength jurisdiction policy 
files are not installed" when building, install the JCE files from
http://www.oracle.com/technetwork/java/javase/downloads/jce-7-download-432124.html
or run Maven with -DskipTests=true.

The advantage is that you can debug in the source code. However, if you 
start to do this, then it is either a misunderstanding of yours, or a 
bug in PDFBox, so don't hesitate to ask.


Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Can't resolve page number

Posted by Noam Silver <no...@gmail.com>.

Hi,
It seems to be working (I probably have some bugs I need to sort out).
Thanks!
Where is the official repository (Google gives multiple results)? In the
snapshots directory I'll find the 1.8.10 snapshot .jar, right? Is the source
code also bundled, or do I need to download the whole directory for that?
Sorry for the nub questions...
Thanks for all your help!

Noam


On Mon, May 11, 2015 at 7:33 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
>
> Please give feedback about whether your problems are solved or not... or
> if you have hit different roadblocks, or if you need the current source for
> debugging.
>
> Btw, that one from yesterday was already the "improved" version, it is now
> also in the official repository and the snapshots directory.
>
> Tilman

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

Please give feedback about whether your problems are solved or not... or 
if you have hit different roadblocks, or if you need the current source 
for debugging.

Btw, that one from yesterday was already the "improved" version, it is 
now also in the official repository and the snapshots directory.

Tilman

Am 11.05.2015 um 06:47 schrieb Noam Silver:
> Thanks! This is great!
>
> Noam

Re: Can't resolve page number

Posted by Noam Silver <no...@gmail.com>.

Thanks! This is great!

Noam

On Mon, May 11, 2015 at 1:32 AM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Get it here:
> http://home.snafu.de/tilman/tmp/pdfbox-app-1.8.10-SNAPSHOT.jar
>
> Tilman

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

Get it here:
http://home.snafu.de/tilman/tmp/pdfbox-app-1.8.10-SNAPSHOT.jar

Tilman

Am 11.05.2015 um 00:20 schrieb Noam Silver:
> It would be great if you could send me the updated .jar. I am currently using
> pdfbox-app-1.8.9.jar.
> I am in a hurry; I need to get this project working!
> Thanks,
> Noam

Re: Can't resolve page number

Posted by Noam Silver <no...@gmail.com>.

It would be great if you could send me the updated .jar. I am currently using
pdfbox-app-1.8.9.jar.
I am in a hurry; I need to get this project working!
Thanks,
Noam

On Sun, May 10, 2015 at 11:50 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Am 10.05.2015 um 21:59 schrieb Noam Silver:
>
>> Ok, this sounds like the problem. So, when you say be patient what do you
>> mean?
>>
>
> Wait 1-2 days.
>
> I have already managed to do a quick solution but this needs some
> polishing. The "polishing" is internal only, i.e. you don't have to change
> anything in your code. It's the name lookup that has an additional segment
> now.
>
> If you're in a hurry, two possibilities:
> - I tell you what to change in the source code so you build it yourself
> - I send you the jar file you need. (If you want that, tell me if you use
> the pdfbox.jar, or the pdfbox-app.jar)
>
> Tilman

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 10.05.2015 um 21:59 schrieb Noam Silver:
> Ok, this sounds like the problem. So, when you say be patient what do you
> mean?

Wait 1-2 days.

I have already managed to do a quick solution but this needs some 
polishing. The "polishing" is internal only, i.e. you don't have to 
change anything in your code. It's the name lookup that has an 
additional segment now.

If you're in a hurry, two possibilities:
- I tell you what to change in the source code so you build it yourself
- I send you the jar file you need. (If you want that, tell me if you 
use the pdfbox.jar, or the pdfbox-app.jar)

Tilman



>
> Thanks so much for your help!
>
> Noam

Re: Can't resolve page number

Posted by Noam Silver <no...@gmail.com>.

Ok, this sounds like the problem. So, when you say be patient what do you
mean?

Thanks so much for your help!

Noam


On Sun, May 10, 2015 at 10:13 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Am 10.05.2015 um 19:58 schrieb Noam Silver:
>
>> The code I was talking about, where namesDict =
>> doc.getDocumentCatalog().getNames() returns null, is part of the PDFBox code:
>> it is in the findDestinationPage method of the PDOutlineItem class, in the
>> "if( rawDest instanceof PDNamedDestination )" section.
>> It seems that there is an anomaly in this specific PDF. I'll try to load the
>> PDF with loadNonSeq(file, null) and see what the difference is.
>>
>
> I looked at the file, and my first impression is that this is a bug in
> PDFBox. In your file, the document catalog has a /Dests entry, but PDFBox
> is looking for a /Names entry, which itself has a /Dests entry. Your
> document is using a PDF 1.1 concept, while PDFBox supports the 1.2 concept
> only, but should of course support both.
>
> Be patient...
>
> Tilman
>
>
>

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 10.05.2015 um 19:58 schrieb Noam Silver:
> The code I was talking about, where namesDict =
> doc.getDocumentCatalog().getNames() returns null, is part of the PDFBox code:
> it is in the findDestinationPage method of the PDOutlineItem class, in the
> "if( rawDest instanceof PDNamedDestination )" section.
> It seems that there is an anomaly in this specific PDF. I'll try to load the
> PDF with loadNonSeq(file, null) and see what the difference is.

I looked at the file, and my first impression is that this is a bug in 
PDFBox. In your file, the document catalog has a /Dests entry, but 
PDFBox is looking for a /Names entry, which itself has a /Dests entry. 
Your document is using a PDF 1.1 concept, while PDFBox supports the 1.2 
concept only, but should of course support both.
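
To make the difference concrete, here is a minimal sketch of a lookup that checks both places. It is a toy model, not PDFBox code: plain maps stand in for the catalog's /Names -> /Dests name tree (PDF 1.2) and the older /Dests dictionary (PDF 1.1), and DestinationLookup, Catalog and findPage are hypothetical names invented for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of named-destination lookup. Plain maps stand in for PDF
 * dictionaries; none of these names are PDFBox API.
 */
final class DestinationLookup {

    /** Hypothetical stand-in for the document catalog. Either map may be null. */
    static final class Catalog {
        final Map<String, Integer> namesDests;  // PDF 1.2 style: /Names -> /Dests
        final Map<String, Integer> legacyDests; // PDF 1.1 style: /Dests in the catalog

        Catalog(Map<String, Integer> namesDests, Map<String, Integer> legacyDests) {
            this.namesDests = namesDests;
            this.legacyDests = legacyDests;
        }
    }

    /** Returns the page index stored for the named destination, or -1 if unresolved. */
    static int findPage(Catalog catalog, String name) {
        // Try the PDF 1.2 location first.
        if (catalog.namesDests != null) {
            Integer page = catalog.namesDests.get(name);
            if (page != null) {
                return page;
            }
        }
        // Fall back to the PDF 1.1 location instead of giving up.
        if (catalog.legacyDests != null) {
            Integer page = catalog.legacyDests.get(name);
            if (page != null) {
                return page;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, Integer> legacy = new HashMap<>();
        legacy.put("chapter2", 14);
        // A document like the one in this thread: no /Names entry, only /Dests.
        Catalog oldStyle = new Catalog(null, legacy);
        System.out.println(findPage(oldStyle, "chapter2"));
        System.out.println(findPage(oldStyle, "missing"));
    }
}
```

A lookup that only consults the first map would return -1 for every bookmark in such a document, which matches the symptom reported here.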

Be patient...

Tilman

Re: Can't resolve page number

Posted by Noam Silver <no...@gmail.com>.

Thanks Tilman for all your great and fast work.
Unfortunately I can't share the pdf publicly, it's copyrighted.
My code for extracting the text is (simplified):

    public static void main(String[] args) throws IOException {
        PDDocument doc = null;
        boolean hasOutputPath = false;

        if (args.length != 1 && args.length != 2) {
            usage();
            System.exit(0);
        }
        if (args.length == 2) {
            hasOutputPath = true;
        }
        try {
            doc = PDDocument.load(args[0]);
            if (doc.isEncrypted())
            {
                StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
                doc.openProtection(sdm);
            }
        }
        catch (IOException e) {
            System.err.println("Error loading PDF file");
            e.printStackTrace();
            System.exit(0);
        }
        catch (BadSecurityHandlerException e) {
            e.printStackTrace();
            System.exit(0);
        }
        catch (CryptographyException e) {
            e.printStackTrace();
            System.exit(0);
        }

        TextParser parser = new TextParser(hasOutputPath? args[1]:
args[0]);//A class of mine to parse the text received

        PDDocumentOutline outlineRoot =
doc.getDocumentCatalog().getDocumentOutline();
        PDOutlineItem parentItem = outlineRoot.getFirstChild();

        String parentTitleName;
        String currentChildTitleName;
        String nextChildTitleName;

        PDFTextStripperExt stripper = new PDFTextStripperExt();
        boolean childrenWereParsed = false;

        while (parentItem != null) {
            parentTitleName = parentItem.getTitle();
            if (Pattern.matches(".*Commands", parentTitleName)) {
                PDOutlineItem item = parentItem.getFirstChild();
                while (item != null) {
                    currentChildTitleName = item.getTitle();
                    stripper.setStartBookmark(item);
                    if ((item = item.getNextSibling()) == null) {
                        nextChildTitleName = (parentItem =
parentItem.getNextSibling()).getTitle();/*need to check null on next parent
item but in this pdf case it won't happen*/
                        stripper.setEndBookmark(parentItem);
                    }
                    else {
                        nextChildTitleName = item.getTitle();
                        stripper.setEndBookmark(item);
                    }
                    parser.parseText(stripper.getTextBySpecification(doc),
currentChildTitleName, nextChildTitleName);
                    docCount++;
                }
                childrenWereParsed = true;
            }
            if (!childrenWereParsed) {
                parentItem = parentItem.getNextSibling();
            }
        }
    }
(there might be some syntax errors since I simplified the code, but this is
the main concept)

The code I was talking about, where namesDict =
doc.getDocumentCatalog().getNames() returns null, is part of the PDFBox
code: it is in the findDestinationPage method of the PDOutlineItem class,
in the if( rawDest instanceof PDNamedDestination ) section.
It seems that there is an anomaly in this specific PDF. I'll try to load the
PDF with loadNonSeq(file,null) and see what the difference is.

Noam



On Sun, May 10, 2015 at 5:37 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> On 08.05.2015 at 17:17, noamsilver@gmail.com wrote:
>
>> I’m trying to parse a pdf file that I haven’t created, I’m using pdfBox
>> v1.8.9.
>>
>> My problem is that when trying to getText(doc) from a certain section of
>> the pdf using setStartBookmark(item) and setEndBookmark(item) I get all the
>> text rather than just the text from the specified section.
>>
>> While trying to resolve this I realized that the writeText(doc,
>> outputStream) method always calls resetEngine() method. That will reset all
>> the parameters and delete the bookmarks I set.
>>
>> So my first question is what is the correct way to get the text from a
>> specified section of the pdf?
>>
>
> I've now hopefully fixed that problem in
> https://issues.apache.org/jira/browse/PDFBOX-2792
> a snapshot version will soon be available here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/
>
>> When I continued to try and resolve this I created a new class that
>> extends PDFTextStripper and I changed the getText() and writeText() methods
>> (also changing their names) so that it won’t call the resetEngine() method
>> while keeping the rest of the functionality (I also had to delete the if
>> (getAddMoreFormatting()) section as the parameters are private, is that a
>> problem?).
>>
>> Now when I call the method I created I have a second problem, while it
>> tries to determine the startBookmarkPageNumber in processPages method
>> getPageNumber method returns -1.
>>
>> When I dug deeper I saw that in findDestinationPage method the rawDest is
>> of type PDNamedDestination.
>>
>> The problem is that when trying to get namesDict =
>> doc.getDocumentCatalog().getNames() it returns null. That means that the
>> names dictionary doesn’t exist. What can be done?
>>
>> Just need to point out that in Acrobat the bookmarks all work.
>>
>
> I tested this on a document with names, and I didn't have that effect with
> 1.8.9, so whatever the problem is, it isn't a general problem, so I need
> the file.
>
> One thing to try is to load the document with loadNonSeq(file,null)
> instead of load().
>
> Tilman

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.05.2015 at 17:17, noamsilver@gmail.com wrote:
> I’m trying to parse a pdf file that I haven’t created, I’m using pdfBox v1.8.9.
>
> My problem is that when trying to getText(doc) from a certain section of the pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text rather than just the text from the specified section.
>
> While trying to resolve this I realized that the writeText(doc, outputStream) method always calls resetEngine() method. That will reset all the parameters and delete the bookmarks I set.
>
> So my first question is what is the correct way to get the text from a specified section of the pdf?

I've now hopefully fixed that problem; see
https://issues.apache.org/jira/browse/PDFBOX-2792
A snapshot version will soon be available here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/

> When I continued to try and resolve this I created a new class that extends PDFTextStripper and I changed the getText() and writeText() methods (also changing their names) so that it won’t call the resetEngine() method while keeping the rest of the functionality (I also had to delete the if (getAddMoreFormatting()) section as the parameters are private, is that a problem?).
>
> Now when I call the method I created I have a second problem, while it tries to determine the startBookmarkPageNumber in processPages method getPageNumber method returns -1.
>
> When I dug deeper I saw that in findDestinationPage method the rawDest is of type PDNamedDestination.
>
> The problem is that when trying to get namesDict = doc.getDocumentCatalog().getNames() it returns null. That means that the names dictionary doesn’t exist. What can be done?
>
> Just need to point out that in Acrobat the bookmarks all work.

I tested this on a document with names, and I didn't have that effect 
with 1.8.9, so whatever the problem is, it isn't a general problem, so I 
need the file.

One thing to try is to load the document with loadNonSeq(file,null) 
instead of load().

Tilman






---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

On 09.05.2015 at 23:50, Tilman Hausherr wrote:
>>
>> Hello,
>>
>> I’m trying to parse a pdf file that I haven’t created, I’m using 
>> pdfBox v1.8.9.
>>
>> My problem is that when trying to getText(doc) from a certain section
>> of the pdf using setStartBookmark(item) and setEndBookmark(item) I
>> get all the text rather than just the text from the specified section.
>>
>> While trying to resolve this I realized that the writeText(doc,
>> outputStream) method always calls resetEngine() method. That will
>> reset all the parameters and delete the bookmarks I set.
>
> That seems like a bug to me :-( 

The two lines that reset the bookmarks were added to resetEngine in
PDFBOX-1808, in rev 1553175:
https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java?r1=1553175&r2=1553174&pathrev=1553175
That change was meant to save some memory. (Andreas)
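To illustrate the effect, here is a small toy model (hypothetical class and method names, not the real PDFTextStripper): an extraction call that starts by resetting the engine throws away the bookmarks set by the caller if the reset also clears them. Keeping caller configuration out of the reset is one possible fix, shown here only as an assumption about how this could be handled.

```java
/**
 * Toy model of a text stripper whose extraction call begins with a reset.
 * This is not PDFTextStripper; it only mimics the behaviour discussed above.
 */
final class StripperResetSketch {
    private String startBookmark;
    private String endBookmark;
    private final StringBuilder buffer = new StringBuilder();
    private final boolean resetClearsBookmarks;

    StripperResetSketch(boolean resetClearsBookmarks) {
        this.resetClearsBookmarks = resetClearsBookmarks;
    }

    void setStartBookmark(String bookmark) {
        this.startBookmark = bookmark;
    }

    void setEndBookmark(String bookmark) {
        this.endBookmark = bookmark;
    }

    private void resetEngine() {
        // Per-run working state: safe to clear before every extraction.
        buffer.setLength(0);
        if (resetClearsBookmarks) {
            // Caller configuration: clearing it here silently drops the range.
            startBookmark = null;
            endBookmark = null;
        }
    }

    String getText() {
        resetEngine();
        if (startBookmark == null || endBookmark == null) {
            buffer.append("ALL TEXT");
        } else {
            buffer.append("TEXT FROM ").append(startBookmark)
                  .append(" TO ").append(endBookmark);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        StripperResetSketch stripper = new StripperResetSketch(true);
        stripper.setStartBookmark("A");
        stripper.setEndBookmark("B");
        System.out.println(stripper.getText());
    }
}
```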

I also found another weird piece of code:

         if (startPage != null && endPage != null &&
             startBookmark.getCOSObject() == endBookmark.getCOSObject())
         {
             // this is a special case where both the start and end bookmark
             // are the same but point to nothing.  In this case
             // we will not extract any text.
             startBookmarkPageNumber = 0;
             endBookmarkPageNumber = 0;
         }

(should probably be startPage == null && endPage == null && ....)

Earlier, that segment was:

        if( startBookmarkPageNumber == -1 && startBookmark != null &&
                 endBookmarkPageNumber == -1 && endBookmark != null &&
                 startBookmark.getCOSObject() == endBookmark.getCOSObject() )
         {
             //this is a special case where both the start and end bookmark
             //are the same but point to nothing.  In this case
             //we will not extract any text.
             startBookmarkPageNumber = 0;
             endBookmarkPageNumber = 0;
         }

which makes more sense. The change was made last year in rev 1634252 as 
part of the pagetree refactoring. (John)
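The difference between the two conditions can be checked with a small stand-alone sketch. Plain objects stand in for pages and bookmark COS objects, the names are hypothetical, and the explicit null guard on the bookmark in the second method is an assumption borrowed from the earlier version of the code.

```java
/**
 * Toy comparison of the two "same bookmark, points to nothing" checks.
 * Plain Objects stand in for PDPage and COS objects; not PDFBox code.
 */
final class BookmarkSpecialCaseSketch {

    /** The condition as quoted from the current code. */
    static boolean currentCheck(Object startPage, Object endPage,
                                Object startBookmark, Object endBookmark) {
        return startPage != null && endPage != null
                && startBookmark == endBookmark;
    }

    /** The condition with the suggested "== null" comparisons. */
    static boolean suggestedCheck(Object startPage, Object endPage,
                                  Object startBookmark, Object endBookmark) {
        return startPage == null && endPage == null
                && startBookmark != null   // assumption: guard as in the earlier version
                && startBookmark == endBookmark;
    }

    public static void main(String[] args) {
        Object bookmark = new Object();
        Object page = new Object();

        // Same bookmark used for start and end, and it resolves to a page.
        System.out.println(currentCheck(page, page, bookmark, bookmark));
        System.out.println(suggestedCheck(page, page, bookmark, bookmark));

        // Same bookmark used for start and end, and it points to nothing.
        System.out.println(currentCheck(null, null, bookmark, bookmark));
        System.out.println(suggestedCheck(null, null, bookmark, bookmark));
    }
}
```

In this model the quoted check fires when the shared bookmark does resolve to a page and misses the case where it points to nothing, which is the opposite of what the code comment describes.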


Re: Can't resolve page number

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.05.2015 at 17:17, noamsilver@gmail.com wrote:
>
>
>
> Hello,
>
> I’m trying to parse a pdf file that I haven’t created, I’m using pdfBox v1.8.9.
>
> My problem is that when trying to getText(doc) from a certain section of the pdf using setStartBookmark(item) and setEndBookmark(item) I get all the text rather than just the text from the specified section.
>
> While trying to resolve this I realized that the writeText(doc, outputStream) method always calls resetEngine() method. That will reset all the parameters and delete the bookmarks I set.

That seems like a bug to me :-(

>
> So my first question is what is the correct way to get the text from a specified section of the pdf?

To get it to work, I suggest you get the page number from the 
bookmarks... oops, that is what you tried:

>
> When I continued to try and resolve this I created a new class that extends PDFTextStripper and I changed the getText() and writeText() methods (also changing their names) so that it won’t call the resetEngine() method while keeping the rest of the functionality (I also had to delete the if (getAddMoreFormatting()) section as the parameters are private, is that a problem?).
>
> Now when I call the method I created I have a second problem, while it tries to determine the startBookmarkPageNumber in processPages method getPageNumber method returns -1.
>
> When I dug deeper I saw that in findDestinationPage method the rawDest is of type PDNamedDestination.
>
> The problem is that when trying to get namesDict = doc.getDocumentCatalog().getNames() it returns null. That means that the names dictionary doesn’t exist. What can be done?

Could you upload the document to a public place? I'll research what is
going on. Some code would be nice too, i.e. what you have tried so far to
(not) get the page number.

Tilman


>
> Just need to point out that in Acrobat the bookmarks all work.
>
>
> Noam

