Posted to users@pdfbox.apache.org by "Hesham G." <he...@gmail.com> on 2011/10/26 14:25:36 UTC

Detecting the footnotes in a PDF

Hello,

Is there a way to detect the footnotes section in a PDF file?
Here is a sample two-page PDF with footnotes:
http://www.4shared.com/document/Q03u9SMc/pdf_with_footnotes.html


Best regards,
Hesham

Re: Detecting the footnotes in a PDF

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Please note that I've never used text extraction from PDFBox myself, so
these are just a few ideas for a possible direction:

If you look at the PDFText2HTML class (take it as an example), you can
see that PDFTextStripper seems to be easy to subclass for custom
processing. For example, you can override
writeParagraphStart()/writeParagraphEnd() to be notified when a
paragraph starts and ends. Then you can intercept the text sent to the
writeString() method (accumulate the strings in a StringBuilder until
you have the full paragraph). Once you have a paragraph, you can use
Java's regex support to match a pattern. Something like:

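// Needs java.util.regex.Pattern. find() is used because Pattern.matches()
// only succeeds when the whole paragraph matches the pattern.
if (Pattern.compile("^\\(\\d+\\)\\s").matcher(paragraphText).find()) {
    //I have (possibly) found a footnote
}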

This would match "(77) " and "(1) " and so on. The "^" makes sure that
the pattern is only matched at the beginning of the string.
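
To also get at the footnote number, the same idea can be extended with
capture groups. What follows is only a minimal sketch in plain Java, not
PDFBox code: FootnoteCollector and its methods paragraphStart(), text()
and paragraphEnd() are hypothetical names made up for illustration. The
assumption is that a PDFTextStripper subclass would call them from its
writeParagraphStart(), writeString() and writeParagraphEnd() overrides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox. In a PDFTextStripper subclass its
// three methods would be called from the writeParagraphStart(), writeString()
// and writeParagraphEnd() overrides.
class FootnoteCollector {
    // "(<number>) " at the start of a paragraph; group 1 is the number,
    // group 2 is the rest of the paragraph. DOTALL lets "." span line breaks.
    private static final Pattern FOOTNOTE =
            Pattern.compile("^\\((\\d{1,9})\\)\\s+(.*)$", Pattern.DOTALL);

    private final StringBuilder paragraph = new StringBuilder();
    private final Map<Integer, String> footnotes = new LinkedHashMap<>();

    void paragraphStart() {
        paragraph.setLength(0); // start a fresh paragraph
    }

    void text(String chunk) {
        paragraph.append(chunk); // accumulate until the paragraph ends
    }

    void paragraphEnd() {
        Matcher m = FOOTNOTE.matcher(paragraph.toString().trim());
        if (m.matches()) {
            // Only a candidate: a numbered list item would match as well.
            footnotes.put(Integer.parseInt(m.group(1)), m.group(2));
        }
    }

    Map<Integer, String> footnotes() {
        return footnotes;
    }

    public static void main(String[] args) {
        FootnoteCollector c = new FootnoteCollector();
        c.paragraphStart();
        c.text("(77) Cited from ");
        c.text("the annual report.");
        c.paragraphEnd();
        System.out.println(c.footnotes());
    }
}
```

The map keeps the footnotes in the order they were found, keyed by their
number. Since the PDF is not tagged, every match is still just a guess.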

The rest is up to you. Good luck.

On 26.10.2011 21:46:53 Hesham G. wrote:
> Jeremias,
> 
> Thanks a lot. That might be helpful, especially since I also want to detect the number of the footnote. But how can I express the pattern "(<number>) " in PDFBox terms?

Jeremias Maerki

Re: Detecting the footnotes in a PDF

Posted by "Hesham G." <he...@gmail.com>.

Jeremias,

Thanks a lot. That might be helpful, especially since I also want to detect the number of the footnote. But how can I express the pattern "(<number>) " in PDFBox terms?


Best regards,
Hesham


Re: Detecting the footnotes in a PDF

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Not reliably, no, because the PDF is not tagged. Combined with text
extraction, you might be able to come up with some heuristics to identify
footnotes, such as looking for a pattern "(<number>) " at the beginning
of a paragraph. HTH


On 26.10.2011 19:52:46 Hesham G. wrote:
> Maybe my question was not clear enough. I meant: is there a way to know that the part currently being extracted from the PDF page is the footnote section?

Jeremias Maerki

Re: Detecting the footnotes in a PDF

Posted by "Hesham G." <he...@gmail.com>.

Maybe my question was not clear enough. I meant: is there a way to know that the part currently being extracted from the PDF page is the footnote section?


Best regards,
Hesham



Re: Detecting the footnotes in a PDF

Posted by Michael McCandless <lu...@mikemccandless.com>.

I see PDFBox (current trunk) extracting the footnote text correctly
from this PDF.  (I just ran the org.apache.pdfbox.ExtractText tool).

Mike McCandless

http://blog.mikemccandless.com

On Wed, Oct 26, 2011 at 8:25 AM, Hesham G. <he...@gmail.com> wrote:
> Hello,
>
> Is there a way to detect the footnotes section in a PDF file?
> Here is a sample two-page PDF with footnotes:
> http://www.4shared.com/document/Q03u9SMc/pdf_with_footnotes.html
>
>
> Best regards,
> Hesham
>