Posted to dev@pdfbox.apache.org by Jason Harrop <jh...@gmail.com> on 2009/09/06 13:44:12 UTC

Paragraph boundary segmentation

Hi

I've been playing a little with adding paragraph markers in
PDFTextStripper.  I'm using a crude algorithm which estimates normal
line spacing, and inserts a paragraph marker when a greater spacing is
detected.

How best to do this?

A first question is how important it is to avoid iterating over the
TextPosition objects in a page a second time?  To throw a possible
heuristic out there:  Maybe normal line spacing could be judged from
the second five or so lines (not the first five, since a page is
likely to start with a heading).  Hopefully this will be good, but if
it looks wrong as the rest of the page is processed, do the whole page
again using what we learnt on the first pass.  Or do people have other
ideas on how they want to implement this?  And obviously there is more
to it than just line spacing...
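
The heuristic above can be sketched roughly as follows. This is only an illustration, not PDFBox code: it assumes each line has already been reduced to a baseline y coordinate that grows down the page, and the class name, method names, and the 1.5 tolerance are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these names exist in PDFBox.
class LineSpacingSketch {

    /** Median baseline gap over `sample` gaps, after skipping the first `skip` lines. */
    static float estimateNormalSpacing(float[] baselines, int skip, int sample) {
        // Fall back to the top of the page when it is too short to skip anything.
        int from = baselines.length > skip + 1 ? skip + 1 : 1;
        List<Float> gaps = new ArrayList<>();
        for (int i = from; i < baselines.length && gaps.size() < sample; i++) {
            gaps.add(baselines[i] - baselines[i - 1]);
        }
        if (gaps.isEmpty()) {
            return Float.POSITIVE_INFINITY; // zero or one line: nothing to measure
        }
        Collections.sort(gaps);
        return gaps.get(gaps.size() / 2);
    }

    /** Indices of lines that start a paragraph: the first line, plus any line after a large gap. */
    static List<Integer> paragraphStarts(float[] baselines, float normal, float tolerance) {
        List<Integer> starts = new ArrayList<>();
        if (baselines.length > 0) {
            starts.add(0);
        }
        for (int i = 1; i < baselines.length; i++) {
            if (baselines[i] - baselines[i - 1] > normal * tolerance) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // A heading, a large gap, six body lines, a paragraph gap, three more lines.
        float[] ys = {100f, 130f, 142f, 154f, 166f, 178f, 190f, 214f, 226f, 238f};
        float normal = estimateNormalSpacing(ys, 5, 5);
        System.out.println("normal spacing: " + normal);
        System.out.println("paragraph starts: " + paragraphStarts(ys, normal, 1.5f));
    }
}
```

If the estimate turns out to be wrong later in the page, the second method can simply be rerun with a corrected value, which is the "do the whole page again" fallback without re-reading the TextPosition objects.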

Second, for HTML output, we want the paragraph marker to become a <p>
tag enclosing the paragraph.  PDFText2HTML start|endArticle wraps each
page in a <div>, and so a paragraph which crosses a page boundary is
going to raise issues.  Would it not be more natural from an HTML
perspective to mark the page segmentation with a point tag rather than
enclosing each page in a <div>?
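
To make the "point tag" idea concrete, here is a rough sketch. It does not use PDFText2HTML at all: the Segment type, the method names, and the choice of an empty <a id="page-N"></a> anchor as the page marker are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: paragraphs are emitted as <p> elements, and a page
// change is recorded with an empty anchor instead of a wrapping <div>.
class PageMarkerSketch {

    /** A run of paragraph text that sits entirely on one page. */
    static final class Segment {
        final int page;
        final String text;

        Segment(int page, String text) {
            this.page = page;
            this.text = text;
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toHtml(List<List<Segment>> paragraphs) {
        StringBuilder out = new StringBuilder();
        int currentPage = -1;
        for (List<Segment> paragraph : paragraphs) {
            out.append("<p>");
            boolean first = true;
            for (Segment s : paragraph) {
                if (!first) {
                    out.append(' ');
                }
                if (s.page != currentPage) {
                    // Point marker: legal inside <p>, so the paragraph stays whole.
                    out.append("<a id=\"page-").append(s.page).append("\"></a>");
                    currentPage = s.page;
                }
                out.append(escape(s.text));
                first = false;
            }
            out.append("</p>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<Segment>> doc = Arrays.asList(
                Arrays.asList(new Segment(1, "Intro")),
                Arrays.asList(new Segment(1, "Starts here"), new Segment(2, "ends there")));
        System.out.print(toHtml(doc));
    }
}
```

The second paragraph in the example crosses from page 1 to page 2 and still comes out as a single <p> element, which a per-page <div> wrapper cannot express.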

Unfortunately I can't devote much time if any to this right now, but I
thought I'd share fwiw.

cheers

Jason

Re: Paragraph boundary segmentation

Posted by Ted Dunning <te...@gmail.com>.

That sounds like a great heuristic to me.  I don't think that it would be
all that bad to iterate through the page structure and create your own
representations of lines as you pass over the page once.

Then, if you look at the histogram of line spacings, I think you can find an
optimal break-point very easily.  Font, size and color may be important cues
as well for detecting headers and footers.

-- 
Ted Dunning, CTO
DeepDyve

RE: Paragraph boundary segmentation

Posted by "Martinez, Mel" <m....@ll.mit.edu>.

Jason,

There are two attributes on the PDFTextStripper2 class (see https://issues.apache.org/jira/browse/PDFBOX-521) that control most of the paragraph separation detection:

setDropThreshold(float): controls the vertical threshold of whitespace allowed between lines, beyond which a new paragraph is asserted.  Specified in multiples of the current text height.  Default is 2.5f.  I.e., if the current line has dropped more than 2.5 times the current line height from the prior line, the stripper assumes this line starts a new paragraph.

setIndentThreshold(float): controls the depth of indent used to detect indented paragraph starts.  This is in multiples of the space character width for the current font.  The default is 2.0f.

If you are having trouble with double-spaced lines being detected as separate paragraphs, then you probably want to increase the drop threshold.
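
As a rough illustration of how the two thresholds behave (this is not the PDFTextStripper2 source; the class, method names, and the way drop and indent are measured are assumptions based only on the description above):

```java
// Hypothetical sketch of the two threshold tests described above.
class ThresholdSketch {

    static final float DEFAULT_DROP_THRESHOLD = 2.5f;   // multiples of text height
    static final float DEFAULT_INDENT_THRESHOLD = 2.0f; // multiples of space width

    /** True when the vertical drop from the prior line exceeds the scaled text height. */
    static boolean isDropBreak(float drop, float textHeight, float dropThreshold) {
        return drop > dropThreshold * textHeight;
    }

    /** True when the line starts further right than the scaled space width allows. */
    static boolean isIndentBreak(float indent, float spaceWidth, float indentThreshold) {
        return indent > indentThreshold * spaceWidth;
    }

    public static void main(String[] args) {
        float textHeight = 10f;
        float drop = 30f; // assumed drop between two lines of a widely spaced document

        // With the default, every such line looks like a new paragraph...
        System.out.println(isDropBreak(drop, textHeight, DEFAULT_DROP_THRESHOLD));
        // ...while a larger threshold treats the same drop as ordinary spacing.
        System.out.println(isDropBreak(drop, textHeight, 3.5f));
        // An indent of 12 units with a 5-unit space exceeds the default 2.0 multiple.
        System.out.println(isIndentBreak(12f, 5f, DEFAULT_INDENT_THRESHOLD));
    }
}
```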

There is currently no option through the ExtractText tool to feed those settings because they do not exist for the original PDFTextStripper class.

You could easily subclass the PDFTextStripper2 class and just set these to different values in the subclass.  You would then use the subclass instead from within ExtractText.

I will code up a rewrite that allows these to be set via -D variables so they could be set as command-line options.

-Mel

Re: Paragraph boundary segmentation

Posted by Philipp Koch <ph...@day.com>.

> HOW TO UNSUBSCRIBE?!
see http://incubator.apache.org/pdfbox/mailing-list.html#dev

regards,
philipp

RE: Paragraph boundary segmentation

Posted by Francisco Garrido <fg...@pedagogiainteractiva.com>.

HOW TO UNSUBSCRIBE?!

Francesc Garrido
Àrea Tecnologia
Pedagogia Interactiva, S.L.
 
C/Marie Curie s/n 
Parc Tecnològic BCNord
08042 Barcelona
T: +34 93 253 91 94 ; F: +34 93 291 76 91 
www.pedagogiainteractiva.com
 
Re: Paragraph boundary segmentation

Posted by Jason Harrop <jh...@gmail.com>.

Hi Mel

I tried your latest JIRA code on 2 documents (without tweaking any settings).

It did a nice job on the first document (although once we have
paragraphs recognised properly, I guess the next thing we will want is
header/footer recognition ;-) ).

The second document, which had line spacing set to double, resulted in
a paragraph per line.  I didn't adjust any manual settings, since I'd
like it to work without the user setting any parameters.

thanks

Jason

RE: Paragraph boundary segmentation

Posted by "Martinez, Mel" <m....@ll.mit.edu>.

Sounds good, I will do that.

Re: Paragraph boundary segmentation

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi Mel

Martinez, Mel wrote:
> SNIP
> I am new to the PDFBox project and just signed onto the dev list so I don't yet know all your procedures for submitting code for consideration (I was once a committer on Tomcat 3, but that was a looong time ago).  
> 
> Although it is small, I don't want to just attach it to the dev list.  Is there a committer I could send it to?  
We really appreciate your offer to share your code with the PDFBox
project, and obviously there are others with the same needs as you.
The easiest way to submit your code should be to create an improvement
issue on JIRA [1] and attach your source to it.


Thanks in advance
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX

RE: Paragraph boundary segmentation

Posted by "Martinez, Mel" <m....@ll.mit.edu>.

Hi Jason,

My group has a similar need to have better demarcation of whitespace in
the extracted text.  So I spent a couple of days last week on this problem.

I have created a subclass of PDFTextStripper that seems to do a decent job of properly detecting paragraph breaks, both through drop (vertical) and indent.  It also detects items using a hanging indent (like bullet and numbered items) and delimits those as paragraphs.  It includes methods for inserting delimiters for paragraph starts and stops as well as page starts and page stops.  Basically, it is instrumented with a little more granularity than the default PDFTextStripper class.

I use a simpler heuristic of only looking back at the previous line start rather than the last "five or so".  This does not require any additional scans of the document beyond what PDFTextStripper does right now.  It does apply a bit more logic on a given scan, depending on the conditions it finds (hanging indents require more thought than, say, a straight vertical whitespace drop), so it is not quite as fast as the base class, but not too different, from my initial testing.

I've tried to make it pretty configurable.  You can set the vertical drop and horizontal indent thresholds and (through subclassing) change the regular expressions used to test for list item starts.  And I've broken the separators out into distinct start and end delimiters instead of just a single separator.
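
For readers wondering what a list-item test of this kind might look like, here is a minimal sketch. The pattern below is purely illustrative and is not the expression used in PDFTextStripper2; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a regular expression test for list item starts.
class ListItemSketch {

    // A bullet character, a number followed by '.' or ')', or a short
    // parenthesised label, each followed by whitespace and some text.
    static final Pattern LIST_ITEM = Pattern.compile(
            "\\s*(?:[\\u2022\\u25E6*-]|\\d+[.)]|\\([A-Za-z0-9]+\\))\\s+\\S");

    static boolean isListItemStart(String line) {
        // lookingAt() anchors the match at the start of the line only.
        return LIST_ITEM.matcher(line).lookingAt();
    }

    public static void main(String[] args) {
        String[] samples = {"\u2022 first bullet", "1. Numbered", "(a) lettered",
                            "Plain text", "1.5 is a number"};
        for (String s : samples) {
            System.out.println(isListItemStart(s) + "  " + s);
        }
    }
}
```

A subclass could swap in a different pattern to cover other numbering styles, at the cost of more false positives on ordinary sentences.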

I have in turn subclassed this to get an improved PDFText2HTML stripper that tags chunks within a div with paragraph start ("<p>") and paragraph stop ("</p>") tags, instead of simply terminating every line with a line-break.  It also inserts form-feed attributes in the div used on page-breaks.  Basically, it takes advantage of the improved instrumentation to do a more logical tagging.

I've tested this against a fairly complex journal article (two-column, so don't use '-sort') that has multiple types of lists, including a bibliography; the PDF Spec reference document itself; and a few miscellaneous PDFs of varying complexity.  Extracting the text from the PDF Spec (32MB pdf doc) takes quite a bit of memory - I had to allocate 256 MB to the VM to get that to work in a timely manner - but that's true of the default PDFTextStripper class as well.

I would be willing to submit these classes for consideration for inclusion.  I implemented them as subclasses (named PDFTextStripper2 and PDFText2HTML2, respectively) rather than as modifications to the base versions in PDFBox, since I am currently going to be stuck using older builds of PDFBox anyway.  However, the changes could easily be merged up into the parent classes, if that is desired.  In addition to the two classes described, I did need to create one utility wrapper class in order to attach a couple of flags to the TextPosition.  That functionality could in theory be absorbed into the TextPosition class, but one could also argue that it shouldn't be.

The total source code for the PDFTextStripper2.java subclass is 26.8K, while the PDFText2HTML2.java class is 6.07K and the utility PositionWrapper.java is 2.01K, for a total of just ~35K of code.

I am new to the PDFBox project and just signed onto the dev list so I don't yet know all your procedures for submitting code for consideration (I was once a committer on Tomcat 3, but that was a looong time ago).  

Although it is small, I don't want to just attach it to the dev list.  Is there a committer I could send it to?  

Jason - would you like a copy to see if it meets your needs?  You can test it by simply modifying (or rewriting) the org.apache.pdfbox.ExtractText class to use these instead of the default.

Let me know and I'll send it to you.

Cheers,

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu

