Posted to users@pdfbox.apache.org by Ted Dunning <td...@apache.org> on 2011/03/30 19:04:50 UTC

Fwd: Text Extraction with multi-column documents in PDFBox

---------- Forwarded message ----------
From: Ted Dunning <td...@apache.org>
Date: Wed, Mar 30, 2011 at 10:04 AM
Subject: Re: Text Extraction with multi-column documents in PDFBox
To: Jeremy Barkan <ba...@alum.mit.edu>


I haven't looked at that lately so I may be a bit wrong on details, but if
you look at the sample article that I posted, you can see how simply
following any heuristic for generating the flow based on position alone will
not work.  The text inset on the first page, for instance, will get the
columns all confused.  The current heuristics are probably fine for finding
individual lines, but not for splitting lines into columns and then
threading those lines into correct flows and marking those flows as text or
decoration.  Moreover, there are important cues given by font and size that
need to be used.  One such cue is whether the text is in the majority font.
 This alone is enough to separate about 90% of the main flow of the document
from other parts of the document (for the journals I examined).   Most of
the remaining 10% can be had from considering geometrical cues in the
context of that initial assignment, but without the original assignment
based on fonts, the geometry isn't really strong enough.
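
As a rough illustration of the majority-font cue, here is a minimal Java
sketch.  It does not use PDFBox: TextRun and MajorityFont are hypothetical
names, and treating "majority" as the font name and size covering the most
characters is an assumption, not necessarily how the original code worked.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an extracted text fragment; not a PDFBox class.
final class TextRun {
    final String text;
    final String fontName;
    final float fontSize;

    TextRun(String text, String fontName, float fontSize) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
    }

    // Font name and size together identify a style, e.g. "Times@10.0".
    String styleKey() {
        return fontName + "@" + fontSize;
    }
}

final class MajorityFont {
    // Returns the style covering the most characters, or null for empty input.
    static String majorityStyle(List<TextRun> runs) {
        Map<String, Integer> charCounts = new HashMap<>();
        for (TextRun run : runs) {
            charCounts.merge(run.styleKey(), run.text.length(), Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : charCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // First-pass assignment: keep only the runs set in the majority style.
    static List<TextRun> bodyCandidates(List<TextRun> runs) {
        String majority = majorityStyle(runs);
        List<TextRun> body = new ArrayList<>();
        for (TextRun run : runs) {
            if (run.styleKey().equals(majority)) {
                body.add(run);
            }
        }
        return body;
    }
}
```

Everything that falls outside bodyCandidates (running heads, captions, page
numbers, insets) would then be what the geometrical cues have to sort out.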

I think that there is more to be done with what I started in that you can
look at how things came out from the first pass and use statistics
describing positions on the page and font/size/position transitions within a
single text type to refine the statistical model of the document.  That
would allow the flow to be recalculated, hopefully handling a few corner
cases more accurately.

My original goal was to simply remove the boiler-plate from the document and
leave a residue that would allow a high quality retrieval index to be
created.  The final results were nearly good enough to present as a
simplified, text-only surrogate for the document, but not quite.  They were
certainly quite readable, but not very pretty.


On Wed, Mar 30, 2011 at 9:54 AM, Jeremy Barkan <ba...@alum.mit.edu> wrote:

> How is what you describe similar to or different from the charactersByArticle
> method of PDFTextStripper?
>
>
>
> Thanks so much for your help
>
>
>
> Best Regards
>
>
>
> Jeremy
>
> *From:* Ted Dunning [mailto:tdunning@apache.org]
> *Sent:* 30 March 2011 17:55
> *To:* Jeremy Barkan
> *Subject:* Re: Text Extraction with multi-column documents in PDFBox
>
>
>
> Neither.
>
>
>
> Never.
>
>
>
> It would be very helpful to have it, though.
>
> On Wed, Mar 30, 2011 at 8:52 AM, Jeremy Barkan <ba...@alum.mit.edu>
> wrote:
>
> Thanks for getting back to me – I was looking into this kind of algorithm.
>
> Was this merged into PDFBox 1.4 or 1.5 ?
>
> I'm trying to decide whether to implement this on my own on top of PDFBox or to
> use what PDFBox would have already implemented
>
>
>

RE: Text Extraction with multi-column documents in PDFBox

Posted by "Martinez, Mel - 1004 - MITLL" <m....@ll.mit.edu>.
Well ultimately this is going to be difficult because PDF is not a logical data store.  It is a rendering state engine.   SOME times the data objects in it are fortuitously arranged to fit a desired logical structure, but there is no guarantee of that.

 

If you have some foreknowledge about the structure of a given corpus of documents, you may be able to write some custom code that figures things out, but otherwise, PDF in general is simply not designed for that purpose.

 

In the documents I’ve been extracting, hyphen-breaks at end of line seem to be preserved and it seems like it would be straightforward to detect those to do reconstruction with.   However, the devil is in the details and your documents may not be as cooperative.
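
A minimal Java sketch of that kind of reconstruction, assuming the lines have already been extracted in reading order.  Dehyphenator is a hypothetical name, and the rule used here (every end-of-line hyphen after a letter is a soft break) is a simplifying assumption.

```java
import java.util.List;

final class Dehyphenator {
    // Joins extracted lines into one string, merging words that were split
    // by a trailing hyphen. Assumes every hyphen at the end of a line is a
    // soft break, so a real compound broken at its hyphen is merged too.
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        boolean needSpace = false; // true when the previous line ended on a whole word
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (needSpace) {
                out.append(' ');
            }
            boolean hyphenBreak = line.length() > 1
                    && line.endsWith("-")
                    && Character.isLetter(line.charAt(line.length() - 2));
            if (hyphenBreak) {
                // Drop the hyphen; the next line continues the same word.
                out.append(line, 0, line.length() - 1);
                needSpace = false;
            } else {
                out.append(line);
                needSpace = true;
            }
        }
        return out.toString();
    }
}
```

Telling a soft hyphen from a real one would need a dictionary or corpus statistics, which is part of why the details get difficult.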

 

Good luck!

 

From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, March 31, 2011 1:28 PM
To: dev@pdfbox.apache.org
Cc: Martinez, Mel - 1004 - MITLL; pdfbox-dev@incubator.apache.org; pdfbox-users@incubator.apache.org
Subject: Re: Text Extraction with multi-column documents in PDFBox

 

Yes.  This use of the native flow works about 50-80% of the time in my experience.  But it was way too error-prone to depend on and failed spectacularly for many critical data sources.  Even where it worked, the results were often not good enough.  For one thing, I needed real text flow so that I could reliably reverse engineer hyphenation (for text indexing).  I also needed to reliably remove headers, footers, page numbers, article titles and similar boilerplate across thousands of document sources without hand engineering each kind of document.

On Thu, Mar 31, 2011 at 8:58 AM, Martinez, Mel - 1004 - MITLL <m....@ll.mit.edu> wrote:

Ted,

A lot depends on how the PDF file was generated, but in general, so long as you leave the 'sort by position' attribute of PDFBox's PDFTextStripper as 'false' (the default), the text extraction will be (mostly) logical and not positional.

       PDFTextStripper myStripper = ...
       myStripper.setSortByPosition(false);  //not actually necessary since false is the default.

That is, if you have text in two columns on a page, the lines will be extracted by article and not cross columns.

 

Sort of.  As I mentioned, the quality across a bunch of data sources was just not good enough to even contemplate deployment.  Moreover, there was no way forward to improve the situation.

 

SOME PDFs can be (and unfortunately are) generated such that the text objects are not logically arranged by article and the extraction still messes up.  But in my experience on most documents it does a pretty good job, especially those generated from word processors.

 

I was working against documents from publishers.  My results were much worse than what you have seen, it sounds like.

 

The only recurring glitches tend to be where text in headers and footers gets inserted and sometimes a floating text box will be inserted in the extracted text quite far from where it appears on the page.  But the block of text from the box usually will at least be integral and not chopped up.

 

Only sometimes.  The rearrangements in practice are quite capricious.

 

The times when you may WANT to sort by position is when parsing text from PDFs that are more graphical in nature, such as those generated from PowerPoint type documents.   Even then though, it depends a lot on how the page is structured.   A bit of testing is usually necessary to figure out which setting works best with the particular PDF.

 

And my requirement was that I could not accept any magical knob turning.  My solution had to work across a huge range of sources.

 

As of 1.4 we have a lot of instrumentation that allows you to override / customize the demarcation between the following structural points:

Page
Article
Paragraph
Line
Word

 

That just doesn't really help.  I needed auto-tuning, line unbreaking and real flow following.

 


Re: Text Extraction with multi-column documents in PDFBox

Posted by Ted Dunning <te...@gmail.com>.
Exactly.

That is why I reverted to looking at how the text sits on the page.  My
approaches would fall apart for wide classes of documents as well.  For
instance, mono-font documents kill the "body font" technique that I use.
 Image-only OCR'ed documents are also a problem since they rarely have good
location or font information.

On Thu, Mar 31, 2011 at 12:46 PM, Martinez, Mel - 1004 - MITLL <
m.martinez@ll.mit.edu> wrote:

> If you have some foreknowledge about the structure of a given corpus of
> documents, you may be able to write some custom code that figures things
> out, but otherwise, PDF in general is simply not designed for that purpose.
>
>

Re: Text Extraction with multi-column documents in PDFBox

Posted by Ted Dunning <te...@gmail.com>.
Yes.  This use of the native flow works about 50-80% of the time in my
experience.  But it was way too error-prone to depend on and failed
spectacularly for many critical data sources.  Even where it worked, the
results were often not good enough.  For one thing, I needed real text flow
so that I could reliably reverse engineer hyphenation (for text indexing).
 I also needed to reliably remove headers, footers, page numbers, article
titles and similar boilerplate across thousands of document sources without
hand engineering each kind of document.

On Thu, Mar 31, 2011 at 8:58 AM, Martinez, Mel - 1004 - MITLL <
m.martinez@ll.mit.edu> wrote:

> Ted,
>
> A lot depends on how the PDF file was generated, but in general, so long as
> you leave the 'sort by position' attribute of PDFBox's PDFTextStripper as
> 'false' (the default) then the text extraction will be (mostly) logical and
> not positional.
>
>        PDFTextStripper myStripper = ...
>        // not actually necessary since false is the default.
>        myStripper.setSortByPosition(false);
>
> That is, if you have text in two columns on a page, the lines will be
> extracted by article and not cross columns.
>

Sort of.  As I mentioned, the quality across a bunch of data sources was
just not good enough to even contemplate deployment.  Moreover, there was no
way forward to improve the situation.

> SOME PDFs can be (and unfortunately are) generated such that the text
> objects are not logically arranged by article and the extraction still
> messes up.  But in my experience on most documents it does a pretty good
> job, especially those generated from word processors.
>

I was working against documents from publishers.  My results were much worse
than what you have seen, it sounds like.


> The only recurring glitches tend to be where text in headers and footers
> gets inserted and sometimes a floating text box will be inserted in the
> extracted text quite far from where it appears on the page.  But the block
> of text from the box usually will at least be integral and not chopped up.
>

Only sometimes.  The rearrangements in practice are quite capricious.


> The times when you may WANT to sort by position is when parsing text from
> PDFs that are more graphical in nature, such as those generated from
> PowerPoint type documents.   Even then though, it depends a lot on how the
> page is structured.   A bit of testing is usually necessary to figure out
> which setting works best with the particular PDF.
>

And my requirement was that I could not accept any magical knob turning.  My
solution had to work across a huge range of sources.


> As of 1.4 we have a lot of instrumentation that allows you to override /
> customize the demarcation between the following structural points:
>
> Page
> Article
> Paragraph
> Line
> Word
>

That just doesn't really help.  I needed auto-tuning, line unbreaking and
real flow following.

RE: Text Extraction with multi-column documents in PDFBox

Posted by "Martinez, Mel - 1004 - MITLL" <m....@ll.mit.edu>.
Ted,

A lot depends on how the PDF file was generated, but in general, so long as you leave the 'sort by position' attribute of PDFBox's PDFTextStripper as 'false' (the default), the text extraction will be (mostly) logical and not positional.

	PDFTextStripper myStripper = ...
	myStripper.setSortByPosition(false);  //not actually necessary since false is the default.

That is, if you have text in two columns on a page, the lines will be extracted by article and not cross columns.
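
To make the difference concrete, here is a toy Java model of the two orderings.  It is only a sketch: Chunk and ReadingOrder are hypothetical names, not PDFBox classes, and PDFBox's real ordering logic is more involved than this plain top-to-bottom, left-to-right comparison.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one positioned line of text; not a PDFBox class.
final class Chunk {
    final float x;      // left edge of the line
    final float y;      // distance from the top of the page
    final String text;

    Chunk(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class ReadingOrder {
    // Stream order: keep the chunks in the order the PDF producer wrote them.
    static List<String> streamOrder(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    // Naive positional order: top to bottom, then left to right.
    static List<String> positionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        List<String> out = new ArrayList<>();
        for (Chunk c : sorted) {
            out.add(c.text);
        }
        return out;
    }
}
```

For a two-column page written column by column (A1, A2 on the left, B1, B2 on the right), streamOrder keeps A1, A2, B1, B2, while positionOrder reads across the columns and gives A1, B1, A2, B2.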

SOME PDFs can be (and unfortunately are) generated such that the text objects are not logically arranged by article and the extraction still messes up.  But in my experience on most documents it does a pretty good job, especially those generated from word processors.

The only recurring glitches tend to be where text in headers and footers gets inserted and sometimes a floating text box will be inserted in the extracted text quite far from where it appears on the page.  But the block of text from the box usually will at least be integral and not chopped up.

The times when you may WANT to sort by position is when parsing text from PDFs that are more graphical in nature, such as those generated from PowerPoint type documents.   Even then though, it depends a lot on how the page is structured.   A bit of testing is usually necessary to figure out which setting works best with the particular PDF.

As of 1.4 we have a lot of instrumentation that allows you to override / customize the demarcation between the following structural points:

Page
Article
Paragraph
Line
Word

All you have to do is apply the demarcations that you would prefer using the setters or, for more complex cases, subclass the stripper and override the behavior of the getters for the start/stop demarcations.

In my own usage I have used this to extract text into a simple XML format with the above tags, and this has been applied to thousands of documents from a variety of sources.  For the most part, this works pretty well.

Good luck,

Mel


