Posted to users@pdfbox.apache.org by Ted Dunning <te...@gmail.com> on 2010/01/29 00:24:56 UTC

trying to do better text extraction

I am working on text extraction from some PDF documents.  As you might
expect, results are pretty good for very simple documents and very bad for
some fancy ones.

Two-column documents with headers, footers, and text insets are
particularly ugly.  Using the -sort option to ExtractText makes things much
worse, since lines from the insets and columns all get mixed together.

I have an idea that I could build a classifier using simple machine learning
that would quickly learn which blocks are headers and footers and would be
able to group the lines of each column together.  Given a set of non-header
blocks of text, it should be pretty simple to discern the text flow.
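
To make the idea concrete, here is a rough sketch in plain Java of the kind
of post-processing I have in mind.  It is not PDFBox code: LayoutSketch,
Line, repeatKey, dropRepeated and readingOrder are all hypothetical names,
and it assumes each line's page, x and y are somehow available, with y
growing downward.  It also uses fixed rules in place of a trained
classifier, purely to show what positional input such a classifier would
need.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names come from PDFBox. */
final class LayoutSketch {

    /** One line of text with its page and left/top position (assumed: y grows downward). */
    static final class Line {
        final int page;
        final double x;
        final double y;
        final String text;

        Line(int page, double x, double y, String text) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Key that matches across pages: y rounded to 5 units, plus the text with digit runs masked. */
    static String repeatKey(Line l) {
        return Math.round(l.y / 5.0) + "|" + l.text.replaceAll("\\d+", "#").trim();
    }

    /** Drops lines whose key occurs on more than half of the pages (likely headers or footers). */
    static List<Line> dropRepeated(List<Line> lines, int pageCount) {
        Map<String, Set<Integer>> pagesByKey = new HashMap<>();
        for (Line l : lines) {
            pagesByKey.computeIfAbsent(repeatKey(l), k -> new HashSet<>()).add(l.page);
        }
        List<Line> body = new ArrayList<>();
        for (Line l : lines) {
            if (pagesByKey.get(repeatKey(l)).size() * 2 <= pageCount) {
                body.add(l);
            }
        }
        return body;
    }

    /** Orders lines by page, then by column (left to right), then top to bottom within a column. */
    static List<Line> readingOrder(List<Line> lines, double columnGap) {
        List<Line> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt((Line l) -> l.page)
                .thenComparingDouble((Line l) -> l.x));
        List<Line> result = new ArrayList<>();
        List<Line> column = new ArrayList<>();
        for (Line l : sorted) {
            if (!column.isEmpty()) {
                Line first = column.get(0);
                // A new page, or a left edge far to the right, starts a new column.
                if (l.page != first.page || l.x - first.x > columnGap) {
                    column.sort(Comparator.comparingDouble((Line c) -> c.y));
                    result.addAll(column);
                    column.clear();
                }
            }
            column.add(l);
        }
        column.sort(Comparator.comparingDouble((Line c) -> c.y));
        result.addAll(column);
        return result;
    }
}
```

Something like readingOrder(dropRepeated(lines, pageCount), 20.0) would then
give the body text column by column.  The 20-unit column gap and the 5-unit
y rounding are arbitrary guesses, and insets are not handled at all.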

Thus my problem is how to find out the locations and rough presentation
information (position, font, size) of blocks of text in a PDF document.  An
easy way to hook into the text extraction process would be ideal.  Failing
that, a way to get more verbose structural information out of the
ExtractText tool would also help.

Does anybody have any suggestions?

-- 
Ted Dunning, CTO
DeepDyve