Posted to users@pdfbox.apache.org by Torsten Petersdorf <to...@tele2.de> on 2009/01/02 21:46:11 UTC

Re: Extracting paper/book title from a PDF

Hi there,

As I wrote to this list a couple of weeks ago, this is exactly what I am
doing in my Bachelor's thesis.
I don't use any metadata, so this is a best-effort approach.

My approach uses a custom extension of PDFTextStripper for that
purpose. To get the title of a document I simply take the lines from 
page one with the biggest font size. Authors usually follow directly on 
the next lines. It was a bit of work to get the text in the right order, 
but you can use data from TextPosition to sort the text. The font size 
can be retrieved from there as well. For some documents the font size
does not give useful results; you can fall back on height or yscale
then, but results tend to be less accurate than with font size.
Hopefully, we will see an improvement here in later versions of PDFBox.
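
The "biggest font size on page one" heuristic could look roughly like the
sketch below. It is not the code from my thesis. TextLine and guessTitle are
hypothetical names: TextLine stands in for whatever a custom PDFTextStripper
subclass collects from the TextPosition data, and the sketch assumes the
lines of page one have already been assembled, each with a font size and a
vertical position that grows downwards.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only. TextLine is a hypothetical container, not a PDFBox class;
// a custom PDFTextStripper subclass would have to fill it from the
// TextPosition data it receives.
class TitleGuess {

    static final class TextLine {
        final String text;
        final float fontSize; // font size reported for the line
        final float y;        // vertical position; assumed to grow downwards

        TextLine(String text, float fontSize, float y) {
            this.text = text;
            this.fontSize = fontSize;
            this.y = y;
        }
    }

    // Returns the lines of page one that share the largest font size,
    // joined top to bottom, or an empty string if there are no lines.
    static String guessTitle(List<TextLine> firstPageLines) {
        float max = 0f;
        for (TextLine line : firstPageLines) {
            max = Math.max(max, line.fontSize);
        }
        List<TextLine> titleLines = new ArrayList<>();
        for (TextLine line : firstPageLines) {
            // The tolerance is an arbitrary choice for this sketch.
            if (Math.abs(line.fontSize - max) < 0.01f) {
                titleLines.add(line);
            }
        }
        // Restore reading order: smaller y is assumed to be higher on the page.
        titleLines.sort((a, b) -> Float.compare(a.y, b.y));
        StringBuilder title = new StringBuilder();
        for (TextLine line : titleLines) {
            if (title.length() > 0) {
                title.append(' ');
            }
            title.append(line.text.trim());
        }
        return title.toString();
    }
}
```

If the font size is unusable for a document, the same selection could be run
on height or yscale instead, with the accuracy caveat mentioned above.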

To get the references, I use a key word search first to get the whole 
section from the text. I split it up, one line per reference, and then
the fun part begins. I use pattern matching with regex and substrings to
extract title, authors and publication info from each line. I get pretty
good results on some docs, worse on others, but the goal of my work is
not to get all information from every single document, but to build a 
tool allowing users to enter their own strategies for getting the 
information they desire. There will also be future work on that topic
to further improve my results.
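
To give an idea of the reference steps (keyword search, one line per
reference, then regex), here is a minimal sketch. It is only an illustration
under strong assumptions, not my actual strategy code: the class and method
names are made up, the section is assumed to start at the last occurrence of
the word "References", entries are assumed to be numbered like [1], [2], and
the pattern only fits one citation style (authors, then a quoted title, then
publication info).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: names, keyword and pattern are assumptions for illustration.
class ReferenceSketch {

    // Fits entries like: [1] Authors, "Title," publication info
    private static final Pattern ENTRY =
        Pattern.compile("^\\[\\d+\\]\\s*(.+?),\\s*\"(.+?),?\"\\s*(.+)$");

    // Cuts out the text after the last "References" keyword and returns
    // one string per numbered entry, with line breaks collapsed to spaces.
    static List<String> splitReferences(String fullText) {
        List<String> entries = new ArrayList<>();
        int start = fullText.lastIndexOf("References");
        if (start < 0) {
            return entries; // keyword not found
        }
        String section = fullText.substring(start + "References".length());
        section = section.replaceAll("\\s+", " ").trim();
        // Split in front of every "[n]" marker without consuming it.
        for (String part : section.split("(?=\\[\\d+\\])")) {
            String entry = part.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    // Returns {authors, title, publication info}, or null if the entry
    // does not fit the assumed citation style.
    static String[] parse(String entry) {
        Matcher m = ENTRY.matcher(entry);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

Entries that do not fit the pattern come back as null, which is where
additional patterns or user-defined strategies would have to take over.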

To sum it up, this is not done in a couple of lines of code. The
PDFTextStripper extension has a good thousand lines and that is only 
preparing the text.

Torsten


Daniele Development-ML wrote:
> Hello everybody,
> I'm using PDFBox to try to extract some specific text from a PDF file. In
> particular, I'm trying to detect the book title, author, and the
> bibliographic entries (the references) - the PDF file is printed through the
> pdftex command.
>
> Extracting the raw text doesn't help too much as no data is carried with
> that. I was therefore trying to browse the document structure and access
> the COS objects and get the text value through them. This may just and only
> work for the title, and the authors - which both might be written in a
> different paragraph.
>
> However, I'm getting a bit confused on the real feasibility of this approach
> and on the use of the documentTreeStructure and the COSDictionary.
>
> Has anybody ever faced/solved this problem?
> Any comments or suggestions, or pointers to examples? The examples in the
> distro seem not to cover this aspect fully, or perhaps I am wrong.
>
> Many thanks,
>
> Dan
>
>