Posted to users@pdfbox.apache.org by cs...@sina.com on 2015/03/23 09:52:20 UTC

ask for help

Dear Sir/Madam,
I'm a Chinese student. I want to use PDFBox to do some research on PDF extraction.
The most important thing for me right now is to extract structural information from PDFs. I know PDFBox is very powerful, but I do not know how to get this information out of a PDF. I have already extracted the plain text from a PDF using PDFBox, but plain text alone does not meet my needs. For natural language processing I need to parse the PDF, so I need not only the text but also the PDF's structure; that is, I need all the tags in a PDF, such as Tj and Tm. PDFBox has a lot of APIs, and I don't know how to get the value of every tag of each PDF object. I hope to get every PDF object's structural information, such as its font and font size, so that I can derive patterns such as the maximum font size and then find the "title" of each paper. For objects that have a content stream, I hope to decode the stream. Finally, I want to obtain the pattern of each object that has a content stream, so that I can classify the objects and find the category I need.
Do you think this is possible?
Could you give me some examples of extracting from a PDF, especially extracting the objects that have a stream, finding the object with the maximum font size, and decoding the stream? I hope you can provide some source code for extracting from PDFs with PDFBox, not just stripper.getText().
Thanks a billion! I hope you write to me soon!
Sincerely,
 
dock CHEN

Re: ask for help

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

It is formally impossible to extract structural information from an
arbitrary PDF. The primitives can come in any order and only their position
on the page matters. We have written an Open Source heuristic program
http://bitbucket.org/petermr/pdf2svg which overrides PageDrawer and
captures the stream as medium-level primitives. This normalizes the stream
and creates an output of SVG. A further program
http://bitbucket.org/petermr/svg2xml uses heuristics based on whitespace
and bold headings to create structures such as titles.
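
As a rough illustration of the "largest font is probably the title"
idea raised in this thread, here is a minimal sketch. It is not the code
of pdf2svg or svg2xml. The Fragment class and both method names are
hypothetical, and the sketch assumes a single-column page with y measured
from the top. With PDFBox, such fragments could probably be collected by
subclassing PDFTextStripper and overriding writeString(String,
List<TextPosition>), reading each TextPosition's getXDirAdj(),
getYDirAdj() and getFontSizeInPt(); that wiring is left out so the example
runs without any library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: guess a title from positioned text fragments. */
class TitleHeuristic {

    /** One run of text with its position and font size (illustrative names). */
    static final class Fragment {
        final String text;
        final double x;        // left edge, in points
        final double y;        // baseline measured from the top of the page
        final double fontSize; // in points

        Fragment(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Returns a copy sorted top to bottom, then left to right. Ignores columns. */
    static List<Fragment> readingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        return sorted;
    }

    /** Joins, in reading order, every fragment whose size is within tolerance of the largest. */
    static String guessTitle(List<Fragment> fragments, double tolerance) {
        double max = 0;
        for (Fragment f : fragments) {
            max = Math.max(max, f.fontSize);
        }
        StringBuilder title = new StringBuilder();
        for (Fragment f : readingOrder(fragments)) {
            if (max - f.fontSize <= tolerance) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(f.text.trim());
            }
        }
        return title.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        // Deliberately out of order, as the primitives in a content stream may be.
        page.add(new Fragment("Abstract", 72, 200, 10));
        page.add(new Fragment("in Scholarly PDFs", 72, 98, 18));
        page.add(new Fragment("A. Author", 72, 130, 12));
        page.add(new Fragment("Finding Titles", 72, 76, 18));
        System.out.println(guessTitle(page, 0.5)); // Finding Titles in Scholarly PDFs
    }
}
```

Sorting by position before joining is what makes the result independent
of the order in which the primitives appear in the stream. A real
document would need more than this, for example column detection and a
check that the largest text is not a journal banner or a drop cap.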

We have developed it for academic PDFs (from scholarly publishers) which,
unhappily, are among the worst PDFs I have encountered. No Unicode (in a
recent example, plus-minus was represented by underscore-plus). Bold is
often a shade of gray. Double-column PDFs are often very hard to interpret.

We are developing a community effort to create templates for structuring.

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069