Posted to dev@pdfbox.apache.org by Johannes Koch <jo...@fit.fraunhofer.de> on 2009/12/18 13:50:17 UTC

Work on structure tree

Hi,

I would like to do some work on the PDF structure tree for PDFBox. This
looks related to feature request
<https://issues.apache.org/jira/browse/PDFBOX-7>. I guess
org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement
and PDStructureTreeRoot are the classes to which higher-level methods
should be added, providing structure information based on the
lower-level COSDictionary.

As I'm no expert in PDF matters (I have started to read the relevant
sections in the PDF 32000-1:2008 spec), it would be nice to get some
help when needed.

-- 
Johannes Koch
Fraunhofer Institute for Applied Information Technology FIT
Web Compliance Center
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628    Fax: +49-2241-142065

Re: Work on structure tree

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
I gained some experience with PDF's structure tree and tagged PDF when
this was implemented for Apache FOP. I can try to keep an eye out for
your questions. No promises on response times, though. At any rate,
PDFBox could profit from this work, especially in the text extraction
department, since tagged PDF allows text to be extracted much more
reliably.

It's probably also a good idea to get yourself a copy of Apache FOP
Trunk ([1]; the new functionality is not released yet) so you can
generate simple test cases for yourself (if you know a little bit of
XSL-FO).

[1] http://xmlgraphics.apache.org/fop/trunk/accessibility.html

Jeremias Maerki