Posted to users@pdfbox.apache.org by Chenping Ni <cn...@infogix.com> on 2010/03/30 20:56:32 UTC

how to read marked content from tagged pdf

I am using the current code in SVN (pdfbox 1.0.1 SNAPSHOT).

I have seen some new/updated code in the following packages since the
official 1.0.0 release:

  org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure
  org.apache.pdfbox.pdmodel.documentinterchange.markedcontent

 

I created marked_content.pdf using iText code, and the tags show up fine
in Acrobat.

 

The structure is like this:

 

(Top level) "Everything" (a Sect)

 It has three <p> children (I will call them P1, P2 and P3; they are
actually all just "P"s). Each "P" has some text in it.

 

 

P1:  1It was the best of times, it was the worst of times, 2it was the
age of wisdom, it was the age of foolishness, 3it was the epoch of
belief, it was the epoch of incredulity, 4it was the season of Light, it
was the season of Darkness, 5it was the spring of hope, it was the
winter of despair.

 

P2:  1We had everything before us, we had nothing before us, 2we were all
going direct to Heaven, we were all going direct 3the other way-in
short, the period was so far like the present 4period, that some of its
noisiest authorities insisted on its 5being received, for good or for
evil, in the superlative degree 6of comparison only.

 

P3:  It was the best of times.

 

How do I read this text back using PDFBox (the newest code in SVN)?

 

So far I have been using

  PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();

and

  PDStructureTreeRoot structureTreeRoot = docCatalog.getStructureTreeRoot();

 

I have looked through the contents of structureTreeRoot, but I cannot
find the text content anywhere in it.
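
As far as I understand the PDF format, the structure tree holds only
references to the text: each "P" lists marked-content IDs (MCIDs), and the
text itself sits in the page content streams, tagged with the same IDs.
That would explain why the text does not show up under structureTreeRoot.
The sketch below models that two-step lookup in plain Java, with no PDFBox
dependency. Element, textOf, the MCID numbers and the text fragments are
all made up for illustration; they are not PDFBox API. In PDFBox the tree
side would presumably be the kids of each structure element, and the
content side whatever the markedcontent package extracts per page, but I
have not confirmed how the snapshot connects the two.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StructureTextSketch {

    /** Hypothetical stand-in for a structure element such as Sect or P. */
    public static final class Element {
        public final String type;
        // A kid is either another Element or an Integer MCID.
        public final List<Object> kids = new ArrayList<>();

        public Element(String type, Object... kids) {
            this.type = type;
            for (Object kid : kids) {
                this.kids.add(kid);
            }
        }
    }

    /**
     * Walks the element's kids in order and concatenates the text found
     * for each MCID. MCIDs with no entry in the map are skipped.
     */
    public static String textOf(Element element, Map<Integer, String> textByMcid) {
        StringBuilder sb = new StringBuilder();
        for (Object kid : element.kids) {
            if (kid instanceof Element) {
                sb.append(textOf((Element) kid, textByMcid));
            } else if (kid instanceof Integer) {
                String text = textByMcid.get(kid);
                if (text != null) {
                    sb.append(text);
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed result of scanning the page content: MCID -> text.
        Map<Integer, String> textByMcid = new HashMap<>();
        textByMcid.put(0, "It was the best of times, ");
        textByMcid.put(1, "it was the worst of times.");
        textByMcid.put(2, "We had everything before us.");
        textByMcid.put(3, "It was the best of times.");

        // Assumed tree: one Sect with three P children.
        Element p1 = new Element("P", 0, 1);
        Element p2 = new Element("P", 2);
        Element p3 = new Element("P", 3);
        Element everything = new Element("Sect", p1, p2, p3);

        for (Object kid : everything.kids) {
            Element p = (Element) kid;
            System.out.println(p.type + ": " + textOf(p, textByMcid));
        }
    }
}
```

If this model is right, what I am missing is the PDFBox call that builds
the MCID-to-text map for a page.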

 

Could anyone kindly tell me how to get to the text?

 

Thanks,

 

Tom



Chenping Ni | Infogix, Inc. 
phone 630-505-5415 |  fax 630-505-1812  
cni@infogix.com | www.infogix.com
