Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/03/11 09:38:27 UTC
[jira] Resolved: (PDFBOX-7) extract information from tagged PDF

     [ https://issues.apache.org/jira/browse/PDFBOX-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-7.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1.0

As the basic work is done I'll set this to resolved.

> extract information from tagged PDF
> -----------------------------------
>
>                 Key: PDFBOX-7
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-7
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>             Fix For: 1.1.0
>
>         Attachments: PDFBOX-7_patch_00.txt, PDFBOX-7_patch_01.txt, PDFBOX-7_patch_02.txt, PDFBOX-7_patch_03.txt, PDFBOX-7_patch_04.txt, PDFMarkedContentExtractor.properties
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=805623
> Originally submitted by benlitchfield on 2003-09-13 07:38.
> Add the ability to extract information from a tagged PDF 
> document.  See taggedPDF.pdf for an example.
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES 
> user_id=1468838
> Hi,
> we have to:
> - parse the PDF object structure tree; all structural
> elements are inside the object tree (see e.g. PDF Reference
> 1.4, chapter 9.6 "Logical Structure").
> - parse the PDF page streams to extract drawing and text
> operations; these contain the actual content of the
> structural elements. This content is surrounded by BMC/EMC
> tags, which carry the information about which element object
> the enclosed content belongs to. This is what I got from the
> PDF Reference.
> Regards,
> Qumar.
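
The second step above (finding the content enclosed by marked-content operators) can be sketched roughly as follows. This is only an illustration, not PDFBox code: the class and method names are made up, the input is assumed to be an already decoded content stream, and it assumes the BDC form with an inline property list containing an MCID entry (in the PDF Reference, BDC is the variant of BMC that carries a property list). A real parser would need a proper tokenizer to cope with nested sequences, escaped parentheses, TJ arrays and named property lists.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from PDFBox.
class MarkedContentSketch {
    // Matches "/Tag <</MCID n>> BDC ... EMC" in a simplified, decoded stream.
    private static final Pattern SEQUENCE = Pattern.compile(
            "/(\\w+)\\s*<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC",
            Pattern.DOTALL);

    // Matches a simple "(text) Tj" show-text operation without escapes.
    private static final Pattern SHOWN_TEXT =
            Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    /** Maps each marked-content ID to the text shown inside its sequence. */
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher seq = SEQUENCE.matcher(contentStream);
        while (seq.find()) {
            int mcid = Integer.parseInt(seq.group(2));
            StringBuilder text = new StringBuilder();
            Matcher shown = SHOWN_TEXT.matcher(seq.group(3));
            while (shown.find()) {
                text.append(shown.group(1));
            }
            result.put(mcid, text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample stream, loosely based on the example in this issue.
        String stream =
                "/H1 <</MCID 0>> BDC BT (Header1) Tj ET EMC\n"
              + "/P <</MCID 1>> BDC BT (This is a para 1) Tj ET EMC\n";
        textByMcid(stream).forEach(
                (mcid, text) -> System.out.println(mcid + " -> " + text));
    }
}
```

The resulting MCID-to-text map is what a structure tree walker would consult when a structure element refers to its content by marked-content ID.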
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> http://www.irs.gov/pub/irs-access/f1040ez_accessible.pdf
> would be a good form to start with.
> If you look, you will notice they are putting labels on the
> form fields. These labels contain metadata critical to
> building tax software rapidly. Without this metadata, the
> name of the form field is meaningless. It would be nice to
> extract this information so I can combine it with other
> data about the field (name, type, location, etc.). I
> already know PDFBox can extract the other information about
> the fields. I haven't done it with PDFBox, but I did it
> with iText.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> More comments from users
> Tagged PDF will be a big thing in government because 
> federal government procurement of Acrobat publishing 
> technology falls under Section 508.  States will likely 
> follow.
> See:
> www.section508.gov
> http://www.irs.gov/pub/irs-access/
> or
> ftp://ftp.irs.gov/pub/irs-access/
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES 
> user_id=1468838
> Hi,
> I was reading the PDF specification and learned that the
> structure information of a PDF is in the PDSEdit layer.
> The PDSEdit layer gives access to the structure tree within
> a PDF, and its methods and objects are prefixed with PDS. So
> how can we get access to the PDSEdit layer of a PDF?
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES 
> user_id=1468838
> It would be nice if PDFBox could provide the ability to
> extract information from tagged PDF. As Adobe Acrobat Reader
> shows the tags of a PDF, PDFBox should also be able to read
> tagged PDFs.
> For example, suppose we have a PDF file with para 1 under
> header 1, para 2 under header 2, and a table with rows and
> columns, something like:
>  
> Header1
> This is para 1, it describes a disease.
> Header2
> This is para 2, it describes remedies for the disease.
> Table 
> A B  
> C D 
>  
>  
> Now the tagged PDF looks like this in Adobe Acrobat Reader:
>  
> <Heading 1> 
> Header1 
> <Normal>  
> This is para 1, it describes a disease.
> <Heading 1>
> Header2
> <Normal>
> This is para 2, it describes remedies for the disease.
> <Heading 1> 
> Table 
> <Table> 
> <TBody> 
> <TR> 
> <TD> 
> <Normal> 
> A 
> <TD> 
> <Normal> 
> B 
> <TR> 
> <TD> 
> <Normal> 
> C 
> <TD> 
> <Normal> 
> D 
> How can we extract Heading 1, Heading 2, and the tabular
> data using PDFBox?
> This is a good feature which should be added to PDFBox's
> armory.
> Please provide this feature.
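
To make the request above concrete, here is a minimal sketch of the kind of traversal it asks for: walking a tree of structure elements and pulling out the headings and the table rows. It deliberately does not use any PDFBox classes; `Elem`, `collect`, `tableRows` and `sample` are hypothetical names, and the tree is built by hand from the example in the comment (with the tag names "Heading 1" and "Normal" taken as shown there) instead of being read from a PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; Elem is a stand-in for a structure element.
class StructureTreeSketch {
    static final class Elem {
        final String type;     // tag name, e.g. "Heading 1" or "TD"
        final String text;     // own text, or null for pure containers
        final List<Elem> kids;

        Elem(String type, String text, Elem... kids) {
            this.type = type;
            this.text = text;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Concatenates the text of an element and its descendants in order. */
    static String textOf(Elem e) {
        StringBuilder sb = new StringBuilder();
        if (e.text != null) {
            sb.append(e.text);
        }
        for (Elem kid : e.kids) {
            sb.append(textOf(kid));
        }
        return sb.toString();
    }

    /** Collects the text of every element with the given tag, depth-first. */
    static List<String> collect(Elem root, String type) {
        List<String> out = new ArrayList<>();
        if (root.type.equals(type)) {
            out.add(textOf(root));
        }
        for (Elem kid : root.kids) {
            out.addAll(collect(kid, type));
        }
        return out;
    }

    /** Returns one list of cell texts per TR element found in the tree. */
    static List<List<String>> tableRows(Elem root) {
        List<List<String>> rows = new ArrayList<>();
        if (root.type.equals("TR")) {
            List<String> cells = new ArrayList<>();
            for (Elem kid : root.kids) {
                if (kid.type.equals("TD")) {
                    cells.add(textOf(kid));
                }
            }
            rows.add(cells);
        } else {
            for (Elem kid : root.kids) {
                rows.addAll(tableRows(kid));
            }
        }
        return rows;
    }

    static Elem cell(String text) {
        return new Elem("TD", null, new Elem("Normal", text));
    }

    /** Hand-built tree mirroring the example document in this issue. */
    static Elem sample() {
        return new Elem("Document", null,
                new Elem("Heading 1", "Header1"),
                new Elem("Normal", "This is para 1, it describes a disease."),
                new Elem("Heading 1", "Header2"),
                new Elem("Normal", "This is para 2, it describes remedies."),
                new Elem("Heading 1", "Table"),
                new Elem("Table", null,
                        new Elem("TBody", null,
                                new Elem("TR", null, cell("A"), cell("B")),
                                new Elem("TR", null, cell("C"), cell("D")))));
    }

    public static void main(String[] args) {
        System.out.println(collect(sample(), "Heading 1"));
        System.out.println(tableRows(sample()));
    }
}
```

In a real document the element text would not be stored on the node itself; it would have to be looked up from the page content via the marked-content references of each structure element.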

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.