Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/03/11 09:38:27 UTC
[jira] Resolved: (PDFBOX-7) extract information from tagged PDF
[ https://issues.apache.org/jira/browse/PDFBOX-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-7.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.1.0
As the basic work is done I'll set this to resolved.
> extract information from tagged PDF
> -----------------------------------
>
> Key: PDFBOX-7
> URL: https://issues.apache.org/jira/browse/PDFBOX-7
> Project: PDFBox
> Issue Type: New Feature
> Components: PDModel
> Fix For: 1.1.0
>
> Attachments: PDFBOX-7_patch_00.txt, PDFBOX-7_patch_01.txt, PDFBOX-7_patch_02.txt, PDFBOX-7_patch_03.txt, PDFBOX-7_patch_04.txt, PDFMarkedContentExtractor.properties
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=805623
> Originally submitted by benlitchfield on 2003-09-13 07:38.
> Add the ability to extract information from a tagged PDF
> document. See taggedPDF.pdf for an example.
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES
> user_id=1468838
> Hi,
> we have to do two things:
> - parse the PDF object structure tree; all structural elements
> are inside the object tree (see e.g. PDF Reference 1.4,
> chapter 9.6 "Logical Structure");
> - parse the PDF page streams to extract drawing and text
> operations; these contain the actual content of the
> structural elements. This content is surrounded by BMC/EMC
> tags, which indicate the element object to which the
> enclosed content belongs.
> This is what I got from the PDF Reference.
> Regards,
> Qumar.
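
The second step in the comment above (finding the content enclosed by marked-content operators in a page stream) can be illustrated with a minimal sketch. It is not PDFBox code: the class MarkedContentSketch, its Marked type and the scan method are hypothetical names, and the tokenizer is deliberately simplified (whitespace-separated tokens, no nested dictionaries, no escaped parentheses, only the Tj text operator). It assumes the BDC form of the operator, whose property list carries the MCID that links the content to a structure element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from PDFBox.
public class MarkedContentSketch {

    /** One marked-content sequence: tag name, MCID (-1 if none), collected text. */
    public static final class Marked {
        public final String tag;
        public final int mcid;
        public final StringBuilder text = new StringBuilder();

        Marked(String tag, int mcid) {
            this.tag = tag;
            this.mcid = mcid;
        }
    }

    // A token is a dictionary <<...>>, a literal string (...), or a run of
    // non-whitespace characters. Simplification: nested dictionaries and
    // escaped or nested parentheses are not handled.
    private static final Pattern TOKEN =
            Pattern.compile("<<.*?>>|\\(.*?\\)|\\S+", Pattern.DOTALL);
    private static final Pattern MCID = Pattern.compile("/MCID\\s+(\\d+)");

    // Names, strings, dictionaries, arrays and numbers are operands;
    // every other token is treated as an operator.
    private static boolean isOperand(String tok) {
        char c = tok.charAt(0);
        return c == '/' || c == '(' || c == '<' || c == '['
                || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    /** Returns the marked-content sequences of a content stream in document order. */
    public static List<Marked> scan(String contentStream) {
        List<Marked> found = new ArrayList<>();
        Deque<Marked> open = new ArrayDeque<>();   // sequences may nest
        List<String> operands = new ArrayList<>(); // operands precede their operator
        Matcher m = TOKEN.matcher(contentStream);
        while (m.find()) {
            String tok = m.group();
            if (isOperand(tok)) {
                operands.add(tok);
                continue;
            }
            if (tok.equals("BMC") || tok.equals("BDC")) {
                // First operand is the tag name, e.g. /H1; strip the slash.
                String tag = operands.isEmpty() ? "" : operands.get(0).substring(1);
                int mcid = -1;
                if (operands.size() > 1) {
                    Matcher id = MCID.matcher(operands.get(1));
                    if (id.find()) mcid = Integer.parseInt(id.group(1));
                }
                Marked seq = new Marked(tag, mcid);
                found.add(seq);
                open.push(seq);
            } else if (tok.equals("EMC")) {
                if (!open.isEmpty()) open.pop();
            } else if (tok.equals("Tj") && !operands.isEmpty() && !open.isEmpty()) {
                // Text shown inside a sequence belongs to the innermost open one.
                String s = operands.get(operands.size() - 1);
                if (s.startsWith("(") && s.endsWith(")")) {
                    open.peek().text.append(s, 1, s.length() - 1);
                }
            }
            operands.clear();
        }
        return found;
    }

    public static void main(String[] args) {
        String stream = String.join("\n",
                "/H1 <</MCID 0>> BDC",
                "BT /F1 12 Tf (Header1) Tj ET",
                "EMC",
                "/P <</MCID 1>> BDC",
                "BT (This is a para 1) Tj ET",
                "EMC");
        for (Marked seq : scan(stream)) {
            System.out.println(seq.tag + " mcid=" + seq.mcid + " text=" + seq.text);
        }
    }
}
```

A real extractor would also have to decode font encodings and handle the TJ operator; the sketch only shows how the tag and MCID of each sequence can be paired with the text drawn inside it.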
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES
> user_id=601708
> http://www.irs.gov/pub/irs-access/f1040ez_accessible.pdf
> would be a good form to start with.
> If you notice, they are putting labels on the form fields.
> These labels contain metadata critical to building tax
> software in rapid fashion. Without this metadata, the
> name of the form field is meaningless. It would be nice to
> extract this information so I can combine it with other
> data about the field (name, type, location, etc). I
> already know PDFBox can extract the other information about
> the fields. I haven't done it with PDFBox, but I did it
> with iText.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES
> user_id=601708
> More comments from users
> Tagged PDF will be a big thing in government because
> federal government procurement of Acrobat publishing
> technology falls under Section 508. States will likely
> follow.
> see:
> www.section508.gov
> http://www.irs.gov/pub/irs-access/
> or
> ftp://ftp.irs.gov/pub/irs-access/
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES
> user_id=1468838
> Hi,
> I was reading the PDF specification and learned that the
> structure information of a PDF is in the PDSEdit layer. The
> PDSEdit layer gives access to the structure tree within a
> PDF, and its methods and objects are prefixed with PDS. So
> how can we get access to the PDSEdit layer of a PDF?
> [comment on SourceForge]
> Originally sent by qumar.
> Logged In: YES
> user_id=1468838
> It would be nice if PDFBox could provide the ability to
> extract information from tagged PDF. As Adobe Acrobat Reader
> shows the tags of a PDF, PDFBox should also be able to read
> tagged PDFs.
> For example, if we have a PDF file with para1 under
> header1, para2 under header2, and a table with rows and
> columns, something like:
>
> Header1
> This is a para 1 ,it describes about a disease.
> Header2
> This is a para2,describes remedies of disease.
> Table
> A B
> C D
>
>
> Now the tagged PDF looks like this in Adobe Acrobat Reader:
>
> <Heading 1>
>   Header1
> <Normal>
>   This is a para 1 ,it describes about a disease.
> <Heading 1>
>   Header2
> <Normal>
>   This is a para2,describes remedies of disease.
> <Heading 1>
>   Table
> <Table>
>   <TBody>
>     <TR>
>       <TD>
>         <Normal>
>           A
>       <TD>
>         <Normal>
>           B
>     <TR>
>       <TD>
>         <Normal>
>           C
>       <TD>
>         <Normal>
>           D
> How can we extract Heading 1, Heading 2 and the tabular data
> using PDFBox?
> This is a good feature which should be added to PDFBox's
> armory.
> Please provide this feature.
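
To make the request above concrete, here is a minimal sketch of what extracting headings and table cells amounts to once a tag tree is available. It does not use the PDFBox API: StructureTreeSketch, Node and all method names are hypothetical, the tree is built by hand from the sample document in the comment, and the "Document" root element is an assumption added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from PDFBox.
public class StructureTreeSketch {

    /** Hypothetical structure element: a tag, optional text content, child elements. */
    public static final class Node {
        public final String type;
        public final String text;
        public final List<Node> kids = new ArrayList<>();

        public Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        public Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Joins the text of a node and all of its descendants in document order. */
    public static String textOf(Node n) {
        StringBuilder sb = new StringBuilder(n.text == null ? "" : n.text);
        for (Node kid : n.kids) {
            String t = textOf(kid);
            if (sb.length() > 0 && !t.isEmpty()) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }

    /** Depth-first search for every element with the given tag. */
    public static List<Node> find(Node n, String type) {
        List<Node> out = new ArrayList<>();
        if (n.type.equals(type)) out.add(n);
        for (Node kid : n.kids) out.addAll(find(kid, type));
        return out;
    }

    /** Text of every element with the given tag, e.g. all "Heading 1" elements. */
    public static List<String> texts(Node root, String type) {
        List<String> out = new ArrayList<>();
        for (Node n : find(root, type)) out.add(textOf(n));
        return out;
    }

    /** Table content as rows of cell texts: one inner list per TR, one entry per TD. */
    public static List<List<String>> tableRows(Node root) {
        List<List<String>> rows = new ArrayList<>();
        for (Node tr : find(root, "TR")) rows.add(texts(tr, "TD"));
        return rows;
    }

    private static Node cell(String text) {
        return new Node("TD", null).add(new Node("Normal", text));
    }

    /** Hand-built tree modelled on the sample document in the comment. */
    public static Node sample() {
        Node body = new Node("TBody", null)
                .add(new Node("TR", null).add(cell("A")).add(cell("B")))
                .add(new Node("TR", null).add(cell("C")).add(cell("D")));
        return new Node("Document", null) // root element is an assumption
                .add(new Node("Heading 1", "Header1"))
                .add(new Node("Normal", "This is a para 1 ,it describes about a disease."))
                .add(new Node("Heading 1", "Header2"))
                .add(new Node("Normal", "This is a para2,describes remedies of disease."))
                .add(new Node("Heading 1", "Table"))
                .add(new Node("Table", null).add(body));
    }

    public static void main(String[] args) {
        Node doc = sample();
        System.out.println("Headings: " + texts(doc, "Heading 1"));
        System.out.println("Table rows: " + tableRows(doc));
    }
}
```

In an actual tagged PDF the text would not be stored on the elements themselves; each element would refer to marked-content sequences in the page streams, so the tree walk would have to be combined with content-stream parsing.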
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.