Posted to dev@tika.apache.org by "David Cole (JIRA)" <ji...@apache.org> on 2013/10/09 16:54:42 UTC

[jira] [Comment Edited] (TIKA-1178) Improve docx multiple section handling - headers and footers

    [ https://issues.apache.org/jira/browse/TIKA-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790293#comment-13790293 ] 

David Cole edited comment on TIKA-1178 at 10/9/13 2:53 PM:
-----------------------------------------------------------

Issue description updated.
Test files added.
Sample code, actual output, and expected output added.


was (Author: testaccount118):
test files mentioned in the description.

> Improve docx multiple section handling - headers and footers
> ------------------------------------------------------------
>
>                 Key: TIKA-1178
>                 URL: https://issues.apache.org/jira/browse/TIKA-1178
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Cole
>            Priority: Minor
>              Labels: docx, parsing, sectPr
>         Attachments: 3pages_1section_FirstEvenOddHeaderFooter_mod.docx, 3pages_3sections_defaultHeaderFooter_mod.docx
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Currently, docx-to-plain-text conversion is only accurate for single-page files. First, the sectPr tag right above the closing body tag is not the overall document property; it is the section property of the last section (if there is only one section, then it is effectively the overall document property as well). As a result, if I had a large docx file (say, a book in which each chapter is its own section), I would get the last chapter's header at the beginning of the document.
> Addressing sectPr tags inside paragraphs:
> Why are we wrapping the paragraph with the header and footer?
> We should be buffering up pages as we read the docx file until we hit a section property, at which point we decide how to wrap what we just consumed. I realize that it is difficult to determine page breaks when they are caused by overflow (rather than by explicit page breaks).
> The time to completion really depends on how much improvement we want to make in this area.
> Just for reference, my assumptions about how to interpret the Office Open XML structure come from the documentation on this site: http://www.ecma-international.org/publications/standards/Ecma-376.htm
> UPDATE:
> Sample code, test files, and output:
>     // 'test' is the .docx file under test
>     InputStream in = new FileInputStream(test);
>     
>     ContentHandler handler = new BodyContentHandler();
>     Metadata metadata = new Metadata();
>     
>     // Run the OOXML extractor directly and print the extracted plain text
>     OOXMLExtractorFactory.parse(in, handler, metadata, new ParseContext());
>     String text = handler.toString();
>     System.out.println(text);
> The first test file has 3 pages, with one section per page and a default (odd) header and footer for each section. For reading convenience, the text listed below describes itself, e.g. "Header 1" is the header text of the first page.
> Here is the content of the sample file (3pages_3sections_defaultHeaderFooter_mod.docx):
> Header 1
> First paragraph on page 1
> Second paragraph on page 1
> Footer 1
> Header 2
> First paragraph on page 2
> Second paragraph on page 2
> Footer 2
> Header 3
> First paragraph on page 3
> Second paragraph on page 3
> Footer 3
> The output I get is:
> Header 3
> First paragraph on page 1
> Header 1
> Second paragraph on page 1
> Footer 1
> First paragraph on page 2
> Header 2
> Second paragraph on page 2
> Footer 2
> First paragraph on page 3
> Second paragraph on page 3
> Footer 3
> Here is another file, with only 1 section that uses first, odd, and even headers (3pages_1section_FirstEvenOddHeaderFooter_mod.docx):
> First page header
> First paragraph on page 1
> Second paragraph on page 1
> First page footer
> Second page header (even)
> First paragraph on page 2
> Second paragraph on page 2
> Second page footer (even)
> Third page header (odd)
> First paragraph on page 3
> Second paragraph on page 3
> Third page footer (odd)
> Actual output:
> First page header
> Second page header (even)
> Third page header (odd)
> First paragraph on page 1
> Second paragraph on page 1
> First paragraph on page 2
> Second paragraph on page 2
> First paragraph on page 3
> Second paragraph on page 3
> First page footer
> Second page footer (even)
> Third page footer (odd)
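
The buffering approach proposed in the description (collect body text until a section property is reached, then decide how to wrap what was consumed) can be sketched in plain Java. This is only an illustration under simplifying assumptions, not Tika's or POI's implementation: the Event class and the render method are hypothetical names, each section is assumed to fit on a single page, and first/even/odd header variants are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: buffer body text, then wrap it once the section's sectPr is seen. */
public class SectionBufferSketch {

    /** Hypothetical stand-in for what a docx reader would emit: a paragraph or a sectPr. */
    public static final class Event {
        final boolean sectionEnd; // true for a sectPr event
        final String text;        // paragraph text (paragraph events only)
        final String header;      // header text referenced by the sectPr
        final String footer;      // footer text referenced by the sectPr

        private Event(boolean sectionEnd, String text, String header, String footer) {
            this.sectionEnd = sectionEnd;
            this.text = text;
            this.header = header;
            this.footer = footer;
        }

        public static Event paragraph(String text) {
            return new Event(false, text, null, null);
        }

        public static Event sectPr(String header, String footer) {
            return new Event(true, null, header, footer);
        }
    }

    /** Emits header, buffered paragraphs, footer for each section, in document order. */
    public static List<String> render(List<Event> events) {
        List<String> out = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (Event e : events) {
            if (!e.sectionEnd) {
                buffer.add(e.text); // hold body text until we know its section
                continue;
            }
            out.add(e.header);      // section boundary: wrap what was consumed
            out.addAll(buffer);
            out.add(e.footer);
            buffer.clear();
        }
        out.addAll(buffer);         // trailing text with no sectPr is emitted unwrapped
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the structure described for 3pages_3sections_defaultHeaderFooter_mod.docx
        List<Event> events = Arrays.asList(
            Event.paragraph("First paragraph on page 1"),
            Event.paragraph("Second paragraph on page 1"),
            Event.sectPr("Header 1", "Footer 1"),
            Event.paragraph("First paragraph on page 2"),
            Event.paragraph("Second paragraph on page 2"),
            Event.sectPr("Header 2", "Footer 2"),
            Event.paragraph("First paragraph on page 3"),
            Event.paragraph("Second paragraph on page 3"),
            Event.sectPr("Header 3", "Footer 3"));
        for (String line : render(events)) {
            System.out.println(line);
        }
    }
}
```

Run on the event sequence in main, the sketch prints the lines in the same order as the sample file content listed in the description (Header 1 through Footer 3). Sections that span several pages would additionally need page-break detection, which the description notes is difficult when breaks are caused by overflow.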



--
This message was sent by Atlassian JIRA
(v6.1#6144)