Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/10/05 18:37:42 UTC

[jira] [Commented] (TIKA-1178) Improve docx multiple section handling - headers and footers

    [ https://issues.apache.org/jira/browse/TIKA-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787256#comment-13787256 ] 

Nick Burch commented on TIKA-1178:
----------------------------------

Are you able to produce a small test document that shows up the problem? I guess something with a few sections: some with headers/footers, some without, some with one but not the other, and probably some even/odd/first-page variants too. It might be handy to include, in the body of each page, a human-readable description of what should be there. That should make writing a comprehensive unit test for this easy, and from there we can start thinking about how to fix it...!

(I'm not sure we can ever fully fix it, as the Word format isn't a page-based format like PDF, but we can probably get better)

> Improve docx multiple section handling - headers and footers
> ------------------------------------------------------------
>
>                 Key: TIKA-1178
>                 URL: https://issues.apache.org/jira/browse/TIKA-1178
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Cole
>            Priority: Minor
>              Labels: docx, parsing, sectPr
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Currently docx to plain text is only accurate for single page files. First off, the sectPr tag right above the closing body tag is not the overall document property; it is the section property of the last section (if there is only one section, then yes, it is the overall document property per se). Right now, if I had a large docx file (let's say a book in which I broke each chapter into its own section), then I would get the last chapter's header as the header at the beginning of the document.
> Addressing sectPr tags inside paragraphs:
> Why are we wrapping the paragraph with the header and footer?
> We should be buffering up pages as we read the docx file, until we hit a section property, at which point we decide how to wrap what we just consumed. I realize that it is difficult to determine page breaks when they are caused by overflow (not explicit page breaks).
> The time for completion is really dependent on how much improvement we want to add in this area.
> Just for reference, my assumptions on Office Open XML structure interpretation come from the documentation on this site: http://www.ecma-international.org/publications/standards/Ecma-376.htm
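
To make the buffering idea in the description concrete, here is a minimal sketch in Python. It is not Tika code and not a proposed patch. It assumes the usual WordprocessingML layout, where a sectPr inside a paragraph's pPr ends a section and the sectPr that is a direct child of body describes the last section. The names split_sections and SAMPLE are made up for illustration, and the sketch ignores tables, page breaks, and any inheritance of headers or footers between sections.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML and relationship namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS = {"w": W, "r": R}


def split_sections(document_xml):
    """Group paragraph text by section, buffering until a sectPr is seen.

    Hypothetical helper: returns a list of dicts, one per section, holding
    the buffered paragraph texts and the header/footer relationship ids.
    """
    body = ET.fromstring(document_xml).find("w:body", NS)
    sections = []
    buffered = []  # paragraphs read since the previous section ended

    def close_section(sect_pr):
        # Collect header/footer references keyed by (kind, type),
        # e.g. ("header", "default") -> "rId1".
        refs = {}
        if sect_pr is not None:
            for kind in ("header", "footer"):
                for ref in sect_pr.findall(f"w:{kind}Reference", NS):
                    refs[(kind, ref.get(f"{{{W}}}type"))] = ref.get(f"{{{R}}}id")
        sections.append({"paragraphs": list(buffered), "refs": refs})
        buffered.clear()

    for child in body:
        if child.tag == f"{{{W}}}p":
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            buffered.append(text)
            # A sectPr inside pPr marks this paragraph as the section's last.
            sect_pr = child.find("w:pPr/w:sectPr", NS)
            if sect_pr is not None:
                close_section(sect_pr)
        elif child.tag == f"{{{W}}}sectPr":
            # The body-level sectPr belongs to the final section only.
            close_section(child)

    if buffered:  # trailing content with no sectPr at all
        close_section(None)
    return sections


# Tiny hand-written stand-in for word/document.xml (two sections).
SAMPLE = f"""<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body>
    <w:p><w:r><w:t>Chapter one</w:t></w:r></w:p>
    <w:p>
      <w:pPr><w:sectPr><w:headerReference w:type="default" r:id="rId1"/></w:sectPr></w:pPr>
      <w:r><w:t>End of chapter one</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Chapter two</w:t></w:r></w:p>
    <w:sectPr><w:headerReference w:type="default" r:id="rId2"/></w:sectPr>
  </w:body>
</w:document>"""

if __name__ == "__main__":
    for number, section in enumerate(split_sections(SAMPLE), start=1):
        print(number, section["refs"], section["paragraphs"])
```

Each section's content is only attributed to a header or footer once its sectPr has been read, so the first chapter is associated with its own header reference rather than with the one from the body-level sectPr.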


