Posted to dev@tika.apache.org by "Gabriel Valencia (JIRA)" <ji...@apache.org> on 2012/04/27 19:10:49 UTC

[jira] [Created] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Gabriel Valencia created TIKA-906:
-------------------------------------

             Summary: Headers, footers, and footnotes not extracted from Pages documents
                 Key: TIKA-906
                 URL: https://issues.apache.org/jira/browse/TIKA-906
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: Windows 7
            Reporter: Gabriel Valencia
         Attachments: testPagesHeadersFootersFootnotesJIRA.pages

Tika does not extract anything from the header or footer area and also does not extract footnotes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-906:
-----------------------------------


- push to 1.3
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.3
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Updated] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Valencia updated TIKA-906:
----------------------------------

    Attachment: testPagesHeadersFootersFootnotesJIRA.pages

Contains header text, footer text (including automatic page numbering), and some footnotes.
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iwork
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Commented] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264059#comment-13264059 ] 

Nick Burch commented on TIKA-906:
---------------------------------

Support added in r1331618. We can now extract headers, footers and footnotes, assuming a file has only one set of each, with the default names. (If a file has multiple styles with different headers or footers, the code will likely end up with only the last one.)

Note that we are rapidly approaching the point where the current model for the parser won't cope. At that point, we'll need to hold things like styles, headers and footers properly, track more state as we process the file (a single state level isn't really enough), and be aware of the styles applied to text.
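
The state-tracking concern above can be illustrated with a minimal sketch. It is not the code from r1331618: the class name RegionTrackingHandler, the element names sf:header and sf:footnote, and the urn:example:sf namespace in the sample are assumptions made for illustration (only sf:p and sf:footer are named elsewhere in this thread). The idea is to keep a stack of enclosing regions and a list of paragraphs per region, so a second header or footer is appended instead of overwriting the first.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: tracks which region (header, footer, footnote, body)
// each paragraph belongs to, using a stack instead of a single state field.
class RegionTrackingHandler extends DefaultHandler {
    // Innermost enclosing region is on top of the stack.
    private final Deque<String> regions = new ArrayDeque<>();
    // Every paragraph is kept, grouped by region, in document order.
    private final Map<String, List<String>> textByRegion = new LinkedHashMap<>();
    // Buffer for the paragraph currently being read, or null outside sf:p.
    private StringBuilder paragraph;

    // Assumed element names; sf:header and sf:footnote are guesses.
    private static boolean isRegion(String qName) {
        return qName.equals("sf:header")
                || qName.equals("sf:footer")
                || qName.equals("sf:footnote");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isRegion(qName)) {
            regions.push(qName.substring("sf:".length()));
        } else if (qName.equals("sf:p")) {
            paragraph = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (paragraph != null) {
            paragraph.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isRegion(qName)) {
            regions.pop();
        } else if (qName.equals("sf:p") && paragraph != null) {
            String region = regions.isEmpty() ? "body" : regions.peek();
            textByRegion.computeIfAbsent(region, k -> new ArrayList<>())
                    .add(paragraph.toString().trim());
            paragraph = null;
        }
    }

    // Parses the XML with a non-namespace-aware SAX parser, so elements are
    // matched by their prefixed names exactly as written in the file.
    static Map<String, List<String>> extract(String xml) {
        try {
            RegionTrackingHandler handler = new RegionTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.textByRegion;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sf:doc xmlns:sf=\"urn:example:sf\">"
                + "<sf:header><sf:p>Header A</sf:p></sf:header>"
                + "<sf:header><sf:p>Header B</sf:p></sf:header>"
                + "<sf:p>Body text</sf:p>"
                + "</sf:doc>";
        System.out.println(extract(xml));
    }
}
```

This sketch does not handle a paragraph nested inside another paragraph (for example a footnote embedded in body text); that would need a stack of buffers as well, which is the kind of extra state described above.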
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Updated] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-906:
-----------------------------------

    Fix Version/s:     (was: 1.2)
                   1.3

- push to 1.3
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.3
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Reopened] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Valencia reopened TIKA-906:
-----------------------------------


Going to reopen in light of the automatic page number issue.
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Resolved] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-906.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.2
    
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Commented] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Ray Gauss II (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424828#comment-13424828 ] 

Ray Gauss II commented on TIKA-906:
-----------------------------------

AutoPageNumberUtilsTest.java is missing a license header, which is causing the RAT check to fail.

Shall I add the header?
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Commented] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266797#comment-13266797 ] 

Gabriel Valencia commented on TIKA-906:
---------------------------------------

This document also has automatic page numbering in the footer, but that doesn't get parsed. The page number is stored as an sf:page-number inside the sf:p in the sf:footer. However, there is only one of them even though the document has 2 pages. I guess the rest are added automatically by Pages.
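
A minimal sketch of how the single sf:page-number element could be surfaced, assuming the layout described above (an sf:page-number inside an sf:p inside an sf:footer). The class name FooterPageNumberSketch, the "#" placeholder and the urn:example:sf namespace in the sample are hypothetical, and this is not necessarily how Tika handles it. Because the file stores one element rather than one number per page, the sketch emits a placeholder instead of a concrete number.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: renders footer paragraphs as text, replacing the
// automatic page number element with a placeholder token.
class FooterPageNumberSketch {
    // Assumed placeholder; the real per-page values are not stored in the file.
    static final String PLACEHOLDER = "#";

    static List<String> footerParagraphs(String xml) {
        try {
            // Default factory is not namespace-aware, so tag names keep their
            // "sf:" prefix and can be matched as plain strings.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            NodeList footers = doc.getElementsByTagName("sf:footer");
            for (int i = 0; i < footers.getLength(); i++) {
                Element footer = (Element) footers.item(i);
                NodeList paragraphs = footer.getElementsByTagName("sf:p");
                for (int j = 0; j < paragraphs.getLength(); j++) {
                    StringBuilder text = new StringBuilder();
                    render(paragraphs.item(j), text);
                    out.add(text.toString());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // Appends text nodes as-is, swaps sf:page-number for the placeholder,
    // and recurses into any other child element.
    private static void render(Node node, StringBuilder text) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                text.append(c.getNodeValue());
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                if ("sf:page-number".equals(c.getNodeName())) {
                    text.append(PLACEHOLDER);
                } else {
                    render(c, text);
                }
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<sf:doc xmlns:sf=\"urn:example:sf\">"
                + "<sf:footer><sf:p>Page <sf:page-number/></sf:p></sf:footer>"
                + "</sf:doc>";
        System.out.println(footerParagraphs(xml));
    }
}
```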
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Commented] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425243#comment-13425243 ] 

Dave Meikle commented on TIKA-906:
----------------------------------

Sorry - I missed the header the first time.  Added it now in r1367301.

Thanks for spotting it, Ray.
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Resolved] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle resolved TIKA-906.
------------------------------

    Resolution: Fixed
    
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.3
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Updated] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Gabriel Valencia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Valencia updated TIKA-906:
----------------------------------

    Issue Type: Improvement  (was: Bug)
    
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iwork
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Commented] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409095#comment-13409095 ] 

Dave Meikle commented on TIKA-906:
----------------------------------

Support for AutoPageNumbers added in r1358856.
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.3
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Updated] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-906:
-----------------------------

    Fix Version/s:     (was: 1.3)
                   1.2
    
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.


[jira] [Commented] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424834#comment-13424834 ] 

Michael McCandless commented on TIKA-906:
-----------------------------------------

bq. Shall I add the header?

+1
                
> Headers, footers, and footnotes not extracted from Pages documents
> ------------------------------------------------------------------
>
>                 Key: TIKA-906
>                 URL: https://issues.apache.org/jira/browse/TIKA-906
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows 7
>            Reporter: Gabriel Valencia
>              Labels: iWork
>             Fix For: 1.2
>
>         Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does not extract footnotes.
