Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2012/07/01 23:30:52 UTC

[jira] [Created] (TIKA-946) Improve how the PPTX parser uses XSLF from POI

Nick Burch created TIKA-946:
-------------------------------

             Summary: Improve how the PPTX parser uses XSLF from POI
                 Key: TIKA-946
                 URL: https://issues.apache.org/jira/browse/TIKA-946
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.2
            Reporter: Nick Burch


One last bit from TIKA-757 and TIKA-805: the way PPTX files are currently parsed using XSLF from Apache POI still relies on a couple of low-level parts.

We should avoid the need to go from the usermodel XMLSlideShow to the low-level XSLFSlideShow to do the text extraction (this happens in XSLFPowerPointExtractorDecorator).

We should also update the usermodel slide support to extract the slide names from docProps/app.xml, so that these can easily be included in the text output (in XSLFPowerPointExtractor).
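
For illustration only, here is a rough sketch of how slide names might be read out of docProps/app.xml using nothing but the JDK. It is not the POI or Tika implementation; the class name SlideTitles, the method slideTitles and the SAMPLE document are hypothetical. It assumes the usual extended-properties layout, where HeadingPairs lists (group name, count) pairs and TitlesOfParts holds the part names in the same order, and it assumes the slide group is labelled "Slide Titles", which may differ in files written by a localized Office.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or Tika.
class SlideTitles {
    static final String EP_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
    static final String VT_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes";

    // Made-up app.xml content: one theme part followed by two slides.
    static final String SAMPLE =
        "<Properties xmlns=\"" + EP_NS + "\" xmlns:vt=\"" + VT_NS + "\">"
        + "<HeadingPairs><vt:vector size=\"4\" baseType=\"variant\">"
        + "<vt:variant><vt:lpstr>Theme</vt:lpstr></vt:variant>"
        + "<vt:variant><vt:i4>1</vt:i4></vt:variant>"
        + "<vt:variant><vt:lpstr>Slide Titles</vt:lpstr></vt:variant>"
        + "<vt:variant><vt:i4>2</vt:i4></vt:variant>"
        + "</vt:vector></HeadingPairs>"
        + "<TitlesOfParts><vt:vector size=\"3\" baseType=\"lpstr\">"
        + "<vt:lpstr>Office Theme</vt:lpstr>"
        + "<vt:lpstr>Intro</vt:lpstr>"
        + "<vt:lpstr>Agenda</vt:lpstr>"
        + "</vt:vector></TitlesOfParts>"
        + "</Properties>";

    // Returns the entries of TitlesOfParts that belong to the "Slide Titles" group.
    static List<String> slideTitles(byte[] appXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(appXml));

        List<String> groupNames = texts(doc, "HeadingPairs", "lpstr");
        List<String> groupSizes = texts(doc, "HeadingPairs", "i4");
        List<String> parts = texts(doc, "TitlesOfParts", "lpstr");

        // Walk the groups in order, tracking where each one starts in TitlesOfParts.
        int offset = 0;
        for (int i = 0; i < Math.min(groupNames.size(), groupSizes.size()); i++) {
            int size = Integer.parseInt(groupSizes.get(i).trim());
            if ("Slide Titles".equals(groupNames.get(i))) {
                int end = Math.min(offset + size, parts.size());
                return new ArrayList<>(parts.subList(Math.min(offset, end), end));
            }
            offset += size;
        }
        return new ArrayList<>();
    }

    // Collects the text of every vt:<child> element under the first <parent> element.
    private static List<String> texts(Document doc, String parent, String child) {
        List<String> out = new ArrayList<>();
        NodeList parents = doc.getElementsByTagNameNS(EP_NS, parent);
        if (parents.getLength() == 0) {
            return out;
        }
        NodeList kids = ((Element) parents.item(0)).getElementsByTagNameNS(VT_NS, child);
        for (int i = 0; i < kids.getLength(); i++) {
            out.add(kids.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // prints [Intro, Agenda]
        System.out.println(slideTitles(SAMPLE.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In practice the bytes would come from the docProps/app.xml entry of the .pptx package (for example via java.util.zip.ZipFile), and a usermodel-level accessor in POI would hide this parsing from Tika entirely.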

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-946) Improve how the PPTX parser uses XSLF from POI

Posted by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452973#comment-13452973 ] 

Daniel Bonniot de Ruisselet commented on TIKA-946:
--------------------------------------------------

Would it also be in scope for this task to have the output reflect the slide structure (one <div> element per slide)?