Posted to dev@tika.apache.org by "Sam H (JIRA)" <ji...@apache.org> on 2016/02/01 14:37:39 UTC

[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

    [ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126248#comment-15126248 ] 

Sam H commented on TIKA-1841:
-----------------------------

Hi [~gagravarr],

There has been no reaction to this issue in the past 6 days. Can I assume my proposed structure is ok?

I have already started implementing this:
https://github.com/zetisam/tika/tree/TIKA-1841

The PPT code allows you to get the slide-notes-footer and slide-notes-header separately, but the POI code seems to add these fields to the output anyway, so I don't know if this is of much use.

I couldn't find how to do this in PPTX, so maybe this part can be dropped (in order not to have duplicate content).

The same for slide footers in general. They seem to be added to the content, so having them as a separate div would be duplicating this content.

Any thoughts?

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML output structure differs between PowerPoint (PPT) and PPTX files.
> The structure for PPTX seems as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of each slide.
> For PPT the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes">
> {code}
> In my application, I'm using XPath to get the desired information. As the XML structure is different, I have to use different XPath queries depending on whether the file is PPT (old) or PPTX (new). It would be nice for Tika to return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm willing to donate my time.
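
To illustrate the XPath point from the issue description, here is a minimal sketch of how a consumer might read per-slide content from the two shapes. It is not Tika code. The sample documents, the <body> wrapper, the slide text, and the function names slides_nested and slides_flat are all made up for illustration. Real Tika output is namespaced XHTML, so actual queries would presumably also need a namespace mapping, which is left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the flat shape the issue describes for PPTX.
PPTX_FLAT = """<body>
<div class="slide-content">Slide 1</div>
<div class="slide-master-content" />
<div class="slide-notes">Notes 1</div>
<div class="slide-content">Slide 2</div>
<div class="slide-master-content" />
</body>"""

# Hypothetical sample in the proposed nested shape.
PROPOSED = """<body>
<div class="slideShow">
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content">Slide 1</div>
    <div class="slide-notes">Notes 1</div>
  </div>
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content">Slide 2</div>
  </div>
</div>
</body>"""


def slides_nested(xml_text):
    """Return (content, notes) per slide, relying on the parent slide div."""
    root = ET.fromstring(xml_text)
    result = []
    for slide in root.findall(".//div[@class='slide']"):
        content = slide.findtext("div[@class='slide-content']", default="")
        notes = slide.findtext("div[@class='slide-notes']")  # None if absent
        result.append((content, notes))
    return result


def slides_flat(xml_text):
    """Return (content, notes) per slide from flat siblings.

    Without a parent element, slide boundaries have to be inferred:
    here each slide-content div is assumed to start a new slide.
    """
    root = ET.fromstring(xml_text)
    result = []
    for div in root.findall("div"):
        cls = div.get("class")
        if cls == "slide-content":
            result.append((div.text or "", None))
        elif cls == "slide-notes" and result:
            result[-1] = (result[-1][0], div.text)
    return result


if __name__ == "__main__":
    print(slides_nested(PROPOSED))
    print(slides_flat(PPTX_FLAT))
```

With a parent slide element, one query per slide is enough. With the flat shape, the caller has to infer slide boundaries from sibling order, which is the extra per-format handling the issue asks to remove.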



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)