Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2015/10/08 04:48:27 UTC

[jira] [Commented] (TIKA-1755) Make ppt and pptx paragraph/div breaks more consistent

    [ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947991#comment-14947991 ] 

Hudson commented on TIKA-1755:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #866 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/866/])
TIKA-1755 make div and other formatting more consistent btwn PPT and PPTX (tallison: [http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1707432])
* trunk/CHANGES.txt
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* trunk/tika-parsers/src/test/resources/test-documents/testPPT_comment.ppt
* trunk/tika-parsers/src/test/resources/test-documents/testPPT_comment.pptx


> Make ppt and pptx paragraph/div breaks more consistent
> ------------------------------------------------------
>
>                 Key: TIKA-1755
>                 URL: https://issues.apache.org/jira/browse/TIKA-1755
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA-1755.patch
>
>
> In working on [~kiwiwings]'s patch for the new handling of PPT/X, I found that our PPT/PPTX parsers behave very differently with <p> and <div> breaks, especially now that we've applied the upgrades from TIKA-1707.
> I propose adding quite a few more <p> elements to capture the sentence/bullet-level breaks in PPTX, as we're now doing for PPT.
> There are also a handful of other things we could clean up, such as table handling.
> Some of these changes may be relevant to this [discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3CCAL8PwkY96_GKJmps6ZXuoe7H7-byvpxJbkTBuy1goKU3sKZMtQ@mail.gmail.com%3E].  [~shaie], any input?
> Patch and example output to follow.
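
One rough way to see the inconsistency described above is to count the <p> and <div> opening tags in the XHTML that each parser emits for the same slide deck. The following is only a minimal sketch under stated assumptions: it uses the Java standard library alone, it assumes the XHTML for each format has already been captured as a string, and the class name, method names, and sample strings are all hypothetical (they are not part of Tika and are not real parser output).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: counts structural tags in XHTML text.
final class TagBreakCounter {

    // Counts opening (or self-closing) tags with the given name, e.g. <p>,
    // <p class="x"> or <p/>. Closing tags such as </p> are not counted, and
    // longer names that merely start with the same letters (<pre>) are ignored.
    static int countOpenTags(String xhtml, String tagName) {
        Pattern open = Pattern.compile("<" + Pattern.quote(tagName) + "(\\s[^>]*)?/?>");
        Matcher m = open.matcher(xhtml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Summarizes the <p> and <div> counts for one document.
    static Map<String, Integer> summarize(String xhtml) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("p", countOpenTags(xhtml, "p"));
        counts.put("div", countOpenTags(xhtml, "div"));
        return counts;
    }

    public static void main(String[] args) {
        // Made-up samples standing in for the output of the two parsers.
        String pptLike = "<div><p>Title</p><p>Bullet one</p><p>Bullet two</p></div>";
        String pptxLike = "<div><p>Title Bullet one Bullet two</p></div>";

        System.out.println("ppt-like:  " + summarize(pptLike));
        System.out.println("pptx-like: " + summarize(pptxLike));
    }
}
```

A regular expression is enough here because only tag counts are compared. A fuller check would parse the XHTML and compare the element structure slide by slide.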



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)