Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/10/03 19:59:00 UTC

[jira] [Commented] (TIKA-2735) notes and footer contents are duplicated in extracting text from power point slides

    [ https://issues.apache.org/jira/browse/TIKA-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637457#comment-16637457 ] 

Tim Allison commented on TIKA-2735:
-----------------------------------

I'm sorry for my delay. If you're talking about the copyright statement, then, as you point out, it is being extracted twice: once from the slide master and once from the slide notes. We don't currently have a configuration option to turn off either the master or the notes. We could easily add {{includeSlideMasterContent(boolean b)}} and {{includeSlideNoteContent(boolean b)}}, both defaulting to {{true}}, but that seems quite use-case specific. What do you think?

 
{noformat}
<title>Slide 1</title>
</head>
<body><div class="slideShow"><div class="slide"><div class="slide-master-content"><p>Copyright © 2007, SAS Institute Inc. All rights reserved.</p>
</div>
<div class="slide-content"><p>Text Miner/Teragram Integration</p>
<p>Jim Cox</p>
</div>
<div class="slide-notes"><p>Copyright © 2007, SAS Institute Inc. All rights reserved.</p>
<p>*</p>
<p />
</div>
</div>{noformat}
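
Until such options exist, one possible workaround is to filter the XHTML on the caller's side. The following is only a rough sketch, not Tika API: {{SlideSectionFilter}} and {{extract}} are hypothetical names, the two constructor flags merely mirror the setters proposed above, and the sketch assumes the {{div}} class names stay exactly as in the output shown ({{slide-master-content}} and {{slide-notes}}). It uses only the JDK's SAX classes.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical helper (not part of Tika): collects paragraph text from the
 * slide XHTML while skipping every div whose class is marked as excluded.
 */
class SlideSectionFilter extends DefaultHandler {
    private final Set<String> skippedClasses = new HashSet<>();
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside a suppressed div

    SlideSectionFilter(boolean includeSlideMasterContent, boolean includeSlideNoteContent) {
        // Class names are assumed to match the extractor output shown above.
        if (!includeSlideMasterContent) {
            skippedClasses.add("slide-master-content");
        }
        if (!includeSlideNoteContent) {
            skippedClasses.add("slide-notes");
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside a suppressed div
        } else if ("div".equals(qName) && skippedClasses.contains(atts.getValue("class"))) {
            skipDepth = 1; // entering a suppressed div
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        } else if ("p".equals(qName)) {
            text.append('\n'); // one line per kept paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString();
    }

    /** Parses well-formed XHTML and returns the text of the sections that were kept. */
    static String extract(String xhtml, boolean includeMaster, boolean includeNotes) {
        SlideSectionFilter handler = new SlideSectionFilter(includeMaster, includeNotes);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse slide XHTML", e);
        }
        return handler.getText();
    }
}
```

Calling {{SlideSectionFilter.extract(xhtml, false, false)}} on a complete, well-formed version of the output above should leave only the {{slide-content}} paragraphs. Whitespace between tags is passed through as text, so real output may need trimming. The same depth-counting logic could presumably be moved into a {{ContentHandler}} decorator wrapped around whatever handler is passed to the parser, but that is untested here.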

> notes and footer contents are duplicated in extracting text from power point slides
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-2735
>                 URL: https://issues.apache.org/jira/browse/TIKA-2735
>             Project: Tika
>          Issue Type: Bug
>          Components: handler
>    Affects Versions: 1.18
>            Reporter: feng ye
>            Priority: Major
>         Attachments: Oneslide.ppt, pptTextResults.txt
>
>
> Notes and footer contents are duplicated at the end when extracting text from PPT slides (like the one in the attachment). Both the input file and the text results are attached.
> Is there a configuration option that can be used to suppress this kind of duplication?


