Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/12 19:46:09 UTC

[jira] [Created] (TIKA-712) Master slide text isn't extracted

Master slide text isn't extracted
---------------------------------

                 Key: TIKA-712
                 URL: https://issues.apache.org/jira/browse/TIKA-712
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
             Fix For: 1.0


It looks like we are not getting text from the master slide for PPT
and PPTX.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113948#comment-13113948 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

I committed a change to temporarily turn off master text.

Curiously, the new unit tests still passed ;)  Somehow we are now extracting the footer text properly for both PPT and PPTX!  I think this is because footer is somehow "special".

I'll make a new unit test that shows we are failing to extract master text...


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109726#comment-13109726 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

OK, the good news is: I now see the master slide's text being
extracted; thanks Nick!

But the bad news is: we are now also extracting all the "boilerplate"
text that is included in the master slide by default.

For example, if I open PowerPoint 2007, make no changes and just save
that one blank slide as PPTX, then get the text from it using TikaCLI, I see
this:

{noformat}
Click to edit Master title style
Click to edit Master text styles
Second level
Third level
Fourth level
Fifth level
{noformat}

This is the boilerplate text from that initial title slide's master
slide.  I think we should somehow not include it, but I have no idea
how... does PPT/X somehow note that this is "fake" boilerplate text?
Somehow PowerPoint knows not to display this when I view the slide...


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128227#comment-13128227 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

I tested the current XSLFPowerPointExtractor on POI's trunk and it works great (it preserves the footer text and emits no placeholder text for my PPTX test case).

But for PPT files (using PowerPointExtractor) we still pull the boilerplate text.  That's expected, right?  (I.e., we haven't fixed that case yet.)
                

[jira] [Updated] (TIKA-712) Master slide text isn't extracted

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-712:
-------------------------------

    Fix Version/s:     (was: 0.10)


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103607#comment-13103607 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

bq. Any chance you could open two POI bugs, one for HSLF and one for XSLF, and include the test files there too?

Will do!


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114046#comment-13114046 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

OK I committed four new failing (disabled) test cases, showing that we
don't extract text elements inherited from master/layout slide.

I played around some more with the master layout/slides and I think I
know what we need to do for XSLF (but I have no idea for HSLF;
hopefully it's somehow "parallel"):

  * We'll have to look at the inheritance from slide -> layout ->
    master, so that a slide's text is the union of its actual text,
    plus text from its slide layout, plus text from the master.  The
    files in the _rels dir link a slide to its slideLayout, and a
    slideLayout to its slideMaster.

  * For each text element on slideLayout and slideMaster, we must
    check for the presence of the p:sp -> p:nvSpPr -> p:nvPr -> p:ph
    element.  For example, {{<p:ph type="body" idx="1"/>}}.  The ph
    stands for "place holder", and it seems to mean it's not really
    rendered. When I manually edited the XML in my doc to insert a
    p:ph on text I had added, and viewed that in PowerPoint, it indeed
    stopped rendering it.  So if p:ph is present we should skip that text.

I think that should work!  But I don't know where/how to do this;
likely we need to do this first in POI?  Should I open an issue there?
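
The two bullets above can be sketched as follows. This is only an
illustration in Python using the standard library, not POI or Tika code.
The helper names (shape_texts, slide_text, _part, _shape) and the tiny
demo parts are hypothetical, and the rule "skip any layout/master shape
that has a p:ph" is the assumption proposed in this comment, not
something confirmed by the spec here.

```python
import xml.etree.ElementTree as ET

NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def shape_texts(part_xml, skip_placeholders):
    """Collect the text of each p:sp shape, optionally skipping placeholders."""
    texts = []
    for shape in ET.fromstring(part_xml).iter("{%s}sp" % NS["p"]):
        # Assumption from the comment above: p:sp -> p:nvSpPr -> p:nvPr -> p:ph
        # marks a placeholder whose text is not rendered from layout/master.
        if skip_placeholders and shape.find("p:nvSpPr/p:nvPr/p:ph", NS) is not None:
            continue
        text = " ".join(t.text.strip() for t in shape.iterfind(".//a:t", NS) if t.text)
        if text:
            texts.append(text)
    return texts


def slide_text(slide_xml, layout_xml, master_xml):
    """Union of the slide's own text plus non-placeholder layout and master text."""
    return (shape_texts(slide_xml, skip_placeholders=False)
            + shape_texts(layout_xml, skip_placeholders=True)
            + shape_texts(master_xml, skip_placeholders=True))


def _part(root_tag, shapes):
    """Build a tiny, hypothetical part for the demo (not a complete PPTX part)."""
    return ('<p:%s xmlns:p="%s" xmlns:a="%s"><p:cSld><p:spTree>%s</p:spTree></p:cSld></p:%s>'
            % (root_tag, NS["p"], NS["a"], "".join(shapes), root_tag))


def _shape(text, ph=""):
    """Build one hypothetical p:sp shape, with an optional p:ph element."""
    return ('<p:sp><p:nvSpPr><p:nvPr>%s</p:nvPr></p:nvSpPr>'
            '<p:txBody><a:p><a:r><a:t>%s</a:t></a:r></a:p></p:txBody></p:sp>' % (ph, text))


SLIDE = _part("sld", [_shape("My slide title", '<p:ph type="title"/>')])
LAYOUT = _part("sldLayout", [_shape("Click to edit Master title style", '<p:ph type="title"/>')])
MASTER = _part("sldMaster", [
    _shape("Click to edit Master text styles", '<p:ph type="body" idx="1"/>'),
    _shape("Text box added to the master"),
])

print(slide_text(SLIDE, LAYOUT, MASTER))
```

Text on the slide itself is always kept in this sketch, including text
in placeholders, since a footer that has been enabled appears in
slideN.xml directly.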



[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112801#comment-13112801 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

Good idea!  Nice how approachable OOXML is...

In theory the answer is here:
http://www.ecma-international.org/publications/standards/Ecma-376.htm
but I have not tried to dig.
 
So, here's a boilerplate-only chunk from the master slide (PowerPoint does not display this on the slide):

{noformat}
      <p:sp>
	<p:nvSpPr>
	  <p:cNvPr id="2" name="Title Placeholder 1"/>
	  <p:cNvSpPr>
	    <a:spLocks noGrp="1"/>
	  </p:cNvSpPr>
	  <p:nvPr>
	    <p:ph type="title"/>
	  </p:nvPr>
	</p:nvSpPr>
	<p:spPr>
	  <a:xfrm>
	    <a:off x="457200" y="274638"/>
	    <a:ext cx="8229600" cy="1143000"/>
	  </a:xfrm>
	  <a:prstGeom prst="rect">
	    <a:avLst/>
	  </a:prstGeom>
	</p:spPr>
	<p:txBody>
	  <a:bodyPr vert="horz" lIns="91440" tIns="45720" rIns="91440" bIns="45720" rtlCol="0" anchor="ctr">
	    <a:normAutofit/>
	  </a:bodyPr>
	  <a:lstStyle/>
	  <a:p>
	    <a:r>
	      <a:rPr lang="en-US" smtClean="0"/>
	      <a:t>Click to edit Master title style
	      </a:t>
	    </a:r>
	    <a:endParaRPr lang="en-US"/>
	  </a:p>
	</p:txBody>
      </p:sp>
{noformat}

And here's the footer I edited (PowerPoint does display this on the slide):

{noformat}
      <p:sp>
	<p:nvSpPr>
	  <p:cNvPr id="5" name="Footer Placeholder 4"/>
	  <p:cNvSpPr>
	    <a:spLocks noGrp="1"/>
	  </p:cNvSpPr>
	  <p:nvPr>
	    <p:ph type="ftr" sz="quarter" idx="3"/>
	  </p:nvPr>
	</p:nvSpPr>
	<p:spPr>
	  <a:xfrm>
	    <a:off x="3124200" y="6356350"/>
	    <a:ext cx="2895600" cy="365125"/>
	  </a:xfrm>
	  <a:prstGeom prst="rect">
	    <a:avLst/>
	  </a:prstGeom>
	</p:spPr>
	<p:txBody>
	  <a:bodyPr vert="horz" lIns="91440" tIns="45720" rIns="91440" bIns="45720" rtlCol="0" anchor="ctr"/>
	  <a:lstStyle>
	    <a:lvl1pPr algn="ctr">
	      <a:defRPr sz="1200">
		<a:solidFill>
		  <a:schemeClr val="tx1">
		    <a:tint val="75000"/>
		  </a:schemeClr>
		</a:solidFill>
	      </a:defRPr>
	    </a:lvl1pPr>
	  </a:lstStyle>
	  <a:p>
	    <a:r>
	      <a:rPr lang="en-US" smtClean="0"/>
	      <a:t>Slide footer is right here
	      </a:t>
	    </a:r>
	    <a:endParaRPr lang="en-US"/>
	  </a:p>
	</p:txBody>
      </p:sp>
{noformat}

I can't spot any obvious differences at a quick glance... I'll attach the full
master slide XML (there's lots of other stuff); it could be that the
difference is elsewhere in there.




[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103515#comment-13103515 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

I think ideally we'd have each slide inline the text from its corresponding master?

But if this is too hard then I think outputting text for each master slide just once somewhere is better than nothing?


[jira] [Updated] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-712:
------------------------------------

    Attachment: testPPT_masterFooter.pptx
                testPPT_masterFooter.ppt
                TIKA-712.patch

Test case that fails.


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103679#comment-13103679 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

OK I opened https://issues.apache.org/bugzilla/show_bug.cgi?id=51803 (PPT) and https://issues.apache.org/bugzilla/show_bug.cgi?id=51804 (PPTX).


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112808#comment-13112808 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

I suppose a hackish solution would be to explicitly filter out the known boilerplate text that PowerPoint includes.  But this is scary of course, because in theory a PPT/PPTX may in fact legitimately have this text on its master slides, which would be rather confusing.  Hmm, lemme try actually making that my text, saving, and diffing the two.
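
For illustration only, the hackish filter might look like the sketch
below (Python, standard library). The string set is just the six lines
from the TikaCLI output quoted earlier in this issue; it is not a
complete list, and the function name is hypothetical.

```python
# Boilerplate lines seen in the TikaCLI output for a blank PowerPoint 2007 slide.
# Assumption: other versions or locales may use different strings.
KNOWN_BOILERPLATE = {
    "Click to edit Master title style",
    "Click to edit Master text styles",
    "Second level",
    "Third level",
    "Fourth level",
    "Fifth level",
}


def drop_known_boilerplate(lines):
    """Drop every line that exactly matches a known boilerplate string."""
    return [line for line in lines if line.strip() not in KNOWN_BOILERPLATE]


extracted = [
    "Click to edit Master title style",
    "Slide footer is right here",
    "Second level",
]
print(drop_known_boilerplate(extracted))
```

The weakness is visible in the demo: a master slide that legitimately
contains the text "Second level" would lose it too.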


[jira] [Updated] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-712:
------------------------------------

    Attachment: testPPT_masterFooter2.pptx
                testPPT_masterFooter2.ppt

Corrected attachments -- the last attachments didn't actually render the master slide's footer text onto the slide.


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112875#comment-13112875 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

Maybe, until we work this out, we should turn off extracting anything
from the master slides?  Chris is about to build the release bits for
0.10...

So I did some sleuthing.  This is all new to me so this is really just
speculative but I think I learned a few things:

  * Each slide refers to a slideLayouts/slideLayoutN.xml, from the
    _rels/slideN.xml.rels file.

  * In turn, each slideLayoutN.xml refers to a
    slideMasters/slideMasterN.xml, from the _rels/slideLayoutN.xml.rels
    file.

  * Simply editing footer text on the slide's master is not sufficient
    to see that text on the slide; you must also go to Insert ->
    Header & Footer and check the box to display footer/slide
    number/date and time.

  * If I enable footers like that, the slideN.xml actually includes
    the footer text; now, I'm not sure why Tika didn't see this before
    we changed anything.

  * If, instead, I go to the slide master and manually insert my own
    text box, then it comes through on the slides, however Tika
    (current trunk) fails to extract this onto the slide even though
    PowerPoint renders it... so we are still missing something here,
    maybe because we only render the master for the slide and not
    its layout?

  * That manually inserted element has a unique {{<p:nvPr
    userDrawn="1"/>}} under p:sp -> p:nvSpPr... maybe POI/Tika can
    interpret that to mean "include this text".

  * I suspect the p:ph element (under p:sp -> p:nvSpPr -> p:nvPr) may
    be important here... it seems to specify the "type" of the
    element, and it seems to be included in all the "boilerplate"
    elements but NOT in the new element I added to the master.  You
    can see it in my examples above (type="ftr" and type="title").
    Maybe POI/Tika can interpret the presence of this p:ph element
    to mean that text should not be included in the slide?

I'm not yet sure how to boil this all down to what POI/Tika can
concretely use to identify what should be included and what should
not but it seems like progress...
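
The relationship chain from the first two bullets above can be followed
with a few lines of Python (standard library only). This is a sketch,
not what POI or Tika does: the function names are hypothetical, and it
assumes each part has at most one slideLayout or slideMaster
relationship and that targets are relative paths.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def rels_path(part):
    """Name of the .rels file for a part, e.g. slides/_rels/slide1.xml.rels."""
    folder, name = posixpath.split(part)
    return posixpath.join(folder, "_rels", name + ".rels")


def related_part(rels_xml, part, kind):
    """Resolve the first relationship whose Type ends in /kind, relative to part."""
    for rel in ET.fromstring(rels_xml).iter(REL_NS + "Relationship"):
        if rel.get("Type", "").endswith("/" + kind):
            target = posixpath.join(posixpath.dirname(part), rel.get("Target"))
            return posixpath.normpath(target)
    return None


def inheritance_chain(pptx_path, slide_part):
    """Return the [slide, layout, master] part names for one slide in a .pptx."""
    chain = [slide_part]
    with zipfile.ZipFile(pptx_path) as archive:
        for kind in ("slideLayout", "slideMaster"):
            found = related_part(archive.read(rels_path(chain[-1])), chain[-1], kind)
            if found is None:
                break
            chain.append(found)
    return chain


# Hypothetical .rels content of the shape found next to a slide part.
DEMO_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" '
    'Target="../slideLayouts/slideLayout1.xml"/>'
    '</Relationships>'
)

print(rels_path("ppt/slides/slide1.xml"))
print(related_part(DEMO_RELS, "ppt/slides/slide1.xml", "slideLayout"))
```

Calling inheritance_chain("testPPT_masterFooter2.pptx",
"ppt/slides/slide1.xml") on one of the attached files should list the
slide, its layout and its master, assuming the file uses the usual part
names.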



[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109659#comment-13109659 ] 

Nick Burch commented on TIKA-712:
---------------------------------

POI enhancements done, and Tika code (some interim) committed in r1173761.

Michael - any chance you could test, and then commit your unit test if all looks fine for you too?


[jira] [Updated] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-712:
------------------------------------

    Attachment: TIKA-712-master-slide.xml

Full master slide XML.


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114700#comment-13114700 ] 

Nick Burch commented on TIKA-712:
---------------------------------

It looks like we only want to exclude the placeholder ones on the layout and master slides, and only then if they're not custom.

Well, unless there isn't a matching placeholder on the slide itself....

Ideally we'll want to expand POI to have a full model for this. For now, I've got something roughly working in POI in XSLFPowerPointExtractor. If the logic in there seems ok, we can implement the same in Tika when we move to POI 3.8 beta 5


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112202#comment-13112202 ] 

Nick Burch commented on TIKA-712:
---------------------------------

I'd suggest you take the pptx file (it'll be simpler to poke around in than the ppt one), and unzip it. Then, look at the xml file for the master slide, and see how the text you've added differs from the boilerplate parts. Are there any obvious differences between the two? Are they in different sections? Different xml? Anything we could filter on?
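
One way to do that poking around without unzipping by hand is a short
Python script (standard library only). It is a sketch under a few
assumptions: master slides live under ppt/slideMasters/ in the archive,
and the p:ph element and userDrawn attribute are the details worth
comparing. The function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def describe_shapes(part_xml):
    """Summarise each p:sp shape: name, placeholder type, userDrawn flag, text."""
    rows = []
    for shape in ET.fromstring(part_xml).iter("{%s}sp" % NS["p"]):
        name = shape.find("p:nvSpPr/p:cNvPr", NS)
        nv_pr = shape.find("p:nvSpPr/p:nvPr", NS)
        ph = nv_pr.find("p:ph", NS) if nv_pr is not None else None
        rows.append({
            "name": name.get("name") if name is not None else None,
            "is_placeholder": ph is not None,
            "ph_type": ph.get("type") if ph is not None else None,
            "user_drawn": nv_pr is not None and nv_pr.get("userDrawn") == "1",
            "text": " ".join(t.text.strip() for t in shape.iterfind(".//a:t", NS) if t.text),
        })
    return rows


def describe_masters(pptx_path):
    """Print a shape summary for every master slide part in a .pptx file."""
    with zipfile.ZipFile(pptx_path) as archive:
        for part in sorted(archive.namelist()):
            if part.startswith("ppt/slideMasters/") and part.endswith(".xml"):
                print(part)
                for row in describe_shapes(archive.read(part)):
                    print("  ", row)


# Hypothetical two-shape master: one boilerplate placeholder, one added text box.
SAMPLE = (
    '<p:sldMaster xmlns:p="%s" xmlns:a="%s"><p:cSld><p:spTree>'
    '<p:sp><p:nvSpPr><p:cNvPr id="2" name="Title Placeholder 1"/>'
    '<p:nvPr><p:ph type="title"/></p:nvPr></p:nvSpPr>'
    '<p:txBody><a:p><a:r><a:t>Click to edit Master title style</a:t></a:r></a:p></p:txBody></p:sp>'
    '<p:sp><p:nvSpPr><p:cNvPr id="7" name="TextBox 6"/>'
    '<p:nvPr userDrawn="1"/></p:nvSpPr>'
    '<p:txBody><a:p><a:r><a:t>Text added to the master</a:t></a:r></a:p></p:txBody></p:sp>'
    '</p:spTree></p:cSld></p:sldMaster>' % (NS["p"], NS["a"])
)

for row in describe_shapes(SAMPLE):
    print(row)
```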


[jira] [Updated] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-712:
------------------------------------

    Attachment: TIKA-712.patch

I think I found a committable workaround (patch) for including text from the master slide for PPT documents: I uncommented the existing code, but now exclude text that is type 0 (TITLE_TYPE) or 1 (BODY_TYPE), just for the master slide.  In my ad-hoc testing this eliminates the boilerplate text but lets other user changes to the master slide come through correctly... this isn't perfect but I think it's a good step forward.
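
The filtering rule in the patch boils down to something like the sketch
below. It is written in Python purely for illustration and is not the
patch itself: text runs are modelled as hypothetical (type, text)
pairs, only the two type values named above come from this issue, and
the value 4 in the demo is just an assumed stand-in for "some other
type".

```python
# Type values named in the comment above.
TITLE_TYPE = 0
BODY_TYPE = 1

# On the master slide, runs of these types are assumed to be boilerplate.
MASTER_TYPES_TO_SKIP = {TITLE_TYPE, BODY_TYPE}


def master_text_to_keep(runs):
    """Keep master-slide text whose run type is neither title nor body.

    `runs` is a list of (run_type, text) pairs; this shape is hypothetical.
    """
    return [text for run_type, text in runs if run_type not in MASTER_TYPES_TO_SKIP]


master_runs = [
    (TITLE_TYPE, "Click to edit Master title style"),
    (BODY_TYPE, "Click to edit Master text styles"),
    (4, "Text box added to the master"),  # 4: assumed "other" type for the demo
]
print(master_text_to_keep(master_runs))
```

The filter applies only to the master slide; title and body text on
ordinary slides is unaffected.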
                

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102975#comment-13102975 ] 

Nick Burch commented on TIKA-712:
---------------------------------

We'll probably need to add this in POI, but it shouldn't be too hard.

Do you have a feeling for whether we should process all the master slides after the regular ones, or if we should try to tie each slide back to its master and place the master text inline with the slide's own text?


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103529#comment-13103529 ] 

Nick Burch commented on TIKA-712:
---------------------------------

Makes sense to me. Any chance you could open two POI bugs, one for HSLF and one for XSLF, and include the test files there too?


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508010#comment-13508010 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

I committed the patch; I'll leave this issue open for a possible future correct fix where we can detect boilerplate text in PPT.
                

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109692#comment-13109692 ] 

Michael McCandless commented on TIKA-712:
-----------------------------------------

bq. Michael - any chance you could test, and then commit your unit test if all looks fine for you too?

Excellent, thanks Nick!  I'll test & commit.
