Posted to dev@poi.apache.org by bu...@apache.org on 2018/02/09 18:35:31 UTC

[Bug 62092] New: Text not extracted from grouped text shapes in HSLF

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

            Bug ID: 62092
           Summary: Text not extracted from grouped text shapes in HSLF
           Product: POI
           Version: 3.17-FINAL
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HSLF
          Assignee: dev@poi.apache.org
          Reporter: tallison@mitre.org
  Target Milestone: ---

On TIKA-2569, a user reported that we aren't extracting text from grouped
text shapes in HSLF, although the same content extracts fine from pptx.
I added a workaround at the Tika level for now.

Test file:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPPT_groups.ppt

Unit test at the Tika level:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java#L300

When the user calls getTextParagraphs() on a slide, that should include the
text from grouped textshapes, right?

If not, and the current behavior is intended so that the user has to walk
through HSLFGroupShapes themselves, we can close this out.
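
To make the "walk through the group shapes" option concrete, here is a minimal
sketch of the recursion involved. It deliberately does not use the POI classes:
Shape, TextBox and Group are hypothetical stand-ins invented for illustration,
and the real HSLF types and method names may differ. It only shows why a flat
pass over top-level shapes misses grouped text and how a recursive walk picks
it up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; these are NOT the POI/HSLF classes.
class GroupShapeTextWalker {

    interface Shape {}

    // A leaf shape that carries text.
    static final class TextBox implements Shape {
        final String text;
        TextBox(String text) { this.text = text; }
    }

    // A group that may contain text boxes or further groups.
    static final class Group implements Shape {
        final List<Shape> children;
        Group(List<Shape> children) { this.children = children; }
    }

    // Collects text from all shapes in document order, descending into groups.
    static List<String> collectText(List<? extends Shape> shapes) {
        List<String> out = new ArrayList<>();
        collect(shapes, out);
        return out;
    }

    private static void collect(List<? extends Shape> shapes, List<String> out) {
        for (Shape s : shapes) {
            if (s instanceof TextBox) {
                out.add(((TextBox) s).text);
            } else if (s instanceof Group) {
                // Recurse so nested groups are handled at any depth.
                collect(((Group) s).children, out);
            }
        }
    }

    public static void main(String[] args) {
        List<Shape> slide = List.of(
            new TextBox("Title"),
            new Group(List.of(
                new TextBox("Grouped A"),
                new Group(List.of(new TextBox("Nested B"))))));
        System.out.println(collectText(slide)); // prints [Title, Grouped A, Nested B]
    }
}
```

A caller that only looked at top-level TextBox instances would see "Title" and
nothing else, which matches the symptom reported on the Tika side.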

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

[Bug 62092] Text not extracted from grouped text shapes in HSLF

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #35820|text/plain                  |application/tar+gzip
          mime type|                            |
  Attachment #35820|1                           |0
           is patch|                            |

[Bug 62092] Text not extracted from grouped text shapes in HSLF

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

--- Comment #4 from Andreas Beeker <ki...@apache.org> ---
X/HSLF/Slide.getTextParagraphs() doesn't return HeaderFooters consistently
across the various formats, i.e. PPT<2007 has a special HeaderFooters record
which can't easily be wrapped into a TextParagraph. I guess this method only
existed to give Tika easy access anyway, and that is what the extractor class
is for.

How about delegating to the extractor (and changing the signature) or removing
it?

[Bug 62092] Text not extracted from grouped text shapes in HSLF

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

Andreas Beeker <ki...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
           Hardware|PC                          |All
             Status|NEW                         |RESOLVED

--- Comment #5 from Andreas Beeker <ki...@apache.org> ---
Provided common slideshow extractor via r1829453

[Bug 62092] Text not extracted from grouped text shapes in HSLF

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

--- Comment #2 from Andreas Beeker <ki...@apache.org> ---
Created attachment 35820
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35820&action=edit
SL Common SlideShowExtractor incl. fix for GroupShapes

The patch contains the SL Common SlideShowExtractor which made quite a few
changes necessary to handle Placeholders and header/footer information across
X/HSLF.
This also includes handling for group shapes.

There are a few API breaks included, e.g. for XSLF comments, therefore I would
like to have a review.

What do you think about the PlaceholderDetails helper class?

[Bug 62092] Text not extracted from grouped text shapes in HSLF

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

--- Comment #3 from Javen O'Neal <on...@apache.org> ---
(In reply to Tim Allison from comment #0)
> When the user calls getTextParagraphs() on a slide, that should include the
> text from grouped textshapes, right?
That sounds correct to me.

[Bug 62092] Text not extracted from grouped text shapes in HSLF

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=62092

--- Comment #1 from Andreas Beeker <ki...@apache.org> ---
I'm currently working on an SL Common SlideShowExtractor, i.e. trying to
deprecate the old extractors and move getTextParagraphs there.
