Posted to dev@poi.apache.org by bu...@apache.org on 2018/02/09 18:35:31 UTC
[Bug 62092] New: Text not extracted from grouped text shapes in HSLF
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
Bug ID: 62092
Summary: Text not extracted from grouped text shapes in HSLF
Product: POI
Version: 3.17-FINAL
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: HSLF
Assignee: dev@poi.apache.org
Reporter: tallison@mitre.org
Target Milestone: ---
On TIKA-2569, a user reported that we aren't extracting text from grouped
text shapes in HSLF, although everything works in pptx. I added a workaround
at the Tika level for now.
Test file:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPPT_groups.ppt
Unit test at the Tika level:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java#L300
When the user calls getTextParagraphs() on a slide, that should include the
text from grouped textshapes, right?
If not, and the current behavior is intended so that the user has to walk
through HSLFGroupShapes, we can close this out.
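For illustration only: if the walk is left to the caller, the traversal might look roughly like the sketch below. It uses hypothetical stand-in types (Shape, TextBox, Group, Picture) rather than the real HSLF classes, so it runs without POI on the classpath. Mapping them onto HSLFTextShape and HSLFGroupShape is an assumption, not something checked against 3.17.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupTextWalk {
    /** Hypothetical stand-in for a shape on a slide (not a POI type). */
    interface Shape {}

    /** Hypothetical stand-in for a text-bearing shape. */
    static final class TextBox implements Shape {
        final String text;
        TextBox(String text) { this.text = text; }
    }

    /** Hypothetical stand-in for a shape that carries no text. */
    static final class Picture implements Shape {}

    /** Hypothetical stand-in for a group shape holding child shapes. */
    static final class Group implements Shape {
        final List<Shape> children;
        Group(Shape... children) { this.children = Arrays.asList(children); }
    }

    /** Collects text from top-level shapes and from shapes nested in groups. */
    static List<String> collectText(List<? extends Shape> shapes) {
        List<String> out = new ArrayList<>();
        collectInto(shapes, out);
        return out;
    }

    private static void collectInto(List<? extends Shape> shapes, List<String> out) {
        for (Shape s : shapes) {
            if (s instanceof TextBox) {
                out.add(((TextBox) s).text);
            } else if (s instanceof Group) {
                // Recurse so that groups inside groups are covered as well.
                collectInto(((Group) s).children, out);
            }
            // Any other shape type is skipped.
        }
    }

    public static void main(String[] args) {
        List<Shape> slide = Arrays.asList(
                new TextBox("Title"),
                new Group(new TextBox("Grouped A"),
                          new Group(new TextBox("Nested B")),
                          new Picture()),
                new TextBox("Footer"));
        System.out.println(collectText(slide));
        // prints [Title, Grouped A, Nested B, Footer]
    }
}
```

A loop over only the top-level shapes would return just "Title" and "Footer" here, which matches the symptom reported on TIKA-2569.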
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
[Bug 62092] Text not extracted from grouped text shapes in HSLF
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
Javen O'Neal <on...@apache.org> changed:
What                        |Removed     |Added
----------------------------------------------------------------------------
Attachment #35820 mime type |text/plain  |application/tar+gzip
Attachment #35820 is patch  |1           |0
[Bug 62092] Text not extracted from grouped text shapes in HSLF
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
--- Comment #4 from Andreas Beeker <ki...@apache.org> ---
X/HSLF/Slide.getTextParagraphs() doesn't return HeaderFooters consistently
across the various formats, i.e. PPT<2007 has a special HeaderFooters record
which can't be easily wrapped into a TextParagraph. I guess this method only
existed to give Tika easy access anyway, and that's what the extractor class
is for.
How about delegating to the extractor (and changing the signature) or removing
it?
[Bug 62092] Text not extracted from grouped text shapes in HSLF
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
Andreas Beeker <ki...@apache.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Hardware|PC |All
Status|NEW |RESOLVED
--- Comment #5 from Andreas Beeker <ki...@apache.org> ---
Provided common slideshow extractor via r1829453
[Bug 62092] Text not extracted from grouped text shapes in HSLF
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
--- Comment #2 from Andreas Beeker <ki...@apache.org> ---
Created attachment 35820
--> https://bz.apache.org/bugzilla/attachment.cgi?id=35820&action=edit
SL Common SlideShowExtractor incl. fix for GroupShapes
The patch contains the SL Common SlideShowExtractor, which made quite a few
changes necessary to handle Placeholders and header/footer information across
X/HSLF.
This also includes handling for group shapes.
There are a few API breaks included, e.g. for XSLF comments, therefore I would
like to have a review.
What do you think about the PlaceholderDetails helper class?
[Bug 62092] Text not extracted from grouped text shapes in HSLF
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
--- Comment #3 from Javen O'Neal <on...@apache.org> ---
(In reply to Tim Allison from comment #0)
> When the user calls getTextParagraphs() on a slide, that should include the
> text from grouped textshapes, right?
That sounds correct to me.
[Bug 62092] Text not extracted from grouped text shapes in HSLF
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=62092
--- Comment #1 from Andreas Beeker <ki...@apache.org> ---
I'm currently working on an SL Common SlideShowExtractor, i.e. trying to
deprecate the old extractors and move getTextParagraphs there.