Posted to dev@poi.apache.org by bu...@apache.org on 2013/03/29 20:06:07 UTC

[Bug 54771] New: Read text from Cover Page, Table of Contents and Bibliography

https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

            Bug ID: 54771
           Summary: Read text from Cover Page, Table of Contents and
                    Bibliography
           Product: POI
           Version: 3.9-dev
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: XWPF
          Assignee: dev@poi.apache.org
          Reporter: vikas.garg@blackboard.com
    Classification: Unclassified

Currently, XWPFWordExtractor.getText() does not read text from the Cover Page,
Table of Contents, or Bibliography parts of a docx file. Are there any plans
to add support for extracting text from these parts? If so, will it be in the
next release? Or is there another API available to do this?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 54771] Read text from Cover Page, Table of Contents and Bibliography

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

--- Comment #3 from Tim Allison <ta...@mitre.org> ---
Created attachment 31704
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=31704&action=edit
rough draft of patch

Rough draft of patch attached.  I need to clean up a few things before I commit
(end of the week?).  All feedback welcome.  Thank you!



[Bug 54771] Read text from Cover Page, Table of Contents and Bibliography

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

--- Comment #4 from Nick Burch <ap...@gagravarr.org> ---
At first glance the patch looks promising.

Any chance you could also look at updating appendTableText in XWPFWordExtractor
with logic similar to that in your updated unit test?



[Bug 54771] Read text from Cover Page, Table of Contents and Bibliography

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

--- Comment #2 from Tim Allison <ta...@mitre.org> ---
Vladimir Glina just submitted test docs over on TIKA-1317.  This issue is
related to POI-54849, which got most SDTs but apparently didn't capture this
case.  I'll try to fix this soon.



[Bug 54771] Read text from Cover Page, Table of Contents and Bibliography

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

--- Comment #1 from Nick Burch <ap...@gagravarr.org> ---
Apache Tika might be a better bet: it uses Apache POI internally but pulls out
a richer set of text and styling.

Otherwise, please submit a patch to enhance XWPFWordExtractor if it isn't doing
everything required!



[Bug 54771] Read text from Cover Page, Table of Contents and Bibliography

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

--- Comment #5 from Tim Allison <ta...@mitre.org> ---
Thank you, Nick!

There's a slight difference between the test's extractSDTs and the way that
XWPFWordExtractor works.  The general goal is to return all text
recursively from an XWPFSDTCell's content object; this is what the extractor
calls.  The test recursively goes through all objects to gather the SDTs, so
that we can test the number of SDTs and the text within them.
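
To make the distinction above concrete, here is a minimal sketch of the two
recursive walks.  It deliberately uses hypothetical stand-in types (Node,
SdtSketch, allText, collectSdts) rather than POI's classes, so it says nothing
about the actual patch or the XWPF API; it only assumes that a document is a
tree in which some containers are SDTs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in types for illustration only; these are NOT POI classes.
final class SdtSketch {

    /** A text leaf (text != null) or a container of child nodes. */
    static final class Node {
        final String text;          // non-null only for text leaves
        final boolean isSdt;        // true when this container models an SDT
        final List<Node> children;  // empty for text leaves

        Node(String text, boolean isSdt, List<Node> children) {
            this.text = text;
            this.isSdt = isSdt;
            this.children = children;
        }
    }

    static Node text(String s) {
        return new Node(s, false, Collections.<Node>emptyList());
    }

    static Node sdt(Node... kids) {
        return new Node(null, true, Arrays.asList(kids));
    }

    static Node group(Node... kids) {
        return new Node(null, false, Arrays.asList(kids));
    }

    /** Extractor-style walk: concatenate all text found under a node. */
    static String allText(Node n) {
        StringBuilder out = new StringBuilder();
        appendText(n, out);
        return out.toString();
    }

    private static void appendText(Node n, StringBuilder out) {
        if (n.text != null) {
            out.append(n.text);
            return;
        }
        for (Node child : n.children) {
            appendText(child, out);
        }
    }

    /** Test-style walk: gather every SDT node, including nested ones. */
    static void collectSdts(Node n, List<Node> found) {
        if (n.isSdt) {
            found.add(n);
        }
        for (Node child : n.children) {
            collectSdts(child, found);
        }
    }

    public static void main(String[] args) {
        Node doc = group(
                text("Intro "),
                sdt(text("Cover "), sdt(text("Title"))),
                group(sdt(text(" Refs"))));

        System.out.println(allText(doc));          // all text, SDT or not

        List<Node> sdts = new ArrayList<>();
        collectSdts(doc, sdts);
        System.out.println(sdts.size());           // number of SDTs found
        System.out.println(allText(sdts.get(0)));  // text within the first SDT
    }
}
```

In this sketch the extractor-style walk only needs the text, so it never
records which container a string came from, while the test-style walk keeps
the SDT nodes themselves so that both their count and their individual text
can be checked.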



[Bug 54771] Read text from Cover Page, Table of Contents and Bibliography

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54771

Tim Allison <ta...@mitre.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Tim Allison <ta...@mitre.org> ---
Fixed in r1602960.

Thank you, Vikas, for submitting this issue.

Thank you, Vladimir, for submitting test docs on TIKA-1317.

Thank you, Nick, for your review.
