Posted to dev@poi.apache.org by bu...@apache.org on 2017/07/27 21:49:09 UTC
[Bug 61354] New: Tika fails to get full HTML
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354
Bug ID: 61354
Summary: Tika fails to get full HTML
Product: POI
Version: 3.17-dev
Hardware: PC
Status: NEW
Severity: major
Priority: P2
Component: XWPF
Assignee: dev@poi.apache.org
Reporter: kramachandran@commvault.com
Target Milestone: ---
Created attachment 35184
--> https://bz.apache.org/bugzilla/attachment.cgi?id=35184&action=edit
MultipleBodyBug
Apache Tika fails to get the full HTML if the Word document has multiple
bodies. We only get the data from the first body.
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
[Bug 61354] Tika fails to get full HTML
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354
PJ Fanning <fa...@yahoo.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |FIXED
--- Comment #2 from PJ Fanning <fa...@yahoo.com> ---
Merged using https://svn.apache.org/repos/asf/poi/trunk@1803250
[Bug 61354] Tika fails to get full HTML
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354
Karthik Ramachandran <kr...@commvault.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
OS| |All
[Bug 61354] Tika fails to get full HTML
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354
Karthik Ramachandran <kr...@commvault.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |kramachandran@commvault.com
--- Comment #1 from Karthik Ramachandran <kr...@commvault.com> ---
Created attachment 35185
--> https://bz.apache.org/bugzilla/attachment.cgi?id=35185&action=edit
Patch for reading all body
[Bug 61354] Tika fails to get full HTML
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354
--- Comment #3 from Tim Allison <ta...@mitre.org> ---
Karthik, Thank you for sharing a patch and triggering document! PJ, thank you
for fixing this so quickly!
As a side note, Tika's experimental SAX parser for docx does extract
everything; and this is exactly one of the reasons that I added it -- so that
if we don't account for structural rarities, we'll still get the text. With
our DOM model, we're looking for some specific things in specific places (see
also TIKA-1130).
Make no mistake, we need to fix our DOM parser when people find problems, and
I'm grateful that you opened this!
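
The difference between the two approaches can be illustrated with a minimal sketch. It uses only the JDK's XML parsers, not the actual POI or Tika code, and the class name MultiBodySketch, its methods, and the inline XML are all hypothetical: the XML is a simplified stand-in for word/document.xml with two w:body elements. A lookup that only inspects the first body misses the second one, while a streaming handler that collects every w:t element picks up all of the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not the POI or Tika implementation.
public class MultiBodySketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified stand-in for word/document.xml containing two w:body elements.
    static final String XML =
        "<w:document xmlns:w=\"" + W_NS + "\">"
      + "<w:body><w:p><w:r><w:t>first</w:t></w:r></w:p></w:body>"
      + "<w:body><w:p><w:r><w:t>second</w:t></w:r></w:p></w:body>"
      + "</w:document>";

    // DOM-style: look for text only inside the first w:body.
    static List<String> firstBodyOnly(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Element root = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        NodeList bodies = root.getElementsByTagNameNS(W_NS, "body");
        List<String> out = new ArrayList<>();
        if (bodies.getLength() > 0) {
            Element first = (Element) bodies.item(0);
            NodeList texts = first.getElementsByTagNameNS(W_NS, "t");
            for (int i = 0; i < texts.getLength(); i++) {
                out.add(texts.item(i).getTextContent());
            }
        }
        return out;
    }

    // SAX-style: collect every w:t element, wherever it appears.
    static List<String> allText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null while inside a w:t

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    out.add(current.toString());
                    current = null;
                }
            }
        };
        factory.newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstBodyOnly(XML)); // prints [first]
        System.out.println(allText(XML));       // prints [first, second]
    }
}
```

The streaming version is more forgiving because it makes no assumption about where the text elements sit in the tree. The trade-off is that it knows less about document structure than a DOM walk does.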