Posted to dev@poi.apache.org by bu...@apache.org on 2017/07/27 21:49:09 UTC

[Bug 61354] New: Tika fails to get full HTML

https://bz.apache.org/bugzilla/show_bug.cgi?id=61354

            Bug ID: 61354
           Summary: Tika fails to get full HTML
           Product: POI
           Version: 3.17-dev
          Hardware: PC
            Status: NEW
          Severity: major
          Priority: P2
         Component: XWPF
          Assignee: dev@poi.apache.org
          Reporter: kramachandran@commvault.com
  Target Milestone: ---

Created attachment 35184
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35184&action=edit
MultipleBodyBug

Apache Tika fails to extract the full HTML if the Word document has multiple
bodies.  We only get the data from the first body.
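
A minimal sketch of the symptom, assuming "multiple bodies" means more than one
w:body element under w:document. This is not POI's or Tika's code: the class
MultiBodySketch, the method textOfBodies, and the SAMPLE XML are hypothetical,
and only the JDK's built-in DOM parser is used. Reading only the first body
drops the text held in the second one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MultiBodySketch {
    // WordprocessingML main namespace.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical, heavily simplified document.xml with two w:body elements.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\">"
        + "<w:body><w:p><w:r><w:t>first</w:t></w:r></w:p></w:body>"
        + "<w:body><w:p><w:r><w:t>second</w:t></w:r></w:p></w:body>"
        + "</w:document>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the *NS lookups below
        return factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Joins the text of all w:t elements, from the first body only or from every body.
    static String textOfBodies(String xml, boolean firstOnly) throws Exception {
        NodeList bodies = parse(xml).getElementsByTagNameNS(W_NS, "body");
        int limit = firstOnly ? Math.min(1, bodies.getLength()) : bodies.getLength();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < limit; i++) {
            NodeList texts = ((Element) bodies.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(texts.item(j).getTextContent());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOfBodies(SAMPLE, true));  // prints "first"
        System.out.println(textOfBodies(SAMPLE, false)); // prints "first second"
    }
}
```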

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61354] Tika fails to get full HTML

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354

PJ Fanning <fa...@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from PJ Fanning <fa...@yahoo.com> ---
Merged using https://svn.apache.org/repos/asf/poi/trunk@1803250



[Bug 61354] Tika fails to get full HTML

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354

Karthik Ramachandran <kr...@commvault.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All



[Bug 61354] Tika fails to get full HTML

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354

Karthik Ramachandran <kr...@commvault.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kramachandran@commvault.com

--- Comment #1 from Karthik Ramachandran <kr...@commvault.com> ---
Created attachment 35185
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35185&action=edit
Patch for reading all body



[Bug 61354] Tika fails to get full HTML

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354

--- Comment #3 from Tim Allison <ta...@mitre.org> ---
Karthik, thank you for sharing a patch and triggering document!  PJ, thank you
for fixing this so quickly!

As a side note, Tika's experimental SAX parser for docx does extract
everything; and this is exactly one of the reasons that I added it -- so that
if we don't account for structural rarities, we'll still get the text.  With
our DOM model, we're looking for some specific things in specific places (see
also TIKA-1130).

Make no mistake, we need to fix our DOM parser when people find problems, and
I'm grateful that you opened this!
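
The SAX point above can be illustrated with a minimal sketch. It is not Tika's
experimental SAX parser: the class SaxTextSketch, the method extractText, and
the SAMPLE XML are hypothetical, and the JDK's built-in SAX parser stands in
for the real thing. The handler collects every w:t text run wherever it occurs
in the stream, so text in a second w:body is picked up without any
assumption about where bodies may sit.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextSketch {
    // WordprocessingML main namespace.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical, heavily simplified document.xml with two w:body elements.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\">"
        + "<w:body><w:p><w:r><w:t>first</w:t></w:r></w:p></w:body>"
        + "<w:body><w:p><w:r><w:t>second</w:t></w:r></w:p></w:body>"
        + "</w:document>";

    // Streams through the XML and joins the content of every w:t element.
    static String extractText(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so uri and localName are populated
        factory.newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(SAMPLE)); // prints "first second"
    }
}
```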
