Posted to dev@poi.apache.org by bu...@apache.org on 2017/07/09 10:50:54 UTC

[Bug 61266] New: File not parsing

https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

            Bug ID: 61266
           Summary: File not parsing
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: critical
          Priority: P2
         Component: POI Overall
          Assignee: dev@poi.apache.org
          Reporter: gaurav.chd3@gmail.com
  Target Milestone: ---

Created attachment 35105
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35105&action=edit
DOC file

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@547eb45
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:357)
        at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:308)
        at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
        at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
        at javax.swing.TransferHandler.importData(Unknown Source)
        at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
        at java.awt.dnd.DropTarget.drop(Unknown Source)
        at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
        at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
        at java.awt.Component.dispatchEventImpl(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
        at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Window.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
        at java.awt.EventQueue.access$500(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue$4.run(Unknown Source)
        at java.awt.EventQueue$4.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
        at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:181)
        at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:124)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 43 more

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Javen O'Neal <on...@apache.org> ---
Same comment as on bug 61265 and bug 61257: please provide a more descriptive
bug title and include the version of POI that you're using.

You can remove the javax.swing, java.awt, and sun frames from the stack trace.



[Bug 61266] Extract text from Microsoft Write document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Extract text from old Word  |Extract text from Microsoft
                   |file from 1991              |Write document

--- Comment #6 from Javen O'Neal <on...@apache.org> ---
The provided file starts with the bytes 0x31 0xBE, so it is presumably a
Microsoft Write file. Such files are typically found with a .wri extension,
though later also saved with a .doc extension and *optionally* in an OLE2
container (this file isn't). The format dates back to the Windows 1.0 days
(1985).

http://www.filesignatures.net/index.php?page=search&search=31BE&mode=SIG
https://en.wikipedia.org/wiki/Microsoft_Write

Strictly speaking, Write is not part of the Microsoft Office suite.
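
For anyone who wants to pre-screen files before handing them to POI, here is a
minimal sketch of the leading-bytes check described above. LeadingBytesSniffer
and its labels are hypothetical names, not part of POI or Tika. The Write
prefix used (31 BE 00 00) is simply what the exception in this bug reports for
the attached file; treat it as an assumption rather than a full description of
the format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not POI/Tika API): guesses a file type from its leading bytes. */
final class LeadingBytesSniffer {
    // OLE2 compound document magic, in on-disk order.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Assumption: leading bytes of the attached file, taken from the exception message.
    private static final byte[] WRITE = { 0x31, (byte) 0xBE, 0x00, 0x00 };

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    static String classify(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "OLE2";
        }
        if (startsWith(header, WRITE)) {
            return "Microsoft Write (presumed)";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LeadingBytesSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        System.out.println(classify(Arrays.copyOf(header, filled)));
    }
}
```

A check like this only tells you which parser not to try; it does not make the
file readable.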



[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

gaurav.chd3@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All
                 CC|                            |gaurav.chd3@gmail.com



[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

--- Comment #4 from Javen O'Neal <on...@apache.org> ---
If this is a 1991 Word file, then perhaps HWPFOldDocument (for Word 6 and Word
95) should be used instead of HWPFDocument (Word 97 and later). It's possible
that this file format predates Word 6.

I'm not sure whether POI or Tika should be selecting a different file handler
here; it's also possible that POI (and therefore Tika) can't read this ancient
format at all.

The o.a.p.poifs.storage.HeaderBlock constructor recognizes that this file is
not a BIFF2, 3, or 4 document.



[Bug 61266] Extract text from Microsoft Write document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #8 from Dominik Stadler <do...@gmx.at> ---
I don't think we will pursue full support for such ancient file formats. It is
better to convert them to something newer, since fewer and fewer tools are
likely to handle these files in the future.

Detection was improved, so we now at least report that we found a Write
document that we cannot read.



[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |normal
          Component|POI Overall                 |POIFS

--- Comment #2 from Javen O'Neal <on...@apache.org> ---
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Google Docs reported your file as corrupt as well. Are you sure this is a valid
doc file and not encrypted?
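
To compare the two values in that message with a hex dump of the file, it helps
to put them back into on-disk byte order. The sketch below assumes the message
prints the first eight bytes of the file read as a little-endian long, which
is consistent with the expected value matching the well-known OLE2 magic
D0 CF 11 E0 A1 B1 1A E1. SignatureBytes is a hypothetical helper, not a POI
class.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: converts a little-endian signature value back to file byte order. */
final class SignatureBytes {
    /** Returns the 8 bytes, in file order, that a little-endian read turns into {@code value}. */
    static byte[] toFileOrder(long value) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the exception message in this bug.
        System.out.println("read:     " + hex(toFileOrder(0x0000AB000000BE31L)));
        System.out.println("expected: " + hex(toFileOrder(0xE11AB1A1E011CFD0L)));
    }
}
```

Under that assumption, the file begins with 31 BE 00 00 00 AB 00 00 rather than
the OLE2 magic.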



[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

gaurav.chd3@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #3 from gaurav.chd3@gmail.com ---
This is a valid DOC file. It is an old file (from 1991). When we open this
file (with the text encoding set to the Windows default) and resave it in docx
format, the resulting docx file is parsed successfully.



[Bug 61266] Extract text from old Word file from 1991

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|File not parsing            |Extract text from old Word
                   |                            |file from 1991



[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

gaurav.chd3@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |major



[Bug 61266] Extract text from Microsoft Write document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

--- Comment #7 from Nick Burch <ap...@gagravarr.org> ---
I've added a more helpful exception for these files in r1801376, based on
Apache Tika's mime magic for them.



[Bug 61266] File not parsing

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |enhancement

--- Comment #5 from Javen O'Neal <on...@apache.org> ---
Looks like POI doesn't currently support reading this file format.

Opening the binary file in a text editor reveals that most of the document
contents are saved as ASCII, with a few special characters to embed figures and
designate the start of sections. This doesn't look like any OLE2 file I have
seen before.

Presumably if all that is needed is text extraction, you could use `strings` on
this document.

Changing this to an enhancement request in case someone is interested in
figuring out what archaic file format this is and writing a primitive parser
that can extract text from the document.
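
If the `strings` tool isn't available, roughly the same idea can be sketched in
Java: scan the raw bytes and keep runs of printable ASCII above a minimum
length. AsciiRuns is a hypothetical name, the minimum of 4 is an arbitrary
choice, and this is not a parser for the format; it will also pick up any
binary noise that happens to look like text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects runs of printable ASCII from raw bytes. */
final class AsciiRuns {
    static boolean isPrintable(byte b) {
        // Printable ASCII plus tab; bytes >= 0x80 are negative in Java and fail this test.
        return (b >= 0x20 && b <= 0x7E) || b == '\t';
    }

    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if not in a run
        // Iterate one past the end so a run that reaches EOF is still flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i]);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    runs.add(new String(data, start, i - start, StandardCharsets.US_ASCII));
                }
                start = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: AsciiRuns <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String run : extract(data, 4)) {
            System.out.println(run);
        }
    }
}
```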
