Posted to dev@poi.apache.org by bu...@apache.org on 2017/07/09 10:54:28 UTC

[Bug 61267] New: Meta data of attached word file gets parsed. However, content of file is not parsed and is blank

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

            Bug ID: 61267
           Summary: Meta data of attached word file gets parsed. However,
                    content of file is not parsed and is blank
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: POI Overall
          Assignee: dev@poi.apache.org
          Reporter: gaurav.chd3@gmail.com
  Target Milestone: ---

Created attachment 35106
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35106&action=edit
Meta data of attached word file gets parsed. However, content of file is not
parsed and is blank

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

[Bug 61267] Meta data of attached word file gets parsed. However, content of file is not parsed and is blank

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

gaurav.chd3@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gaurav.chd3@gmail.com


[Bug 61267] Meta data of attached word file gets parsed. However, content of file is not parsed and is blank

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

gaurav.chd3@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All
           Severity|normal                      |major


[Bug 61267] Meta data of attached word file gets parsed. However, content of file is not parsed and is blank

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Javen O'Neal <on...@apache.org> ---
The file begins with the following bytes:
> 00000000  db a5 2d 00 31 40 09 04  00 00 00 00 2d 00 00 00  |..-.1@......-...|

It also has quite a bit of ASCII embedded in it. This doesn't look like an OLE2
Microsoft Word .doc file nor an OOXML Word .docx file. This looks more like a
Microsoft Write .wri file, though it has a different magic number.

> 00000180  09 4d 65 6d 62 65 72 20  6f 66 20 33 47 50 50 20  |.Member of 3GPP |
> 00000190  28 41 52 49 42 29 0d 0a  4d 72 2e 20 42 65 6e 6e  |(ARIB)..Mr. Benn|

Furthermore, I cannot open this file with Google Docs.

Are you sure this is a Microsoft Word file?
I wasn't able to find any common uses of this magic number.
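
For anyone who wants to inspect a suspicious file the same way, here is a
minimal sketch that prints the first 16 bytes of a file as hex. The class and
method names (HeaderDump, toHex, firstBytes) are made up for illustration and
are not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HeaderDump {
    /** Formats the first {@code length} bytes as space-separated hex pairs. */
    static String toHex(byte[] bytes, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative byte values print as two hex digits.
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    /** Reads up to 16 bytes from the start of the file and returns them as hex. */
    static String firstBytes(Path file) throws IOException {
        byte[] buf = new byte[16];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the stream ends.
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        return toHex(buf, n);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstBytes(Paths.get(args[0])));
    }
}
```

Run it with the path to the attachment as the only argument and compare the
leading bytes against the signatures you expect.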


[Bug 61267] Extract text from Microsoft Word 2.0 (pre-OLE2) document

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

--- Comment #4 from Nick Burch <ap...@gagravarr.org> ---
Tip for next time: run the Tika App jar in --detect mode to see if the file
magic is known. In this case, Tika knows it's application/msword2.

Pre-OLE2 Word 2 has two magics and Word 5 has one (at least that Tika knows
about). Do people think it's worth adding helpful exceptions in POI for those
too?


[Bug 61267] Meta data of attached word file gets parsed. However, content of file is not parsed and is blank

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

--- Comment #2 from Javen O'Neal <on...@apache.org> ---
Never mind. It looks like this claims to be a Word 2.0 file.

http://www.filesignatures.net/index.php?page=search&search=DBA52D00&mode=SIG
> DB A5 2D 00   Word 2.0 file, ASCII


[Bug 61267] Extract text from Microsoft Word 2.0 (pre-OLE2) document

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #5 from Dominik Stadler <do...@gmx.at> ---
In r1828176 we added detection for Word 2 files, which makes it easier to spot
that Apache POI does not support this type of file.

I think there are currently no plans to fully support this very old format.
Please reopen this with initial patches for review if you are interested in
this feature and can work on implementing and maintaining it.


[Bug 61267] Extract text from Microsoft Word 2.0 (pre-OLE2) document

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

Javen O'Neal <on...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Meta data of attached word  |Extract text from Microsoft
                   |file gets parsed. However,  |Word 2.0 (pre-OLE2)
                   |content of file is not      |document
                   |parsed and is blank         |
           Severity|major                       |enhancement

--- Comment #3 from Javen O'Neal <on...@apache.org> ---
There are several entry points into POI. We should decide which class should
be responsible for checking the first few bytes (magic number) of a file to
determine what file format it is (Tika style).

We could continue adding known magic numbers to o.a.p.poifs.HeaderBlock, but we
may want to reuse that code elsewhere, such as
WorkbookFactory/DocumentFactory/SlideshowFactory, the Extractor classes for
Tika, etc.
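
As a rough sketch of what a shared check could look like, the code below puts
the known signatures in one enum that any entry point can call. Everything
here (MagicSniffer, Format, detect, requireOle2) is a hypothetical name for
illustration, not existing POI code. The OLE2 and ZIP signatures are the
well-known ones; the Word 2.0 signature is the DB A5 2D 00 value from comment
#2. Only that one Word 2 magic is listed, since the second one is not given in
this thread.

```java
import java.util.Arrays;

public class MagicSniffer {
    /** Hypothetical list of formats keyed by their leading bytes. */
    enum Format {
        OLE2(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1),
        OOXML_ZIP(0x50, 0x4B, 0x03, 0x04),
        WORD2(0xDB, 0xA5, 0x2D, 0x00),
        UNKNOWN();

        final byte[] magic;

        Format(int... bytes) {
            magic = new byte[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                magic[i] = (byte) bytes[i];
            }
        }

        /** Returns the first format whose magic is a prefix of the header. */
        static Format detect(byte[] header) {
            for (Format f : values()) {
                // Skip UNKNOWN (no magic) and headers too short to compare.
                if (f.magic.length == 0 || header.length < f.magic.length) {
                    continue;
                }
                byte[] prefix = Arrays.copyOf(header, f.magic.length);
                if (Arrays.equals(f.magic, prefix)) {
                    return f;
                }
            }
            return UNKNOWN;
        }
    }

    /** Example entry-point check that fails early with a descriptive message. */
    static void requireOle2(byte[] header) {
        Format f = Format.detect(header);
        if (f == Format.WORD2) {
            throw new IllegalArgumentException(
                "The data appears to be a pre-OLE2 Word 2.0 document, "
                + "which is not supported");
        }
        if (f != Format.OLE2) {
            throw new IllegalArgumentException(
                "Not an OLE2 document, detected: " + f);
        }
    }
}
```

Keeping the table in one place means the factories and extractors would only
need to call detect() on the first few bytes, rather than each carrying its own
copy of the signatures.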
