Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2012/10/19 15:38:11 UTC
[jira] [Created] (TIKA-1010) Embedded documents in RTF are not extracted
Michael McCandless created TIKA-1010:
----------------------------------------
Summary: Embedded documents in RTF are not extracted
Key: TIKA-1010
URL: https://issues.apache.org/jira/browse/TIKA-1010
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
When an RTF document embeds another document, the embedded object looks like this:
{noformat}
{\object\objemb
\objw628\objh765{\*\objclass Package}{\*\objdata 0105000002000000080000005061636b61676500000000000000000066000000
020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
54040000bbfaffffee0000000800540445050000
0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
{noformat}
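As a starting point, the hex text after \objdata can be turned back into raw bytes before any attempt to interpret it. The following is only a minimal sketch: the function name objdata_bytes is hypothetical, the sample is a shortened copy of the group above, and the regular expression assumes the payload is nothing but hex digits and whitespace up to the closing brace, which may not hold for every RTF producer.

```python
import re


def objdata_bytes(rtf_group: str) -> bytes:
    """Return the raw bytes encoded as hex text after \\objdata.

    Assumption: the payload consists only of hex digits and whitespace
    (line breaks) and ends at the first character that is neither.
    """
    match = re.search(r"\\objdata\s*([0-9a-fA-F\s]+)", rtf_group)
    if match is None:
        raise ValueError("no \\objdata payload found")
    # Drop the line breaks that the RTF writer inserted into the hex text.
    hex_text = "".join(match.group(1).split())
    return bytes.fromhex(hex_text)


# Shortened copy of the group above: only the first 20 bytes of the payload.
sample = (
    r"{\object\objemb" "\n"
    r"\objw628\objh765{\*\objclass Package}{\*\objdata 01050000020000000800000050" "\n"
    r"61636b61676500}}"
)

data = objdata_bytes(sample)
print(len(data), data[12:19])
```

A real RTF parser would tokenize the groups properly instead of using a regular expression; this only shows that the payload is plain hex text with embedded line breaks.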
Unfortunately, the RTF spec does not spell out the format of those hex
bytes. It merely says the bytes are saved by the OLESaveToStream
function, and I haven't been able to find a description of what the
bytes mean.
In this case the bytes are a "Package object" (\objclass Package), which
I think is an (old?) way to wrap any non-OLE file; here it is just a
.txt file.
Here's the hex dump:
{noformat}
00000000 01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b |............Pack|
00000010 61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 |age.........f...|
00000020 02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 |..Hw.txt.C:\DOCU|
00000030 4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b |ME~1\igalsh\Desk|
00000040 74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 |top\HW.txt....."|
00000050 00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 |...C:\DOCUME~1\i|
00000060 67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 |galsh\Desktop\HW|
00000070 2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 |.txt.....Hello W|
00000080 6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00 |orld............|
00000090 00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54 |..METAFILEPICT.T|
000000a0 04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45 |.............T.E|
000000b0 05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c |.........s......|
000000c0 00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05 |................|
000000d0 00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5 |.....2.)........|
000000e0 ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00 |................|
000000f0 00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07 |.Tahoma..U..p...|
00000100 00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66 |.X..wa..w @.wI6f|
00000110 83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00 |.....-..........|
00000120 00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00 |................|
00000130 00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00 |................|
00000140 00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21 |.....!...Hw.txt!|
00000150 00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00 |................|
00000160 00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65 |.........."Syste|
00000170 6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00 |m..I6f.....&....|
00000180 00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d |...............-|
00000190 01 01 00 03 00 00 00 00 00 |.........|
00000199
{noformat}
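One tentative reading of the first 0x86 bytes of the dump, inferred only from this single sample and not from any spec, is a series of little-endian 32-bit integers and length-prefixed strings, followed by a payload that holds NUL-terminated names and the file content. The sketch below follows that guess. All field names (version, format_id, label, and so on) are made-up labels, and the meaning of the 2-byte prefix and of the 4 bytes 00 00 03 00 is unknown.

```python
import struct

# First 0x86 bytes of the dump above, copied from the \objdata hex.
HEX = (
    "0105000002000000080000005061636b61676500000000000000000066000000"
    "020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b"
    "746f705c48572e747874000000030022000000433a5c444f43554d457e315c69"
    "67616c73685c4465736b746f705c48572e747874000b00000048656c6c6f2057"
    "6f726c640000"
)


class Reader:
    """Tiny cursor over a bytes object; all integers are little-endian."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        chunk = self.data[self.pos:self.pos + n]
        if len(chunk) != n:
            raise ValueError("unexpected end of data")
        self.pos += n
        return chunk

    def u16(self) -> int:
        return struct.unpack("<H", self.take(2))[0]

    def u32(self) -> int:
        return struct.unpack("<I", self.take(4))[0]

    def cstring(self) -> str:
        """Read a NUL-terminated string."""
        end = self.data.index(b"\x00", self.pos)
        text = self.data[self.pos:end].decode("latin-1")
        self.pos = end + 1
        return text

    def lp_string(self) -> str:
        """Read a u32 length, then that many bytes (trailing NUL stripped)."""
        return self.take(self.u32()).rstrip(b"\x00").decode("latin-1")


def parse_package(data: bytes) -> dict:
    """Parse the guessed layout; field names are hypothetical labels."""
    r = Reader(data)
    out = {
        "version": r.u32(),           # 01 05 00 00 in the dump
        "format_id": r.u32(),         # 02 00 00 00
        "class_name": r.lp_string(),  # length 8, "Package" plus NUL
        "topic": r.lp_string(),       # length 0 in this sample
        "item": r.lp_string(),        # length 0 in this sample
    }
    payload = Reader(r.take(r.u32()))        # 0x66 bytes follow
    payload.u16()                            # 02 00, meaning unknown
    out["label"] = payload.cstring()         # "Hw.txt"
    out["source_path"] = payload.cstring()   # first copy of the path
    payload.take(4)                          # 00 00 03 00, meaning unknown
    out["temp_path"] = payload.lp_string()   # second copy of the path
    out["content"] = payload.take(payload.u32())
    return out


fields = parse_package(bytes.fromhex(HEX))
print(fields["class_name"], fields["label"], fields["content"])
```

For this sample the guessed layout yields the class name "Package", the label "Hw.txt" and the 11 content bytes "Hello World". The bytes from offset 0x86 onward start with the same 01 05 00 00 pattern and a length-prefixed "METAFILEPICT" string, so they may be a second record of similar shape (perhaps the icon), but that is also a guess.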
Anyway, I have no idea how to decode the bytes at this point; I'm just
opening the issue in case anyone else does!