Posted to user@poi.apache.org by Dmitry Goldenberg <DG...@attivio.com> on 2008/08/28 18:43:05 UTC

How to extract embedded files from Office 07

Folks,

I've noticed that some embeddings in '07 are represented as .bin files which they store in the /embeddings subdirectory within the doc structure. For example, I embedded a zip file and it showed up as oleObject1.bin.

Does anyone have any idea as to how to read this type of file or convert it back to the original zip?

Thanks.

RE: How to extract embedded files from Office 07

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 28 Aug 2008, Dmitry Goldenberg wrote:
> You were right, the .bin file in /embeddings is Ole and can be read with 
> POIFS.

It looks like there are three entries within the POIFS file system:
   Ole <(0x01)Ole>
   CompObj <(0x01)CompObj>
   Ole10Native <(0x01)Ole10Native>

> The gotcha is, there's currently no API to extract the file out of the 
> Ole structures within POIFS.

It should be a five minute job - grab the poifs entry, get the bytes, and 
write them to a FileOutputStream. Probably 15-20 minutes including unit 
tests and overloaded methods :)


The slight snag will be that the Ole10Native entry isn't quite what you 
want. It contains the file name, the absolute file name, a little bit more 
bumpf, then your real file data after that. A little bit of work will be 
needed to figure out how to tell where the real file data starts, but then 
you'd be away!
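
A minimal, JDK-only sketch of one piece of that work: reading consecutive zero-terminated strings (such as the file name and the absolute file name) out of the raw bytes of the Ole10Native entry. The class name ZeroTerminatedStrings is hypothetical, the offset where the strings begin is left to the caller because the header layout is not established here, and the ISO-8859-1 decoding is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI): reads zero-terminated strings from a byte array. */
final class ZeroTerminatedStrings {
    private ZeroTerminatedStrings() {}

    /** Returns the index of the next 0x00 byte at or after 'from', or -1 if there is none. */
    static int indexOfTerminator(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Reads up to 'count' consecutive zero-terminated strings starting at 'offset'.
     * Stops early if a terminator is missing. ISO-8859-1 decoding is an assumption.
     */
    static List<String> read(byte[] data, int offset, int count) {
        List<String> result = new ArrayList<>();
        int pos = Math.max(offset, 0);
        for (int i = 0; i < count; i++) {
            int end = indexOfTerminator(data, pos);
            if (end < 0) {
                break;
            }
            result.add(new String(data, pos, end - pos, StandardCharsets.ISO_8859_1));
            pos = end + 1; // step over the terminator
        }
        return result;
    }

    /** Returns the offset just past 'count' zero-terminated strings, or -1 if one is unterminated. */
    static int skip(byte[] data, int offset, int count) {
        int pos = Math.max(offset, 0);
        for (int i = 0; i < count; i++) {
            int end = indexOfTerminator(data, pos);
            if (end < 0) {
                return -1;
            }
            pos = end + 1;
        }
        return pos;
    }
}
```

The value returned by skip() is only where the strings end; whatever sits between that point and the real file data still has to be worked out from actual sample files.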

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


RE: How to extract embedded files from Office 07

Posted by Dmitry Goldenberg <DG...@attivio.com>.
Rainer,
You were right, the .bin file in /embeddings is Ole and can be read with POIFS.

The gotcha is, there's currently no API to extract the file out of the Ole structures within POIFS.

HSLF has an API to enumerate Ole objects within slides. But what I need is a generic API that would let me do the following:

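// Wished-for API: getEmbeddings(), Embedding and Utils.getCleanFileName() are placeholders.
List<Embedding> embeddings = poifs.getEmbeddings();
for (Embedding embedding : embeddings) {
    System.out.println(">> Embedding: " + embedding.getName());
    // FileOutputStream has no (directory, name) constructor, so build the target File first.
    File target = new File(outputDir, Utils.getCleanFileName(embedding.getName()));
    embedding.extractTo(new FileOutputStream(target));
}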

getEmbeddings() could be getOleObjects() or whatever, but that's the gist of it.

- Dmitry


RE: How to extract embedded files from Office 07

Posted by Dmitry Goldenberg <DG...@attivio.com>.
Would it be good to devise a generic _Ole10Native reader that does the following?

List<String> magicStrings = getMagicStrings();
for (String ms : magicStrings) {
   if (found ms within the first N bytes) {
      // this must be a file of type <X>
      // read from ms onward, extract to disk
      break;
   }
}
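
A rough JDK-only sketch of that idea, under stated assumptions: the class MagicScanner, its Match type, and the small signature table are all hypothetical, and the table is far from complete. Unlike the loop above, it walks the offsets first and reports the earliest signature found, which is a design choice of this sketch rather than anything settled in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: looks for a known file signature near the start of a byte array. */
final class MagicScanner {
    private MagicScanner() {}

    /** Result of a scan: a type label and the offset where its signature starts. */
    static final class Match {
        final String type;
        final int offset;

        Match(String type, int offset) {
            this.type = type;
            this.offset = offset;
        }
    }

    // Small, non-exhaustive signature table (an assumption for illustration).
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("zip", new byte[] {'P', 'K', 3, 4});
        SIGNATURES.put("pdf", new byte[] {'%', 'P', 'D', 'F'});
        SIGNATURES.put("png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        SIGNATURES.put("ole2", new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                           (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1});
    }

    /** Returns the earliest signature starting within the first 'limit' bytes, or null. */
    static Match scan(byte[] data, int limit) {
        int max = Math.min(limit, data.length);
        for (int pos = 0; pos < max; pos++) {
            for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
                if (startsWith(data, pos, entry.getValue())) {
                    return new Match(entry.getKey(), pos);
                }
            }
        }
        return null;
    }

    /** True if 'signature' occurs in 'data' exactly at position 'pos'. */
    private static boolean startsWith(byte[] data, int pos, byte[] signature) {
        if (pos + signature.length > data.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[pos + i] != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A short signature such as "PK" can also turn up by chance inside the file name or path strings that precede the data, so a hit from a scan like this is a hint, not proof.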


Re: How to extract embedded files from Office 07

Posted by Rainer Schwarze <rs...@admadic.de>.
Dmitry Goldenberg wrote:
> Yegor,
> 
> The first 8 bytes contain the standard MS Office magic number stuff - d0 cf 11 e0 a1 b1 1a e1.
> 
> Seems like they compress data in a proprietary way. I've read one post where someone recommended the .NET Packaging API to crack these ...  Not a good option ...

Hi Dmitry,

this may be interesting (unless you already found it):

http://www.nabble.com/Can-POIFS-convert-PDF-to-OLE-td18568081.html


Looking at such things I suspect this:

The data is inside "Ole10Native". This could be extracted using POIFS. 
The structures there look like this:

[4 bytes] = size of structure including data
[???] a few flags and strings (zero terminated)
[4 bytes] = size of actually embedded binary data
[???] = the actual binary data

If you know that it is a ZIP file, you could search for the byte sequence
[size]"PK", where the expected [size] depends on the search position. If
you start immediately after the first 4 bytes (the total length), the
expected size value is length-4. Step forward by one byte and check for
the sequence with size set to length-5, and so on. When the 6 bytes match
the expected [size]PK sequence, you can be fairly sure that "PK" marks
the start of the ZIP file and [size] is its size.
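
The search described above could be sketched like this with the JDK alone. The class Ole10NativeZipFinder is hypothetical, and the code assumes little-endian 4-byte fields and that the leading length counts every byte after itself; both are guesses in line with the suspected layout, not a verified format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical helper implementing the [size]"PK" search heuristic. */
final class Ole10NativeZipFinder {
    private Ole10NativeZipFinder() {}

    /**
     * Returns the offset of the first data byte ('P') if a matching
     * [size]"PK" sequence is found, or -1 otherwise.
     */
    static int findZipStart(byte[] stream) {
        if (stream.length < 10) {
            return -1; // too short for length + size + "PK"
        }
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int total = buf.getInt(0);
        for (int pos = 4; pos + 6 <= stream.length; pos++) {
            // A size field at 'pos' would describe (total - pos) bytes of data:
            // length-4 at position 4, length-5 at position 5, and so on.
            int expectedSize = total - pos;
            if (expectedSize < 2) {
                break; // no room left for even the "PK" marker
            }
            if (buf.getInt(pos) == expectedSize
                    && stream[pos + 4] == 'P'
                    && stream[pos + 5] == 'K') {
                return pos + 4;
            }
        }
        return -1;
    }

    /** Copies the embedded data out of the stream, or returns null if nothing matched. */
    static byte[] extractZip(byte[] stream) {
        int start = findZipStart(stream);
        if (start < 0) {
            return null;
        }
        int size = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN).getInt(start - 4);
        // Clamp to the array in case the stream is shorter than the size field claims.
        long end = Math.min((long) stream.length, (long) start + size);
        return Arrays.copyOfRange(stream, start, (int) end);
    }
}
```

If the stream carries extra bytes after the embedded data, the size check will never match and the method returns -1, which is the point at which analysing the real structure becomes unavoidable.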

Of course nothing beats the analysis of the actual binary data structure 
:-) (Would this be worth the effort for your purpose?)

Best wishes, Rainer
-- 


RE: How to extract embedded files from Office 07

Posted by Dmitry Goldenberg <DG...@attivio.com>.
Yegor,

The first 8 bytes contain the standard MS Office magic number stuff - d0 cf 11 e0 a1 b1 1a e1.
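
For reference, checking for that signature needs nothing beyond the JDK. This is only a sketch; the class name Ole2Signature is made up for illustration, and the caller is assumed to have read at least the first 8 bytes of the .bin file into the array.

```java
import java.util.Arrays;

/** Hypothetical helper: tests whether a byte array starts with the OLE2 magic number. */
final class Ole2Signature {
    private Ole2Signature() {}

    // d0 cf 11 e0 a1 b1 1a e1, as quoted above
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if 'header' is at least 8 bytes long and begins with the magic number. */
    static boolean matches(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }
}
```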

Seems like they compress data in a proprietary way. I've read one post where someone recommended the .NET Packaging API to crack these ...  Not a good option ...

- Dmitry


Re: How to extract embedded files from Office 07

Posted by Yegor Kozlov <ye...@dinom.ru>.
The first 4 bytes may contain the length of the uncompressed data. That's how OLE data is stored in the binary formats.

Yegor
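
A small sketch of how that could be checked with the JDK alone. The class name LengthPrefix is hypothetical, and treating the prefix as an unsigned little-endian value that counts the bytes following it is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: interprets the first 4 bytes of a byte array as a length prefix. */
final class LengthPrefix {
    private LengthPrefix() {}

    /** Reads the first 4 bytes as an unsigned little-endian integer, or returns -1 if too short. */
    static long read(byte[] data) {
        if (data.length < 4) {
            return -1;
        }
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(0) & 0xFFFFFFFFL;
    }

    /** True if the prefix does not claim more bytes than actually follow it. */
    static boolean isPlausible(byte[] data) {
        long length = read(data);
        return length >= 0 && length <= data.length - 4;
    }
}
```

A prefix that claims far more bytes than the file holds is a sign that the first 4 bytes are something else, such as the start of a magic number.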


RE: How to extract embedded files from Office 07

Posted by Dmitry Goldenberg <DG...@attivio.com>.
Tried reading the .bin file with 7-Zip-JBinding, no go.  It wasn't recognized as any of

ARJ BZIP_2 CAB CHM CPIO CDEB ISO LZH NSIS RAR SPLIT TAR Z ZIP

Egads.


RE: How to extract embedded files from Office 07

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 28 Aug 2008, Dmitry Goldenberg wrote:
> 1. I got two .bin files, oleObject1 and oleObject2.
> 2. the UNIX file utility spits out "Microsoft Office Document"

Can you try running org.apache.poi.poifs.dev.POIFSLister against these two 
files? I'm wondering if they've wrapped your original files up as an ole2 
document.

Nick


RE: How to extract embedded files from Office 07

Posted by Dmitry Goldenberg <DG...@attivio.com>.
1. I got two .bin files, oleObject1 and oleObject2.
2. the UNIX file utility spits out "Microsoft Office Document"
3. the magic number on .bin is d0 cf 11 e0 a1 b1 1a e1 which explains #2.
4. this seems to be the BIN/ISO format. Would I be able to read it with something like 7-Zip-JBinding perhaps? Is there an easier way to decompress the file/extract contents?

Thanks.
- Dmitry


Re: How to extract embedded files from Office 07

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 28 Aug 2008, Dmitry Goldenberg wrote:
> I've noticed that some embeddings in '07 are represented as .bin files 
> which they store in the /embeddings subdirectory within the doc 
> structure. For example, I embedded a zip file and it showed up as 
> oleObject1.bin.
>
> Does anyone have any idea as to how to read this type of file or convert 
> it back to the original zip?

Two things I'd suggest trying:
* if you embed two zip files, do you get oleObject1 and oleObject2, or
   just a bigger oleObject1 ?
* if you unzip the parent ooxml file, and run the unix "file" utility
   against oleObject1, what does it say the file is?

Nick
