Posted to user@poi.apache.org by Yury Batrakov <ba...@gmail.com> on 2008/05/07 08:48:50 UTC

Save embedded files to disk

Hi all.

I'm working on saving embedded documents (not only Office ones, but
also zip, pdf, etc.) to disk and have completely lost hope of doing it.
What I have tried:
1. Listening for events on POIFS, trying to fetch the appropriate streams
and save them via DocumentInputStream. Problem: the stream names are
generated in a way that is unpredictable (to me). Examples:
\001Ole10Native, \001CompObj, CONTENT and so on.
2. Calling HSSFWorkbook.getAllEmbeddedObjects(). Problem: it lacks
functions to save these objects to disk and it's not obvious to me how
to implement them. Such a function is defined for HSSF and (AFAIU) for
HSLF, but I'd like it for HWPF and HDGF as well.

Could you advise me what to do?

BTW, a little off-topic: the RTF spec says that objects embedded in RTF
are stored in OLESaveToStream format. Does POI have code to parse this
format?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Re[2]: Save embedded files to disk

Posted by Yury Batrakov <ba...@gmail.com>.
On 5/7/08, Yegor Kozlov <ye...@dinom.ru> wrote:
>  I'm not sure about ZIP, text or other formats. I searched the spec
>  but didn't find a clue. My advice is to look at CompObj. Every embedded
>  entry seems to have it. Try to parse this data; maybe you will figure out the pattern.

The spec is as messy as the format is :)  Thanks, I'll try to parse CompObj



Re[2]: Save embedded files to disk

Posted by Yegor Kozlov <ye...@dinom.ru>.
> But what about getAllEmbeddedObjects()? Does it parse object streams
> or does it rely on OLE stream names? Would it be a big effort to write
> functions to dump these objects to disk?

Embedded OLE data is stored in a host-defined format. This means that
the structure of the directory entries can differ depending on the
data.
For example:
 XLS:
     CompObj
     DocumentSummaryInformation
     SummaryInformation
     Workbook
 DOC:
     CompObj
     DocumentSummaryInformation
     SummaryInformation
     WordDocument
 ZIP:
     CompObj
     ObjInfo
     Ole
     Ole10Native
 TEXT:
     ???
 VISIO:
     ???
 PDF:
     ???
.....

If the embedded entry is an xls, ppt or doc, you can save it with the
appropriate extension and it will be a "real" office document. You
should be able to open it in MS Office or with POI.

I'm not sure about ZIP, text or other formats. I searched the spec
but didn't find a clue. My advice is to look at CompObj. Every embedded
entry seems to have it. Try to parse this data; maybe you will figure out the pattern.
     
Yegor
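
Yegor's listing suggests a rough way to tell the cases apart: look at which marker stream a storage contains. The following is only a minimal sketch in plain Java, assuming the stream names of one storage have already been collected. The class and method names are hypothetical, and only the marker streams named in this thread are covered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Guesses what an embedded OLE storage holds from the names of its streams. */
final class EmbeddedKindGuess {

    /** Drops the leading control character some stream names carry, e.g. "\001CompObj". */
    static String normalize(String streamName) {
        if (!streamName.isEmpty() && streamName.charAt(0) < 0x20) {
            return streamName.substring(1);
        }
        return streamName;
    }

    /** Returns "xls", "doc", "packaged" (Ole10Native present) or "unknown". */
    static String guess(Iterable<String> streamNames) {
        Set<String> names = new HashSet<>();
        for (String name : streamNames) {
            names.add(normalize(name));
        }
        if (names.contains("Workbook")) return "xls";
        if (names.contains("WordDocument")) return "doc";
        // Assumption: an Ole10Native stream means a non-Office file was packaged.
        if (names.contains("Ole10Native")) return "packaged";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList("CompObj", "ObjInfo", "Ole", "Ole10Native")));
        System.out.println(guess(Arrays.asList("\001CompObj", "SummaryInformation", "Workbook")));
    }
}
```

This only classifies a storage; it does not say where the payload starts inside a stream such as Ole10Native.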



Re: Save embedded files to disk

Posted by Yury Batrakov <ba...@gmail.com>.
But what about getAllEmbeddedObjects()? Does it parse object streams
or does it rely on OLE stream names? Would it be a big effort to write
functions to dump these objects to disk?

On 5/7/08, Nick Burch <ni...@torchbox.com> wrote:
> On Wed, 7 May 2008, Yury Batrakov wrote:
>
> > It's quite obvious and I can already save office files, but the problem
> > is saving zip and other files: I don't know which of the streams in
> > _1271662200 I should open and save. Word, Excel and others have
> > predefined stream names, but all the others don't.
> >
>
>  For zip, it should be easy. Open each one in turn, check the first few bytes
>  and see if they are the zip header. If so, save that.
>
>  For text, I guess just open it and see if it looks text-like (mostly just
>  bytes in the right ranges).
>
>
>  Nick


Re: Save embedded files to disk

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 7 May 2008, Yury Batrakov wrote:
> It's quite obvious and I can already save office files, but the problem
> is saving zip and other files: I don't know which of the streams in
> _1271662200 I should open and save. Word, Excel and others have
> predefined stream names, but all the others don't.

For zip, it should be easy. Open each one in turn, check the first few
bytes and see if they are the zip header. If so, save that.

For text, I guess just open it and see if it looks text-like (mostly just
bytes in the right ranges).

Nick
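
The two checks described above could look roughly like this. It is a minimal sketch in plain Java, assuming the stream content has already been read into a byte array. The class name, method names and the 0.95 threshold are made up for illustration, and the text test only recognises ASCII.

```java
import java.nio.charset.StandardCharsets;

final class StreamSniffer {

    /** True if the data starts with the ZIP local file header magic "PK\003\004". */
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    /** True if at least minRatio of the bytes are tab, CR, LF or printable ASCII. */
    static boolean looksLikeText(byte[] data, double minRatio) {
        if (data.length == 0) {
            return false;
        }
        int textLike = 0;
        for (byte b : data) {
            int c = b & 0xFF;
            if (c == '\t' || c == '\n' || c == '\r' || (c >= 0x20 && c < 0x7F)) {
                textLike++;
            }
        }
        return (double) textLike / data.length >= minRatio;
    }

    public static void main(String[] args) {
        byte[] zip = {'P', 'K', 3, 4, 20, 0};
        byte[] text = "plain text\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeZip(zip));          // zip magic at offset 0
        System.out.println(looksLikeText(text, 0.95));  // all bytes are text-like
        System.out.println(looksLikeText(zip, 0.95));   // mostly control bytes
    }
}
```

Both checks look at the stream from offset 0, so a stream that stores the payload behind some header would not match as-is.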



Re: Save embedded files to disk

Posted by Yury Batrakov <ba...@gmail.com>.
On 5/7/08, Nick Burch <ni...@torchbox.com> wrote:
>  You have word at:
>   \
>   \ObjectPool\_1271595753\
>   \ObjectPool\_1271595753\ObjectPool\_1268214091\
>  And excel at:
>   \ObjectPool\_1271595753\ObjectPool\_1268480555\
>  My hunch is that your zip file is somewhere in
>   \ObjectPool\_1271662200\

It's quite obvious and I can already save office files, but the
problem is saving zip and other files: I don't know which of the
streams in _1271662200 I should open and save. Word, Excel and others
have predefined stream names, but all the others don't.

>  OK, not a full OLE2 image. Try looking at the first 20 bytes or so, and
> compare it to those from POIFSViewer. You might find that for example, it's
> the Workbook stream of an excel file, without the normal OLE2 wrapper.

I opened the extracted and original files in a hex editor and found that
the header added to the extracted files has variable length: some
headers are 20 bytes long, others are 82 bytes long. I also tried
to examine Wine's implementation of OLESaveToStream but gave up :(
There's a temporary workaround: define signatures for the supported file
formats and extract starting from the signature, but this looks dirty.
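
For what it's worth, the signature workaround could be sketched like this in plain Java. This is only an illustration under assumptions: the class and method names are hypothetical, the two signatures (ZIP and PDF) are examples rather than a complete list, and anything stored after the payload is not trimmed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SignatureScan {

    // Example signatures only; a real list would need one entry per supported format.
    static final byte[] ZIP = {'P', 'K', 3, 4};
    static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Index of the first occurrence of sig in data, or -1 if it does not occur. */
    static int indexOf(byte[] data, byte[] sig) {
        outer:
        for (int i = 0; i + sig.length <= data.length; i++) {
            for (int j = 0; j < sig.length; j++) {
                if (data[i + j] != sig[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Returns data from the earliest known signature onward, or null if none is found. */
    static byte[] stripHeader(byte[] data, byte[]... signatures) {
        int best = -1;
        for (byte[] sig : signatures) {
            int at = indexOf(data, sig);
            if (at >= 0 && (best < 0 || at < best)) {
                best = at;
            }
        }
        return best < 0 ? null : Arrays.copyOfRange(data, best, data.length);
    }

    public static void main(String[] args) {
        // 20 bytes of made-up header followed by the start of a zip file.
        byte[] wrapped = new byte[20 + 6];
        System.arraycopy(new byte[] {'P', 'K', 3, 4, 20, 0}, 0, wrapped, 20, 6);
        byte[] payload = stripHeader(wrapped, ZIP, PDF);
        System.out.println(payload.length);
    }
}
```

Because it does not depend on the header length, the same scan covers both the 20-byte and the 82-byte case, but it can misfire if the header itself happens to contain one of the signatures.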



Re: Save embedded files to disk

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 7 May 2008, Yury Batrakov wrote:
> It doesn't help :( For example, I have a document with word, zip and
> text files embedded. The embedded word document contains excel and word.
> These are filesystem entries:

You have word at:
  \
  \ObjectPool\_1271595753\
  \ObjectPool\_1271595753\ObjectPool\_1268214091\
And excel at:
  \ObjectPool\_1271595753\ObjectPool\_1268480555\
My hunch is that your zip file is somewhere in
  \ObjectPool\_1271662200\

>>  Try feeding it to poifs and see what it thinks of it?
>>
> I tried it:
> java.io.IOException: Invalid header signature; read 8589935873,
> expected -2226271756974174256

OK, not a full OLE2 image. Try looking at the first 20 bytes or so, and 
compare it to those from POIFSViewer. You might find that for example, 
it's the Workbook stream of an excel file, without the normal OLE2 
wrapper.

Nick
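
The two numbers in the exception are easier to compare as bytes. A small sketch in plain Java, assuming the signature is read from the file as an 8-byte little-endian value; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SignatureBytes {

    /** Renders a long as the eight bytes it was read from, in little-endian file order. */
    static String toFileOrderHex(long value) {
        byte[] bytes = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the exception message.
        System.out.println("expected: " + toFileOrderHex(-2226271756974174256L));
        System.out.println("read:     " + toFileOrderHex(8589935873L));
    }
}
```

Under that assumption the expected value is D0 CF 11 E0 A1 B1 1A E1, the usual OLE2 magic, while the value actually read is 01 05 00 00 02 00 00 00. That fits the "not a full OLE2 image" reading; the data may start with some other kind of object header, but that part is a guess.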



Re: Save embedded files to disk

Posted by Yury Batrakov <ba...@gmail.com>.
On 5/7/08, Nick Burch <ni...@torchbox.com> wrote:
>  Have you read through
> http://poi.apache.org/poifs/embeded.html ? That ought to
> contain most of the information you'll need

It doesn't help :( For example, I have a document with word, zip and
text files embedded. The embedded word document contains excel and word.
These are the filesystem entries:

\WordDocument
\1Table
\ObjectPool\_1271595753\ObjInfo
\ObjectPool\_1271595753\WordDocument
\ObjectPool\_1271595753\SummaryInformation
\ObjectPool\_1271595753\DocumentSummaryInformation
\ObjectPool\_1271595753\ObjectPool\_1268214091\1Table
\ObjectPool\_1271595753\ObjectPool\_1268214091\ObjInfo
\ObjectPool\_1271595753\ObjectPool\_1268214091\SummaryInformation
\ObjectPool\_1271595753\ObjectPool\_1268214091\DocumentSummaryInformation
\ObjectPool\_1271595753\ObjectPool\_1268214091\WordDocument
\ObjectPool\_1271595753\ObjectPool\_1268214091\CompObj
\ObjectPool\_1271595753\ObjectPool\_1268214091\Data
\ObjectPool\_1271595753\ObjectPool\_1268480555\EPRINT
\ObjectPool\_1271595753\ObjectPool\_1268480555\ObjInfo
\ObjectPool\_1271595753\ObjectPool\_1268480555\SummaryInformation
\ObjectPool\_1271595753\ObjectPool\_1268480555\DocumentSummaryInformation
\ObjectPool\_1271595753\ObjectPool\_1268480555\Workbook
\ObjectPool\_1271595753\ObjectPool\_1268480555\CompObj
\ObjectPool\_1271595753\ObjectPool\_1268480555\Ole
\ObjectPool\_1271595753\1Table
\ObjectPool\_1271595753\CompObj
\ObjectPool\_1271595753\Data
\ObjectPool\_1271662200\CompObj
\ObjectPool\_1271662200\ObjInfo
\ObjectPool\_1271662200\Ole10Native
\ObjectPool\_1271662200\Ole
\CompObj
\Data


>  Try feeding it to poifs and see what it thinks of it?
>
I tried it:
java.io.IOException: Invalid header signature; read 8589935873,
expected -2226271756974174256
	at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:112)
	at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
	at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:121)
	at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:49)
	at ru.mera.tmp.Test.main(Test.java:13)



Re: Save embedded files to disk

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 7 May 2008, Yury Batrakov wrote:
> I'm working on saving embedded documents (not only office ones, but also 
> zip, pdf, etc) to disk and completely lost hope of doing that.

Have you read through http://poi.apache.org/poifs/embeded.html ? That 
ought to contain most of the information you'll need

> BTW, a little bit offtopic: RTF spec defines that embedded RTF objects 
> are in OLESaveToStream format, does POI have code to parse this format?

Try feeding it to poifs and see what it thinks of it?

Nick
