Posted to user@poi.apache.org by Bing Ran <bi...@gmail.com> on 2014/05/28 12:29:00 UTC

equation images in a .doc

Hi,

I'm new to the list, but I have a pressing need to extract all the embedded
equation images from a Word 97 .doc file (not .docx).

I know all those images are in WMF format. After I dumped the picture
content (from Picture.getContent()) to a file, I found that the file
was not an entirely valid WMF, or at least it did not have the correct size
information.

I'd appreciate it so much if someone can get me started on the right track.

Thanks!

Bing

Re: equation images in a .doc

Posted by Andreas Beeker <an...@gmx.de>.
Hi Bing,

I haven't (re-)searched the checksum algorithm yet, but I would be happy if you told me
what you have found out.

Thank you,
Andi

On 29.05.2014 04:49, Bing Ran wrote:
> I figured out the checksum:)


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: equation images in a .doc

Posted by Bing Ran <bi...@gmail.com>.
I figured out the checksum:)



2014-05-29 10:37 GMT+08:00 Bing Ran <bi...@gmail.com>:

> Thanks Andi! I'm working in that direction.
>
> I'm wondering how the checksum is calculated. That's the last two bytes in
> the header...
>
> Thanks for helping!
>
> Bing

Re: equation images in a .doc

Posted by Bing Ran <bi...@gmail.com>.
Thanks Andi! I'm working in that direction.

I'm wondering how the checksum is calculated. That's the last two bytes in
the header...

Thanks for helping!

Bing
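
Editor's sketch for the checksum question above: in the WMF format documentation, the placeable header's checksum (its last two bytes) is the XOR of the ten 16-bit little-endian words that precede it. The thread does not say what Bing actually implemented, so this is only an illustration under that assumption. The class name PlaceableHeader, its methods, and the sample bounding box are hypothetical and are not POI API (this is not the WmfPlaceableHeader class from the wmf_render branch).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Apache POI.
class PlaceableHeader {
    static final int KEY = 0x9AC6CDD7; // magic number, written little endian
    static final int SIZE = 22;        // total header length in bytes

    /** XOR of the first ten 16-bit little-endian words (bytes 0..19). */
    static int checksum(byte[] header) {
        if (header.length < 20) {
            throw new IllegalArgumentException("need at least 20 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int sum = 0;
        for (int word = 0; word < 10; word++) {
            sum ^= buf.getShort(word * 2) & 0xFFFF;
        }
        return sum;
    }

    /** Builds a 22-byte placeable header for the given bounding box. */
    static byte[] build(int left, int top, int right, int bottom, int unitsPerInch) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(KEY);                     // bytes 0..3: magic
        buf.putShort((short) 0);             // bytes 4..5: handle, unused
        buf.putShort((short) left);          // bytes 6..13: bounding box
        buf.putShort((short) top);
        buf.putShort((short) right);
        buf.putShort((short) bottom);
        buf.putShort((short) unitsPerInch);  // bytes 14..15: logical units per inch
        buf.putInt(0);                       // bytes 16..19: reserved
        buf.putShort((short) checksum(buf.array())); // bytes 20..21: checksum
        return buf.array();
    }

    public static void main(String[] args) {
        // Sample values only; a real bounding box must come from the picture.
        byte[] header = build(0, 0, 1000, 500, 1440);
        System.out.printf("checksum = 0x%04X%n", checksum(header)); // prints 0x50AD
    }
}
```

A header built this way can be written in front of the bytes returned by Picture.getContent(); whether the chosen bounding box suits a given equation image still has to be checked in a viewer.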





2014-05-29 4:04 GMT+08:00 Andreas Beeker <an...@gmx.de>:

> Hi Bing,
>
> The WMF code is in a branch because I'd like to commit (save) it without
> interfering with the trunk ... and it's still far from finished.
> I'm not sure how the GitHub synchronisation works, but I guess "it" only
> fetches the trunk.
>
> As for the header: I couldn't find the reference in a quick search, but
> I think the WMF pictures are always saved without the placeable header.
> At least when I implemented/tested the WMF code in
> HSSFWorkbook.addPicture, it only worked when I removed that header -
> maybe it's different with HWPF ...
> So I guess JWord is simply making a header up - either it uses standard
> values for the bounding box or it reads one of the window records [1].
>
> Anyway, I would simply try to prepend the 22 bytes of a working picture to
> the pictures that are missing the magic code 0x9AC6CDD7 (little endian) and
> see if your post-processing is OK with that.
>
> Andi
>
>
> [1] http://svn.apache.org/repos/asf/poi/branches/wmf_render/src/scratchpad/src/org/apache/poi/hwmf/record/WmfWindowing.java

Re: equation images in a .doc

Posted by Andreas Beeker <an...@gmx.de>.
Hi Bing,

The WMF code is in a branch because I'd like to commit (save) it without interfering with the trunk ... and it's still far from finished.
I'm not sure how the GitHub synchronisation works, but I guess "it" only fetches the trunk.

As for the header: I couldn't find the reference in a quick search, but I think the WMF pictures are
always saved without the placeable header. At least when I implemented/tested the WMF code in HSSFWorkbook.addPicture, it only worked when I removed that header - maybe it's different with HWPF ...
So I guess JWord is simply making a header up - either it uses standard values for the bounding box or it reads one of the window records [1].

Anyway, I would simply try to prepend the 22 bytes of a working picture to the pictures that are missing the magic code 0x9AC6CDD7 (little endian) and see if your post-processing is OK with that.

Andi


[1] http://svn.apache.org/repos/asf/poi/branches/wmf_render/src/scratchpad/src/org/apache/poi/hwmf/record/WmfWindowing.java
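
Editor's sketch of the experiment suggested above: check whether a file starts with the magic code 0x9AC6CDD7 (bytes D7 CD C6 9A on disk, since it is little endian) and, if not, place the first 22 bytes of a working picture in front of it. The class HeaderTransplant and its method names are hypothetical, not POI API. Note the stated limitation: the copied header carries the working picture's bounding box and checksum, so the result may display at the wrong size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Apache POI.
class HeaderTransplant {
    // 0x9AC6CDD7 as it appears on disk (little endian)
    static final byte[] MAGIC = {(byte) 0xD7, (byte) 0xCD, (byte) 0xC6, (byte) 0x9A};
    static final int HEADER_SIZE = 22;

    /** True if the data is long enough and starts with the placeable magic. */
    static boolean hasPlaceableHeader(byte[] wmf) {
        return wmf.length >= HEADER_SIZE
                && Arrays.equals(Arrays.copyOf(wmf, MAGIC.length), MAGIC);
    }

    /** Returns raw with the 22-byte header of working in front, unless raw already has one. */
    static byte[] withHeaderFrom(byte[] working, byte[] raw) {
        if (!hasPlaceableHeader(working)) {
            throw new IllegalArgumentException("donor file has no placeable header");
        }
        if (hasPlaceableHeader(raw)) {
            return raw.clone(); // nothing to repair
        }
        byte[] out = new byte[HEADER_SIZE + raw.length];
        System.arraycopy(working, 0, out, 0, HEADER_SIZE);
        System.arraycopy(raw, 0, out, HEADER_SIZE, raw.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: HeaderTransplant <working.wmf> <raw.wmf> <out.wmf>");
            return;
        }
        byte[] working = Files.readAllBytes(Paths.get(args[0]));
        byte[] raw = Files.readAllBytes(Paths.get(args[1]));
        Files.write(Paths.get(args[2]), withHeaderFrom(working, raw));
    }
}
```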

On 28.05.2014 18:34, Bing Ran wrote:
> Hi Andreas,
>
> Thanks for the answer.
>
> The raw data was acquired by overriding
> AbstractWordConverter.processingImage()...
> in the hwpf package and calling picture.getContent(). After reading your
> code reference, I cannot immediately figure out how to reset the header.
>
> BTW, I was using a local compile of the POI modules from GitHub. Is the
> code considered out of date? I could not find the hwmf package in the
> GitHub code.
>
> Thanks
>
> Bing
>




Re: equation images in a .doc

Posted by Bing Ran <bi...@gmail.com>.
I used JWord from IndependentSoft to extract the images and got the same set
of WMF files, only 22 bytes longer. Those files displayed properly, so the
headers were chopped off by POI.

I'm wondering how tools like JWord recover the header information...
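
Editor's sketch of one way the missing bounding box could be recovered, following the guess elsewhere in this thread that a tool might read the window records. How JWord actually works is not known here. The code assumes the data starts with the standard 18-byte WMF header (no placeable header), that each record is a 32-bit size in 16-bit words followed by a 16-bit function code, and that META_SETWINDOWORG (0x020B) and META_SETWINDOWEXT (0x020C) store Y before X. The class WindowRecords is hypothetical and not POI API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Apache POI.
class WindowRecords {
    static final int META_EOF = 0x0000;
    static final int META_SETWINDOWORG = 0x020B;
    static final int META_SETWINDOWEXT = 0x020C;

    /**
     * Returns {orgX, orgY, extX, extY}, or null if no META_SETWINDOWEXT
     * record is found. If a record type occurs several times, the last wins.
     */
    static int[] findWindow(byte[] wmf) {
        if (wmf.length < 18) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(wmf).order(ByteOrder.LITTLE_ENDIAN);
        // Header size is stored in 16-bit words at offset 2 (normally 9 words).
        int pos = (buf.getShort(2) & 0xFFFF) * 2;
        if (pos < 18) {
            return null;
        }
        int orgX = 0, orgY = 0, extX = 0, extY = 0;
        boolean haveExt = false;
        while (pos + 6 <= wmf.length) {
            long sizeInWords = buf.getInt(pos) & 0xFFFFFFFFL;
            int function = buf.getShort(pos + 4) & 0xFFFF;
            long next = pos + sizeInWords * 2;
            if (function == META_EOF || sizeInWords < 3 || next > wmf.length) {
                break; // end of file or malformed record
            }
            if (sizeInWords >= 5) { // room for two 16-bit parameters
                if (function == META_SETWINDOWORG) {
                    orgY = buf.getShort(pos + 6);
                    orgX = buf.getShort(pos + 8);
                } else if (function == META_SETWINDOWEXT) {
                    extY = buf.getShort(pos + 6);
                    extX = buf.getShort(pos + 8);
                    haveExt = true;
                }
            }
            pos = (int) next;
        }
        return haveExt ? new int[] {orgX, orgY, extX, extY} : null;
    }
}
```

The origin and extent found this way could serve as the left/top and right/bottom values of a reconstructed placeable header; the units-per-inch value is not stored in these records and would still have to be chosen.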





2014-05-29 0:34 GMT+08:00 Bing Ran <bi...@gmail.com>:

> Hi Andreas,
>
> Thanks for the answer.
>
> The raw data was acquired by overriding
> AbstractWordConverter.processingImage()...
> in the hwpf package and calling picture.getContent(). After reading your
> code reference, I cannot immediately figure out how to reset the header.
>
> BTW, I was using a local compile of the POI modules from GitHub. Is the
> code considered out of date? I could not find the hwmf package in the
> GitHub code.
>
> Thanks
>
> Bing

Re: equation images in a .doc

Posted by Bing Ran <bi...@gmail.com>.
Hi Andreas,

Thanks for the answer.

The raw data was acquired by overriding
AbstractWordConverter.processingImage()...
in the hwpf package and calling picture.getContent(). After reading your
code reference, I cannot immediately figure out how to reset the header.

BTW, I was using a local compile of the POI modules from GitHub. Is the
code considered out of date? I could not find the hwmf package in the
GitHub code.

Thanks

Bing







2014-05-28 23:13 GMT+08:00 Andreas Beeker <an...@gmx.de>:

> Hi Bing,
>
> Maybe the WMFs are missing the WMF header, which can be chopped off when
> the WMF is embedded [2] - so if the size is 22 bytes too big, this would
> be a good indication.
>
> I started implementing WMF parsing a while ago, and maybe you can
> recreate the header with the WmfPlaceableHeader class [1].
>
> Andi
>
> [1] http://svn.apache.org/repos/asf/poi/branches/wmf_render/src/scratchpad/src/org/apache/poi/hwmf/record/WmfPlaceableHeader.java
> [2] org.apache.poi.hssf.usermodel.HSSFWorkbook.addPicture()

Re: equation images in a .doc

Posted by Andreas Beeker <an...@gmx.de>.
Hi Bing,

Maybe the WMFs are missing the WMF header, which can be chopped off when the WMF is embedded [2]
- so if the size is 22 bytes too big, this would be a good indication.

I started implementing WMF parsing a while ago, and maybe you can recreate the header with the
WmfPlaceableHeader class [1].

Andi

[1] http://svn.apache.org/repos/asf/poi/branches/wmf_render/src/scratchpad/src/org/apache/poi/hwmf/record/WmfPlaceableHeader.java
[2] org.apache.poi.hssf.usermodel.HSSFWorkbook.addPicture()

On 28.05.2014 12:29, Bing Ran wrote:
> Hi,
>
> I'm new to the list, but I have a pressing need to extract all the embedded
> equation images from a Word 97 .doc file (not .docx).
>
> I know all those images are in WMF format. After I dumped the picture
> content (from Picture.getContent()) to a file, I found that the file
> was not an entirely valid WMF, or at least it did not have the correct size
> information.
>
> I'd appreciate it so much if someone can get me started on the right track.
>
> Thanks!
>
> Bing
>


