Posted to user@poi.apache.org by Vasko Gjurovski <va...@gmail.com> on 2009/11/11 12:56:29 UTC

HWPF image extraction problem

Hi all,

I am trying to extract the full contents of a .doc document. Text, tables and
bullets work well, but I am stuck on the images. AFAIK the images in an MS Word
file are stored as .emz, which is a gzipped EMF file. This is my code:


        // picTable is the document's pictures table; picC is the index of the picture to extract
        List picList = picTable.getAllPictures();
        Picture picture = (Picture) picList.get(picC);
        String folderPath = PATH;

        // Step 1: write the picture content to disk, assuming it is an .emz (gzip-compressed EMF)
        String emzPath = folderPath+picture.suggestFullFileName()+".emz";
        OutputStream image = new FileOutputStream(emzPath);
        picture.writeImageContent(image);
        image.close();

        // Step 2: gunzip the .emz into an .emf (this is the step that fails)
        InputStream is = new FileInputStream(new File(emzPath));
        GZIPInputStream gzipis = new GZIPInputStream(is);
        OutputStream emfos = new FileOutputStream(
                new File(folderPath+picture.suggestFullFileName()+".emf"));
        byte[] buf = new byte[1024];
        int len;
        while ((len = gzipis.read(buf)) > 0) {
          emfos.write(buf, 0, len);
        }
        gzipis.close();
        emfos.close();

This should extract the EMF image file from the .emz. However, my code fails
because the stream wrapped by gzipis (the supposed gzip InputStream) is not
gzip data at all! It seems that the extracted image is not an .emz file. I
tried another approach: saving the Word file as HTML (which stores the images
in a separate folder), which gave me the images as .emz and .gif. The .emz
file from that export and the file from my extraction differ in size, which
suggests that my extraction is wrong. I can open the .emz file from the HTML
export with gzip, but not my extracted file, which is rejected as not being a
valid gzip file.
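
One way to narrow this down is to look at the first bytes of the extracted
data instead of assuming gzip. The following is only a minimal sketch: the
class ImageFormatSniffer and its methods are hypothetical names, not part of
POI. It relies on the published magic numbers (gzip starts with 1f 8b; an EMF
header starts with record type 1 and carries the signature " EMF" at byte
offset 40). The zlib test is a heuristic, and whether the extracted picture
data is actually zlib-compressed is an assumption that would need checking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper (not part of POI): guesses a format from the leading bytes.
final class ImageFormatSniffer {

    // Returns "gzip", "emf", "png", "jpeg", "zlib" or "unknown".
    static String detect(byte[] data) {
        // gzip magic number (RFC 1952); a real .emz file starts with these two bytes
        if (startsWith(data, 0, 0x1f, 0x8b)) {
            return "gzip";
        }
        // EMF: first record type is 1 (little-endian), signature " EMF" at offset 40
        if (startsWith(data, 0, 0x01, 0x00, 0x00, 0x00)
                && startsWith(data, 40, 0x20, 0x45, 0x4d, 0x46)) {
            return "emf";
        }
        if (startsWith(data, 0, 0x89, 0x50, 0x4e, 0x47)) {
            return "png";
        }
        if (startsWith(data, 0, 0xff, 0xd8, 0xff)) {
            return "jpeg";
        }
        // Heuristic zlib check (RFC 1950): compression method 8 in the low nibble
        // and the first two bytes, read as a 16-bit number, divisible by 31
        if (data.length >= 2 && (data[0] & 0x0f) == 8
                && (((data[0] & 0xff) << 8) | (data[1] & 0xff)) % 31 == 0) {
            return "zlib";
        }
        return "unknown";
    }

    // Decompresses only when the data really starts with the gzip magic number;
    // anything else is returned unchanged.
    static byte[] gunzipIfNeeded(byte[] data) throws IOException {
        if (!"gzip".equals(detect(data))) {
            return data;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int len;
            while ((len = in.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
            return out.toByteArray();
        }
    }

    // True if data contains the expected unsigned byte values starting at offset.
    private static boolean startsWith(byte[] data, int offset, int... expected) {
        if (data.length < offset + expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if ((data[offset + i] & 0xff) != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Running detect() on both the file from the HTML export and the bytes written
by writeImageContent() would show whether the two really are the same kind of
data before any decompression is attempted.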

Any help with this?

Best regards,
Vasko

Re: HWPF image extraction problem

Posted by jonycus <va...@gmail.com>.
Has anyone run into the same problem, or does anyone have experience with this?


