Posted to user@poi.apache.org by Hong-Thai Nguyen <Ho...@polyspot.com> on 2014/02/21 11:28:28 UTC

Extract thumbnail of MS Office document?

Hi all,

I'm trying to extract the thumbnail of an MS Word document using HPSF (the file has an embedded thumbnail). Following the documentation at http://poi.apache.org/hpsf/thumbnails.html, I wrote the following code:
import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
    // Wrap the raw thumbnail property bytes
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat()); // the exception is thrown here
    System.out.println(thumbnail.getClipboardFormatTag());
    return thumbnail.getThumbnailAsWMF();
}

Unfortunately, the extraction raises an exception:
Converting E:\test.doc
Saving output to E:\test.wmf
org.apache.poi.hpsf.HPSFException: Clipboard Format Tag of Thumbnail must be CFTAG_WINDOWS.
       at org.apache.poi.hpsf.Thumbnail.getClipboardFormat(Thumbnail.java:234)
       at DOC2JPG.process(DOC2JPG.java:52)
       at DOC2JPG.main(DOC2JPG.java:33)
Michel ARNOULD
Microsoft Word 9.0:GROUPE DE PAIRS DE VILLIERS-ST-GEORGES

I exported the content of summaryInformation.getThumbnail() to a file and viewed it in a hex editor. The 4-byte value of the clipboard format tag is never -1 (CFTAG_WINDOWS), but '4294967295':
18 33 00 00 FF FF FF FF 03 00 00 00 08 00 05 52
01 74 E2 18 01 00 09 00 00 03 7C 19 00 00 0A 00
...
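
To make the two readings of the tag concrete, here is a minimal sketch that decodes the first 12 bytes of the dump with plain java.nio, without POI. The layout used (size at offset 0, format tag at offset 4, format at offset 8, all little-endian) is my assumption from the dump, and the class and method names are only illustrative. It suggests that FF FF FF FF is -1 when read as a signed 32-bit int and 4294967295 when read as unsigned, so the two values may be the same bit pattern and the check may be failing on a signed/unsigned comparison.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ClipboardTagDemo {
    // First 12 bytes of the hex dump above.
    static final byte[] HEADER = {
        0x18, 0x33, 0x00, 0x00,
        (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF,
        0x03, 0x00, 0x00, 0x00
    };

    // Read 4 bytes at the given offset as a signed little-endian int.
    static int readInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    // Read the same 4 bytes as an unsigned value, widened to long.
    static long readUInt(byte[] data, int offset) {
        return readInt(data, offset) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        System.out.println("size (offset 0, assumed): " + readInt(HEADER, 0));
        System.out.println("tag, signed:              " + readInt(HEADER, 4));
        System.out.println("tag, unsigned:            " + readUInt(HEADER, 4));
        System.out.println("format (offset 8, assumed): " + readInt(HEADER, 8));
        // Comparing the unsigned long against the int constant -1 widens -1 to -1L,
        // so the comparison is false unless the value is narrowed back to int first.
        System.out.println("unsigned == -1:       " + (readUInt(HEADER, 4) == -1));
        System.out.println("(int) unsigned == -1: " + ((int) readUInt(HEADER, 4) == -1));
    }
}
```

If that is what happens, the value 3 at offset 8 would be the Windows clipboard format CF_METAFILEPICT, which matches a WMF thumbnail.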

I tested some other Word documents; the format tag value is always '4294967295'.
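
As a possible workaround, here is a sketch that skips the header and copies the remaining bytes directly, without going through the Thumbnail checks. It assumes the WMF data starts at offset 20 of the thumbnail bytes; I infer this only from the dump, where the bytes at that position (01 00 09 00 00 03) look like the start of a standard WMF header. The class name and the constant are hypothetical, not part of POI.

```java
import java.util.Arrays;

public class RawWmfExtractor {
    // Assumed offset of the WMF data inside the thumbnail property bytes.
    static final int WMF_OFFSET = 20;

    // Return everything after the assumed 20-byte header.
    static byte[] extractWmf(byte[] thumbnailBytes) {
        if (thumbnailBytes == null || thumbnailBytes.length <= WMF_OFFSET) {
            throw new IllegalArgumentException("thumbnail data too short");
        }
        return Arrays.copyOfRange(thumbnailBytes, WMF_OFFSET, thumbnailBytes.length);
    }

    public static void main(String[] args) {
        // The 32 bytes shown in the dump; a real thumbnail is much longer.
        byte[] dump = {
            0x18, 0x33, 0x00, 0x00, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF,
            0x03, 0x00, 0x00, 0x00, 0x08, 0x00, 0x05, 0x52,
            0x01, 0x74, (byte) 0xE2, 0x18, 0x01, 0x00, 0x09, 0x00,
            0x00, 0x03, 0x7C, 0x19, 0x00, 0x00, 0x0A, 0x00
        };
        byte[] wmf = extractWmf(dump);
        // Print the first bytes of the extracted payload for a visual check.
        System.out.printf("%02X %02X %02X %02X%n", wmf[0], wmf[1], wmf[2], wmf[3]);
    }
}
```

In process(), this would be called as extractWmf(summaryInformation.getThumbnail()). I have not verified that the result opens in every WMF viewer.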

Thanks a lot for your help.

Hong-Thai