Posted to users@pdfbox.apache.org by Antoni Mylka <an...@aduna-software.com> on 2010/04/28 14:44:58 UTC

Getting the raw, undecoded content of an Image

Hi,

I'm writing a program that extracts images from a PDF. I was inspired by 
the algorithm presented in the ExtractImages class

http://kickjava.com/src/org/pdfbox/ExtractImages.java.htm

I tried it on an 8 MB ebook which turned out to contain on the order of 
50K small images. They were all PNGs. My profiler revealed that 95% of 
the time was spent in the image.write2OutputStream method, the vast 
majority of it in Deflater - the class that compresses the image data to 
produce a normal PNG file.

My idea was to
  - get the raw, undecoded bytes of the image
  - compute a hash of them
  - if the hash hasn't been seen before, decode the full image; 
otherwise skip it and go on

My reasoning was that the same image must occur many times, so decoding 
only the unique ones would make it all faster.
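
A minimal sketch of that idea in Java, assuming the raw bytes are 
already available as a byte[] (obtaining them from PDFBox is the open 
question below). The class name RawImageDeduplicator and the method 
isNew are hypothetical names for illustration, not part of PDFBox, and 
SHA-256 is just one possible choice of hash.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a digest of every raw image seen so far.
class RawImageDeduplicator {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true only the first time a given byte sequence is seen.
    boolean isNew(byte[] rawBytes) {
        return seenDigests.add(toHex(digest(rawBytes)));
    }

    private static byte[] digest(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the raw streams of three images; two are identical.
        byte[][] rawImages = { {1, 2, 3}, {4, 5, 6}, {1, 2, 3} };
        RawImageDeduplicator dedup = new RawImageDeduplicator();
        int decoded = 0;
        for (byte[] raw : rawImages) {
            if (dedup.isNew(raw)) {
                decoded++; // the expensive decode-and-write step would go here
            }
        }
        System.out.println("decoded " + decoded + " of " + rawImages.length);
    }
}
```

Only the digests are kept in memory, so the set stays small even with 
tens of thousands of images.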

Now my question: how do I get the raw, undecoded bytes from an 
instance of PDXObjectImage?

I tried

image.getPDStream().createInputStream()
image.getPDStream().getStream().getUnfilteredStream()

Both work on PDFs with embedded PNG images, but with embedded JPEGs I 
get a warning:

Warning: DCTFilter.decode is not implemented yet, skipping this stream.

and the returned stream is empty.

What should I do?

Thanks in advance

Antoni Mylka
antoni.mylka@aduna-software.com