Posted to users@pdfbox.apache.org by Antoni Mylka <an...@aduna-software.com> on 2010/04/28 14:44:58 UTC
Getting the raw, undecoded content of an Image
Hi,
I'm writing a program that extracts images from a PDF. I was inspired by
the algorithm presented in the ExtractImages class:
http://kickjava.com/src/org/pdfbox/ExtractImages.java.htm
I tried it on an 8 MB ebook which turned out to contain on the order of
50K small images. They were all PNGs. My profiler revealed that 95% of
the time was spent in the image.write2OutputStream method, the vast
majority of it in Deflater, the class that compresses the image data to
produce a normal PNG file.
My idea was:
- get the basic raw, undecoded bytes of the image
- compute a hash of them
- if the hash hasn't been seen before - decode the full image,
otherwise go on
My reasoning was that the same image must occur many times, so decoding
only the unique ones would make the whole thing faster.
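
The steps above could be sketched roughly as follows. This is only a
minimal sketch under assumptions: the class name RawImageDeduplicator and
its isNew methods are hypothetical, SHA-1 is just one possible choice of
hash, and how the raw bytes are obtained from the image is left open. It
uses only java.security.MessageDigest and a HashSet.

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a hash of every raw image stream seen so far.
class RawImageDeduplicator {
    private final Set<String> seen = new HashSet<String>();
    private final MessageDigest digest;

    RawImageDeduplicator() {
        try {
            // SHA-1 is an assumption; any digest with few collisions would do.
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true the first time these raw bytes are seen, false afterwards.
    boolean isNew(byte[] rawBytes) {
        digest.reset();
        return remember(digest.digest(rawBytes));
    }

    // Same check, but hashes the stream in chunks instead of buffering it all.
    boolean isNew(InputStream raw) throws IOException {
        digest.reset();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = raw.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        return remember(digest.digest());
    }

    // Set.add returns false if the hex form of the hash was already present.
    private boolean remember(byte[] hash) {
        return seen.add(new BigInteger(1, hash).toString(16));
    }
}
```

The extraction loop would then call isNew on the raw bytes of each image
and only decode and write the image when it returns true.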
Now my question: how do I get the raw, undecoded bytes from an
instance of PDXObjectImage?
I tried
image.getPDStream().createInputStream()
and
image.getPDStream().getStream().getUnfilteredStream()
Both work on PDFs with embedded PNG files, but with embedded JPGs I
get a warning:
Warning: DCTFilter.decode is not implemented yet, skipping this stream.
and the returned stream is empty.
What should I do?
Thanks in advance
Antoni Mylka
antoni.mylka@aduna-software.com