Posted to dev@pdfbox.apache.org by Andreas Lehmkühler <an...@lehmi.de> on 2009/09/15 07:40:11 UTC

[VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Hi,

I have posted a candidate for the first Apache release of PDFBox
developed within the PDFBox podling. The candidate can be found at

http://people.apache.org/~lehmi/pdfbox/pdfbox-0.8.0-incubating/

See the RELEASE-NOTES.txt file (also included at the end of this
message) for details on release contents. The release candidate is a
jar archive of the sources in

http://svn.apache.org/repos/asf/incubator/pdfbox/tags/0.8.0-incubating.

The MD5 checksum of the pdfbox-0.8.0-incubating-src.jar release package
is 1E 2B 55 FC 8C 9D 7C 31  16 AE 37 91 42 30 F5 39.
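For anyone who wants to compare a download against the checksum above without extra tools, here is a minimal sketch in plain Java. The class name Md5Check and its helper methods are made up for this illustration and are not part of PDFBox. Note that a matching MD5 only shows the file arrived intact; the PGP signature is what speaks to authenticity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
class Md5Check {

    // Strip whitespace and lower-case the text, so that "1E 2B 55 ..." and
    // "1e2b55..." compare equal.
    static String normalize(String hex) {
        return hex.replaceAll("\\s+", "").toLowerCase();
    }

    // Hex-encode a digest as lower-case two-character pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Stream the input through MD5 so large archives need not fit in memory.
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        return toHex(md.digest());
    }

    public static void main(String[] args) throws Exception {
        // Assumed invocation: java Md5Check <file> "<published checksum>"
        Path file = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(file)) {
            boolean ok = md5Hex(in).equals(normalize(args[1]));
            System.out.println(ok ? "MD5 matches" : "MD5 MISMATCH");
        }
    }
}
```

The published checksum can be passed exactly as written in this mail, spaces included, since normalize() removes them before comparing.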

Please vote on releasing this package as Apache PDFBox
0.8.0-incubating. The vote is open for the next 72 hours and passes if
a majority of at least three +1 PDFBox PPMC votes is reached. Assuming
the vote passes, I will ask the Incubator PMC to approve the release.

[ ] +1 Release this package as Apache PDFBox 0.8.0-incubating
[ ] -1 Do not release this package because...

Along with the source release I have included a pre-compiled jar file.
The Maven POM file from the source release is included as well, so that we
can deploy the released jar to the central Maven repository if the
release vote passes.
In addition, I have included a pre-compiled standalone jar that bundles
all external libraries needed to run PDFBox.

Changelog for my third attempt:
- added README.txt and RELEASE-NOTES.txt to the standalone jar
- included all files from svn in the src jar
- included the changes from the weekend (logging, colorspace caching,
  font handling)
- updated the release notes, both the txt and the web version

Here's my +1.

BR
Andreas Lehmkühler

Release Notes -- Apache PDFBox -- Version 0.8.0-incubating

Introduction
------------

Apache PDFBox is an open source Java library for working with PDF documents.

This 0.8.0-incubating release is the first PDFBox release made at the
Apache Software Foundation. The most notable change since the previous
release (0.7.3) is the renaming of all Java packages from org.pdfbox to
org.apache.pdfbox. If you've used PDFBox before, you need to update all
your client code to use the renamed PDFBox packages.
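As a rough illustration of the client-side change, the sketch below rewrites references to the old root package in a piece of Java source text. The class PackageRename and its migrate method are hypothetical names used only for this example; it is plain text substitution, so review the result before committing it.

```java
import java.util.regex.Pattern;

// Hypothetical migration helper, not shipped with PDFBox.
class PackageRename {

    // Match "org.pdfbox" only as a complete leading package token: the
    // look-behind rejects "xorg.pdfbox" and "com.org.pdfbox", and the trailing
    // word boundary rejects names such as "org.pdfboxer".
    private static final Pattern OLD_ROOT =
            Pattern.compile("(?<![\\w.])org\\.pdfbox\\b");

    // Return the source text with the old root package replaced by the new one.
    static String migrate(String source) {
        return OLD_ROOT.matcher(source).replaceAll("org.apache.pdfbox");
    }

    public static void main(String[] args) {
        System.out.println(migrate("import org.pdfbox.pdmodel.PDDocument;"));
    }
}
```

Text that already uses org.apache.pdfbox is left untouched, because the old name does not occur inside the new one.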

The -incubating label included in the version number reflects the incubation
status of the project. See the disclaimer below for more about incubation.

See the Apache PDFBox website at http://incubator.apache.org/pdfbox/ for
more information.

Release Contents
----------------

This release consists of a source archive (pdfbox-0.8.0-incubating-src.jar).
You can build the release with Apache Ant like this:

jar xf pdfbox-0.8.0-incubating-src.jar
cd pdfbox-0.8.0-incubating
ant

The source archive is accompanied by SHA1 and MD5 checksums and a PGP
signature that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://svn.apache.org/repos/asf/incubator/pdfbox/KEYS.

Changelog
---------

Bug

* [PDFBOX-51] - PDFToImage fails to render correctly
* [PDFBOX-93] - Error in FlateFilter?
* [PDFBOX-94] - Unexpected end of ZLIB input stream
* [PDFBOX-107] - viewer crashed
* [PDFBOX-110] - bad font data with TrueTypeFont
* [PDFBOX-141] - PDF to image conversion can lead to mostly black area
* [PDFBOX-148] - Error getting pdf version (NumberFormatException)
* [PDFBOX-152] - Merge Landscape and Portrait PDFs does not keep orientation
* [PDFBOX-162] - font spacing
* [PDFBOX-173] - Some suggested COSString improvements
* [PDFBOX-178] - splitting some words randomnly
* [PDFBOX-183] - java.lang.NullPointerException in
highlighter.generateXMLHig
* [PDFBOX-187] - Error in parsing CMap file
* [PDFBOX-211] - Regression: ArrayIndexOutOfBoundsException in PDFBox 0.7.3
* [PDFBOX-221] - NPE on convertToImage
* [PDFBOX-223] - CurrentColor in PageDrawer Doesn't Restore Properly
* [PDFBOX-224] - Printing Rectangles on rotated pages
* [PDFBOX-227] - ArrayIndexOutOfBoundsException:4
* [PDFBOX-234] - spaces lost
* [PDFBOX-249] - Imbricated XObjects with the same name
* [PDFBOX-250] - Table borders not printing correctly
* [PDFBOX-286] - PDF document renders incorrectly
* [PDFBOX-290] - java.lang.NoSuchMethodError in fontbox
* [PDFBOX-292] - Text Extraction strips 1 char when extracting a twin pair
* [PDFBOX-296] - Extreme memory usage while extracting text from one pdf
* [PDFBOX-313] - OutOfMemoryError for larger PDF text extraction
* [PDFBOX-318] - Error getting pdf version
* [PDFBOX-321] - PDF printing or conversion : lines are too thick - SOLVED ?
* [PDFBOX-324] - One rectangle missing when converting PDF to image
* [PDFBOX-330] - Watermarks aren't correctly showed
* [PDFBOX-335] - Version incompatibility with Lucene?
* [PDFBOX-343] - java.lang.ClassCastException: org.pdfbox.cos.COSArray
cannot
* [PDFBOX-348] - java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
* [PDFBOX-349] - Spaces between words ignored in scanned pdf files
* [PDFBOX-361] - NullPointerException in PDPageNode.getAllKids
* [PDFBOX-364] - Latest trunk uses Java 5 autoboxing
* [PDFBOX-373] - (null) printed when characters cannot be decoded during
text extraction
* [PDFBOX-374] - text areas not properly being sorted because of page
rotation
* [PDFBOX-377] - Incorrect direction of extracted Arabic Text
* [PDFBOX-379] - PDType1Font uses the Java 5 constant Font.TYPE1
* [PDFBOX-385] - ClassCastException when call parseCOSArray in
BaseParser.java
* [PDFBOX-390] - org.pdfbox.filter.ASCIIHexFilter does not skip Whitespace
* [PDFBOX-393] - Maven files in jempbox do not work in Eclipse.
* [PDFBOX-395] - NPE on public key encryption of an unencrypted document
* [PDFBOX-396] - Incorrect permissions after decryption
* [PDFBOX-401] - setStrokingColorSpace and setNonStrokingColorSpace in
PDPageContentStream doesn't work correct
* [PDFBOX-404] - ClassCastException in COSDictionaryMap
* [PDFBOX-407] - PDLineDashPattern missing call to super.clone()
* [PDFBOX-409] - Small hashcode issue, The code invokes hashCode on an
array.
* [PDFBOX-415] - Errors when decomposing Arabic Ligatures
* [PDFBOX-418] - PDFStreamParser reads incorrect number (patch provided)
* [PDFBOX-421] - Unit tests are failing
* [PDFBOX-425] - Silent print ignores passed PrintJob
* [PDFBOX-426] - Class StrokePath has the wrong superclass
* [PDFBOX-428] - Error Printing: dash lengths all zero
* [PDFBOX-436] - PDFontFactory.createFont returns null if the given
parameter fontCache is null
* [PDFBOX-438] - FlateFilter: endless loop because of missing length
check (for encrypted pdfs)
* [PDFBOX-442] - race condition in PdfFont
* [PDFBOX-446] - A empty page produces a NPE
* [PDFBOX-450] - PDFTextStripper CAN NOT extract correct font
information for some early produced PDF documents
* [PDFBOX-452] - [patch] maven build errors in current trunk
* [PDFBOX-453] - FlateFilter decode() throwing OutOfMemoryError
* [PDFBOX-454] - IOException upon opening a PDF
* [PDFBOX-455] - java.lang.ClassCastException: org.pdfbox.cos.COSString
cannot be cast to org.pdfbox.cos.COSName
* [PDFBOX-456] - PDFTextStripperByArea never finds any text (pageNo
check in PDFTextStripper always returns false)
* [PDFBOX-458] - Wrong implementation of COSArray.getInt()
* [PDFBOX-459] - Trailer Dictionary object labeled "Size" is overwritten
when there are 2 xref table objects
* [PDFBOX-466] - error parsing files generated by crystal reports
* [PDFBOX-468] - index out of bounds exception
* [PDFBOX-470] - corrupt zip stream causes document to not parse
* [PDFBOX-471] - invalid dictionary crashes parser
* [PDFBOX-473] - attempt to push back when content read
* [PDFBOX-474] - invalid xref entry causes parser to fail
* [PDFBOX-477] - extra spaces added to rotated text
* [PDFBOX-478] - PDFToImage don't render text in iText generated PDF
* [PDFBOX-482] - DeviceCMYK support in PDColorSpaceFactory
* [PDFBOX-483] - rendering issues during clipping (W/W*-operator)
* [PDFBOX-485] - Fonts not printed on HP laserjet (1320 & 8150) when
having landscape orientation
* [PDFBOX-487] - Font size not rendered with the needed precision
* [PDFBOX-496] - PDDocument.load hangs when loading zero-length file
* [PDFBOX-498] - some pdf-files have no newline after endobj, pdfbox
fails on that
* [PDFBOX-503] - PDF loader causes infinite loop on non-PDF inputs
* [PDFBOX-512] - org.apache.pdfbox.pdmodel.PDDocument.getPageMap()
always returns null

Improvement

* [PDFBOX-302] - Improve font handling (was: layout print problem)
* [PDFBOX-319] - Implementation of PDDeviceCMYK.createColorModel()
* [PDFBOX-358] - Vertical text extraction splitting text
* [PDFBOX-363] - Fixed Page rotation
* [PDFBOX-365] - Updating Lucene version (was: Error in LucenPDFDocument
class)
* [PDFBOX-368] - Use the Maven standard directory layout
* [PDFBOX-376] - Remove the js.jar file
* [PDFBOX-380] - Limited support for SC and SCN operator
* [PDFBOX-381] - Remove direct JAI dependency
* [PDFBOX-387] - new Maven pom.xml files for pdfbox, fontbox, and jempbox
* [PDFBOX-389] - Support for b*, B*, d, i, j and J operator
* [PDFBOX-405] - Not a bug, but definately incorrect code in
PDPageContentStream
* [PDFBOX-437] - Prepare JempBox and FontBox for release
* [PDFBOX-460] - [PATCH] Improvements for bitmap production (resolution
and color depth)
* [PDFBOX-461] - Disable javadoc creation timestamp
* [PDFBOX-472] - use commons logging
* [PDFBOX-507] - [PATCH] Option to disable close warning in finalizer of
COSDocument.

New Feature

* [PDFBOX-98] - Print PDF
* [PDFBOX-264] - colorspace as an array entry
* [PDFBOX-272] - Identify text rotation angle in TextPosition
* [PDFBOX-338] - pdf page extraction
* [PDFBOX-493] - Ability to get page number for bookmarks

Disclaimer
----------

Apache PDFBox is an effort undergoing incubation at The Apache Software
Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
required of all newly accepted projects until a further review indicates
that the infrastructure, communications, and decision making process have
stabilized in a manner consistent with other successful ASF projects. While
incubation status is not necessarily a reflection of the completeness or
stability of the code, it does indicate that the project has yet to be fully
endorsed by the ASF.

See http://incubator.apache.org/projects/pdfbox.html for the current
incubation status of the Apache PDFBox project.

About The Apache Software Foundation
------------------------------------

Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit http://www.apache.org/


RE: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Posted by Francisco Garrido <fg...@pedagogiainteractiva.com>.
HOW TO UNSUBSCRIBE?!

Francesc Garrido
Àrea Tecnologia
Pedagogia Interactiva, S.L.
 
C/Marie Curie s/n 
Parc Tecnològic BCNord
08042 Barcelona
T: +34 93 253 91 94 ; F: +34 93 291 76 91 
www.pedagogiainteractiva.com
 
Advertència legal  /  Advertencia legal  /  Legal Notice


-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Tuesday, September 15, 2009 10:50 AM
To: pdfbox-dev@incubator.apache.org
Subject: Re: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Hi,

2009/9/15 Andreas Lehmkühler <an...@lehmi.de>:
> http://people.apache.org/~lehmi/pdfbox/pdfbox-0.8.0-incubating/

Are you using a new code signing key (ID DB880CA4), or did you
accidentally use a different key than your normal one (ID 1DFDBF44)?
The new key is not included in the KEYS file.

BR,

Jukka Zitting


Re: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

2009/9/15 Andreas Lehmkühler <an...@lehmi.de>:
> http://people.apache.org/~lehmi/pdfbox/pdfbox-0.8.0-incubating/

Are you using a new code signing key (ID DB880CA4), or did you
accidentally use a different key than your normal one (ID 1DFDBF44)?
The new key is not included in the KEYS file.

BR,

Jukka Zitting

RE: Extracting Images

Posted by "Martinez, Mel" <m....@ll.mit.edu>.
Unfortunately, I am not the creator of the PDF documents I need to extract from.  The images will come in whatever format they come in.

I see exactly what you describe - blues become pink and the palette is 'sort of reversed'.  But inverting the palette (in an image editor) doesn't quite fix it.
  
Unfortunately, this is also not my area of expertise so I'm struggling too.  I just don't have the luxury to choose which type of image format gets used.

-mel

-----Original Message-----
From: Adam@swmc.com [mailto:Adam@swmc.com] 
Sent: Tuesday, September 22, 2009 11:54 AM
To: pdfbox-dev@incubator.apache.org
Subject: RE: Extracting Images

I noticed that an image with an indexed palette (tested with BMP, PNG) 
did not look right after encrypting the PDF.  The colors were switched 
around.  I remember that blue became pink, but it wasn't a straight 
inverse.  Writing out the same PDF without encryption worked fine.  If RGB 
is used, it'll work fine whether encrypted or not (tested with PNG).

This doesn't seem to be the same thing you are describing, but it could be 
related.  I don't have the time nor expertise to look into that one so my 
solution was to use RGB images.

--Adam

-----Original Message-----
From: "Martinez, Mel" <m....@ll.mit.edu>
Sent: 09/22/2009 06:52
Reply-To: pdfbox-dev@incubator.apache.org
To: "pdfbox-dev@incubator.apache.org" <pd...@incubator.apache.org>
Subject: RE: Extracting Images

Thanks, Alex.

Unfortunately, that (ExtractImages) is the first place I looked when I 
started this.

It basically uses the first technique below (PDXObjectImage.write2file(String)).

The problem I described also happens with the ExtractImages class.  It 
also happens with PDFToImage, which converts each whole page to an image. 
Within each page image, the embedded photos all have their colors 
screwed up.

I've tried this with several PDF input files and it happens with every 
color photo image.

Line art (even if rasterized and embedded as jpeg) and B&W images are 
fine.

I think there is something wrong with how PDFBox is extracting the images.

Is no one else seeing this?

I'm on a Windows XP PRO (64bit) machine.

-mel

-----Original Message-----
From: Alex Shvartz [mailto:alshvartz@yahoo.com] 
Sent: Monday, September 21, 2009 7:20 PM
To: pdfbox-dev@incubator.apache.org
Subject: RE: Extracting Images

Hi,

Please have a look at the org.apache.pdfbox.ExtractImages class.
Its extractImages() method is a good illustration of how to extract an image 
from a PDF file and save it.

Best Regards.


Alex.

--- On Mon, 9/21/09, Martinez, Mel <m....@ll.mit.edu> wrote:

From: Martinez, Mel <m....@ll.mit.edu>
Subject: RE: Extracting Images
To: "pdfbox-dev@incubator.apache.org" <pd...@incubator.apache.org>
Date: Monday, September 21, 2009, 3:31 PM

Ugh!  I'm crying uncle!  I obviously need help (in more ways than one!).

If ANYBODY has some experience with extracting jpeg images from PDF files 
using PDFBox, I'd appreciate a few pointers.

I've started with the basics (lotsa null checks & junk removed):

    PDPage page = ....
    PDResources resources = page.getResources();
    Map<String, PDXObjectImage> images = resources.getImages();

So far, so good.  Some null & empty tests then
    ...
    PDXObjectImage image = images.get(key);

At this point, I've tried several things.  I've tried just letting the 
image class write itself out:

    image.write2file(fname); // where fname does not include the suffix

I've also tried rebuilding the image object from pieces like so:

    // decode the image and collect the pieces needed to rebuild it
    BufferedImage bi = image.getRGBImage();
    int bpc = image.getBitsPerComponent();
    PDColorSpace cspace = image.getColorSpace();
    ...
    WritableRaster srcRaster = bi.getRaster();
    ...
    // build a raster that matches the color model derived from the PDF color space
    ColorModel cm = cspace.createColorModel(bpc);
    int h = image.getHeight();
    int w = image.getWidth();
    WritableRaster raster = cm.createCompatibleWritableRaster(w, h);
    raster.setRect(srcRaster);
    // wrap it in a new image and write it out in the requested format
    bi = new BufferedImage(cm, raster, false, null);
    ImageIO.write(bi, format, new File(fname + "." + format));

This second method has the advantage of allowing you to write out to a 
different format, though some conversions crash it or look like garbage.

In general, both methods 'work' in that they extract the image and write 
it out to a file that can then be opened and displayed with any image 
viewer (or a web browser).  The problem is, the colors in the resulting 
image are simply off.  Way off.

JPEG & BMP color photo images look about the same, though the color 
palettes are sometimes off in different ways.
JPEG & BMP Black & white images and line art (even color) generally look 
fine.
TIFF images and PNG images look completely messed up, often turning into 
black rectangles or random color bands.  They also tend to crash the 
second code snippet.
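
One way to narrow down symptoms like these might be to check which color model a decoded image actually carries before it is written out. The sketch below uses only JDK classes; ColorModelReport and describe are made-up names, and it makes no claim about where inside PDFBox the colors go wrong.

```java
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.IndexColorModel;
import java.io.File;
import javax.imageio.ImageIO;

// Hypothetical diagnostic helper, not part of PDFBox.
class ColorModelReport {

    // Translate the ColorSpace type constant into a readable name.
    static String colorSpaceName(int type) {
        switch (type) {
            case ColorSpace.TYPE_RGB:  return "RGB";
            case ColorSpace.TYPE_GRAY: return "GRAY";
            case ColorSpace.TYPE_CMYK: return "CMYK";
            default:                   return "other(" + type + ")";
        }
    }

    // Summarize whether the image is palette based and which color space it uses.
    static String describe(BufferedImage img) {
        ColorModel cm = img.getColorModel();
        String kind = (cm instanceof IndexColorModel)
                ? "indexed, palette size " + ((IndexColorModel) cm).getMapSize()
                : "non-indexed";
        return kind
                + ", components=" + cm.getNumComponents()
                + ", bitsPerPixel=" + cm.getPixelSize()
                + ", colorSpace=" + colorSpaceName(cm.getColorSpace().getType());
    }

    public static void main(String[] args) throws Exception {
        // Assumed invocation: java ColorModelReport <extracted image file>
        BufferedImage img = ImageIO.read(new File(args[0]));
        System.out.println(img == null ? "no ImageIO reader for this file" : describe(img));
    }
}
```

Calling describe() on the BufferedImage before it is written, and again on the file read back with ImageIO, would show whether the palette or color space changes along the way.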

Does anybody have a clue about this stuff?

Thanks in advance,

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu


-----Original Message-----
From: Daniel Wilson [mailto:williamstonconsulting@gmail.com] 
Sent: Tuesday, September 15, 2009 7:33 PM
To: pdfbox-dev@incubator.apache.org
Subject: Re: Extracting Images

I've done battle with the PDXObjectImage, but it has usually defeated me!
Sections 4.7 and 4.8 of the PDF spec address it.

Daniel

On Tue, Sep 15, 2009 at 6:01 PM, Martinez, Mel <m....@ll.mit.edu> wrote:

> I've been playing with extracting images.
>
> I've found a few 'weirdnesses' (I know, that's not a real word) in the
> org.apache.pdfbox.ExtractImages class and if I can clear some time, I'll try
> to submit something on that.
>
> Ignoring the 'weirdnesses' (which have more to do with options parsing and
> file naming), it does successfully extract images to separate files.
>
> However, the color table is apparently not being handled properly.
>
> All the images end up displaying with the default Windows palette, which
> tells me that they probably are missing their own.
>
> I assume that what probably needs to be done is that the color space needs
> to be rebuilt and reset on each image object prior to writing the image out
> to file, but I'm not entirely certain how to proceed with that.
>
> Does anybody have any familiarity with the PDXObjectImage and its related
> APIs?
>
> If someone can point me in the right direction, I don't mind doing the work
> of fixing this.
>
> Mel
>
> Dr. Mel Martinez
> m.martinez@ll.mit.edu



This email and any content within or attached hereto from  Sun West Mortgage Company, Inc.  is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or the taking of any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege. Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call  (800) 453 7884.   

RE: Extracting Images

Posted by Ad...@swmc.com.
I noticed that an image with an indexed palette (tested with BMP, PNG) 
did not look right after encrypting the PDF.  The colors were switched 
around.  I remember that blue became pink, but it wasn't a straight 
inverse.  Writing out the same PDF without encryption worked fine.  If RGB 
is used, it'll work fine whether encrypted or not (tested with PNG).

This doesn't seem to be the same thing you are describing, but it could be 
related.  I don't have the time nor expertise to look into that one so my 
solution was to use RGB images.
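
For anyone wanting to apply the same workaround in code rather than in an image editor, here is a minimal sketch that converts an indexed-palette BufferedImage to plain RGB using only JDK classes. ToRgb and toRgb are made-up names, and how the converted image is then embedded in the PDF is not shown.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox.
class ToRgb {

    // Return an opaque TYPE_INT_RGB copy of the image. Each pixel is resolved
    // through the source color model (for example a palette lookup), so the
    // copy no longer depends on an index table. Any alpha channel is dropped.
    static BufferedImage toRgb(BufferedImage src) {
        if (src.getType() == BufferedImage.TYPE_INT_RGB) {
            return src; // already in the target format
        }
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        int[] argb = src.getRGB(0, 0, w, h, null, 0, w);
        dst.setRGB(0, 0, w, h, argb, 0, w);
        return dst;
    }
}
```

This copies the whole image through an int array, which is simple but memory hungry for very large images.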

--Adam

RE: Extracting Images

Posted by "Martinez, Mel" <m....@ll.mit.edu>.
Thanks, Alex.

Unfortunately, that (ExtractImages) is the first place I looked when I started this.

It basically uses the first technique below (PDXObjectImage.write2file(String)).

The problem I described also happens with the ExtractImages class.  It also happens with PDFToImage, which converts each whole page to an image.  Within each page image, the embedded photos all have their colors screwed up.

I've tried this with several PDF input files and it happens with every color photo image.

Line art (even if rasterized and embedded as jpeg) and B&W images are fine.

I think there is something wrong with how PDFBox is extracting the images.

Is no one else seeing this?

I'm on a Windows XP PRO (64bit) machine.

-mel


RE: Extracting Images

Posted by Alex Shvartz <al...@yahoo.com>.
Hi,

Please have a look at the org.apache.pdfbox.ExtractImages class.
Its extractImages() method is a good illustration of how to extract an image from a PDF file and save it.

Best Regards.


Alex.


RE: Extracting Images

Posted by "Martinez, Mel" <m....@ll.mit.edu>.
Ugh!  I'm crying uncle!  I obviously need help (in more ways than one!).

If ANYBODY has some experience with extracting jpeg images from PDF files using PDFBox, I'd appreciate a few pointers.

I've started with the basics (lotsa null checks & junk removed):

    PDPage page = ....
    PDResources resources = page.getResources();
    Map<String, PDXObjectImage> images = resources.getImages();

So far, so good.  Some null & empty tests then
    ...
    PDXObjectImage image = images.get(key);

At this point, I've tried several things.  I've tried just letting the image class write itself out:

    image.write2file(fname); // fname does not include the suffix

I've also tried rebuilding the image object from pieces like so:

    BufferedImage bi = image.getRGBImage();
    int bpc = image.getBitsPerComponent();
    PDColorSpace cspace = image.getColorSpace();
    ...
    WritableRaster srcRaster = bi.getRaster();
    ...
    ColorModel cm = cspace.createColorModel(bpc);
    int h = image.getHeight();
    int w = image.getWidth();
    WritableRaster raster = cm.createCompatibleWritableRaster(w,h);
    raster.setRect(srcRaster);
    bi = new BufferedImage(cm,raster,false,null);
    ImageIO.write(bi,format,new File(fname+"."+format));

This second method has the advantage of allowing you to write out to a different format, though some conversions crash it or look like garbage.
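
One possible reason for the garbled colors in the second approach (an assumption, not something confirmed in this thread): WritableRaster.setRect copies raw sample values and does no color conversion, so samples from an RGB raster can land in a raster laid out for a different ColorModel. A minimal sketch of an alternative that uses only the JDK is to draw the image into a fresh TYPE_INT_RGB BufferedImage and let Java 2D convert the colors. The class and method names (RgbNormalizer, toRgb, write) are made up for illustration and are not part of PDFBox.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class RgbNormalizer {

    /** Returns a TYPE_INT_RGB copy of src; drawImage performs the color conversion. */
    public static BufferedImage toRgb(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    /** Writes the RGB copy; ImageIO.write returns false when no writer accepts it. */
    public static boolean write(BufferedImage img, String format, File out)
            throws IOException {
        return ImageIO.write(toRgb(img), format, out);
    }
}
```

With the snippets above, this could be called as RgbNormalizer.write(image.getRGBImage(), format, new File(fname + "." + format)). The boolean result is worth checking, since a false return means nothing usable was written.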

In general, both methods 'work' in that they extract the image and write it out to a file that can then be opened and displayed with any image viewer (or a web browser).  The problem is, the colors in the resulting image are simply off.  Way off.

JPEG and BMP color photos look about the same, though their color palettes are sometimes off in different ways.
JPEG and BMP black-and-white images and line art (even in color) generally look fine.
TIFF and PNG images look completely messed up, often turning into black rectangles or random color bands.  They also tend to crash the second approach.
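
Part of the format-dependent behavior may come from the image writers installed in the JRE rather than from PDFBox (an assumption, not verified here): ImageIO.write returns false, without throwing, when it finds no writer for the requested format. A small JDK-only check, with a made-up class name WriterCheck, shows which formats the running JRE can write at all:

```java
import javax.imageio.ImageIO;

public class WriterCheck {

    /** True if at least one registered ImageIO writer claims the given format name. */
    public static boolean canWrite(String format) {
        return ImageIO.getImageWritersByFormatName(format).hasNext();
    }

    public static void main(String[] args) {
        // Report writer availability for the formats mentioned above.
        for (String format : new String[] {"jpg", "bmp", "png", "tiff"}) {
            System.out.println(format + ": " + canWrite(format));
        }
    }
}
```

The output depends on the JRE and on any installed ImageIO plug-ins, so no expected result is given.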

Does anybody have a clue about this stuff?

Thanks in advance,

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu



Re: Extracting Images

Posted by Daniel Wilson <wi...@gmail.com>.
I've done battle with the PDXObjectImage, but it has usually defeated me!
Sections 4.7 and 4.8 of the PDF spec address it.

Daniel


Extracting Images

Posted by "Martinez, Mel" <m....@ll.mit.edu>.
I've been playing with extracting images.

I've found a few 'weirdnesses' (I know, that's not a real word) in the org.apache.pdfbox.ExtractImages class, and if I can clear some time, I'll try to submit something on that.

Ignoring the 'weirdnesses' (which have more to do with option parsing and file naming), it does successfully extract images to separate files.

However, the color table is apparently not being handled properly.

All the images end up displaying with the default Windows palette, which tells me that they probably are missing their own.

I assume that what probably needs to be done is that the color space needs to be rebuilt and reset on each image object prior to writing the image out to file, but I'm not entirely certain how to proceed with that.
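
To illustrate why a missing or replaced color table changes every color: in an indexed image the raster stores only palette indices, and the ColorModel decides what each index means. A JDK-only sketch follows (PaletteDemo and indexed are made-up names, and whether PDFBox actually drops the palette in this case is not established in this thread):

```java
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;

public class PaletteDemo {

    /** Builds a 2x1 indexed image whose pixels hold palette indices 0 and 1. */
    public static BufferedImage indexed(byte[] r, byte[] g, byte[] b) {
        IndexColorModel palette = new IndexColorModel(8, r.length, r, g, b);
        BufferedImage img = new BufferedImage(
                2, 1, BufferedImage.TYPE_BYTE_INDEXED, palette);
        img.getRaster().setSample(0, 0, 0, 0); // pixel (0,0) stores index 0
        img.getRaster().setSample(1, 0, 0, 1); // pixel (1,0) stores index 1
        return img;
    }

    public static void main(String[] args) {
        // Identical raster samples, two different palettes.
        BufferedImage a = indexed(new byte[] {(byte) 255, 0},
                new byte[] {0, 0}, new byte[] {0, (byte) 255});   // red, blue
        BufferedImage b = indexed(new byte[] {0, (byte) 255},
                new byte[] {(byte) 255, (byte) 255}, new byte[] {0, 0}); // green, yellow
        System.out.printf("%06X vs %06X%n",
                a.getRGB(0, 0) & 0xFFFFFF, b.getRGB(0, 0) & 0xFFFFFF);
    }
}
```

The same stored samples resolve to different colors under each palette, which is why an image written without its own color table would display with whatever default palette the viewer supplies.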

Does anybody have any familiarity with the PDXObjectImage and its related APIs?

If someone can point me in the right direction, I don't mind doing the work of fixing this.

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu




Re: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Posted by Daniel Wilson <wi...@gmail.com>.
+1
Daniel


Re: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Posted by Niall Pemberton <ni...@gmail.com>.
[X] +1 Release this package as Apache PDFBox 0.8.0-incubating

Niall


[RESULT][VOTE] Release PDFBox 0.8.0-incubating

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi,

> Please vote on releasing this package as Apache PDFBox
> 0.8.0-incubating. The vote is open for the next 72 hours and passes if
> a majority of at least three +1 PDFBox PPMC votes is reached. Assuming
> the vote passes, I will ask the Incubator PMC to approve the release.
> 
> [ ] +1 Release this package as Apache PDFBox 0.8.0-incubating
> [ ] -1 Do not release this package because...

The vote passes as follows:

+1 Andreas Lehmkühler
+1 Jukka Zitting
+1 Niall Pemberton
+1 Daniel Wilson
+1 Mel Martinez (non-binding vote)
+1 Phillip Koch
+1 Jeremias Maerki

Thanks to all for your patience and for your help reviewing this release.

I'll ask the IPMC to approve the release.

BR
Andreas Lehmkühler




Re: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
[X] +1 Release this package as Apache PDFBox 0.8.0-incubating

On 15.09.2009 07:40:11 Andreas Lehmkühler wrote:
> Hi,
> 
> I have posted a candidate for the first Apache release of PDFBox
> developed within the PDFBox podling. The candidate can be found at
> 
> http://people.apache.org/~lehmi/pdfbox/pdfbox-0.8.0-incubating/
> 
> See the RELEASE-NOTES.txt file (also included at the end of this
> message) for details on release contents. The release candidate is a
> jar archive of the sources in
> 
> http://svn.apache.org/repos/asf/incubator/pdfbox/tags/0.8.0-incubating.
> 
> The MD5 checksum of the pdfbox-0.8.0-incubating-src.jar release package
> is 1E 2B 55 FC 8C 9D 7C 31  16 AE 37 91 42 30 F5 39.
> 
> Please vote on releasing this package as Apache PDFBox
> 0.8.0-incubating. The vote is open for the next 72 hours and passes if
> a majority of at least three +1 PDFBox PPMC votes is reached. Assuming
> the vote passes, I will ask the Incubator PMC to approve the release.
> 
> [X] +1 Release this package as Apache PDFBox 0.8.0-incubating
> [ ] -1 Do not release this package because...
> 
> With the source release I have also included a pre-compiled jar file.
> The Maven POM file from the source release is also included so that we
> can deploy the released jar to the central Maven repository if the
> release vote passes.
> In addition I have also included a pre-compiled jar file for a standalone
> version including all needed external libs to run PDFBox.
> 
> Changelog for my 3. attempt:
> - adding the README.txt and the RELEASE-NOTES.txt to the standalone jar
> - including all files from svn to the src jar
> - including the changes from the weekend (logging, colorspace caching,
> font handling)
> - updating the release notes, both txt and web version
> 
> Here's my +1.
> 
> BR
> Andreas Lehmkühler
> 
<snip/>


Jeremias Maerki


Re: [VOTE] Release PDFBox 0.8.0-incubating (3. attempt)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

    [x] +1 Release this package as Apache PDFBox 0.8.0-incubating

I found the DB880CA4 key on the public key servers, but it would still
be good to either have it added to the KEYS file or have new
signatures with the 1DFDBF44 key. For completeness, see below for my
PGP signatures on the release artifacts I reviewed. If you like, you
can add these signatures also to the .asc files.

BR,

Jukka Zitting

pdfbox-0.8.0-incubating-src.jar
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEABECAAYFAkqvYucACgkQpzBSnKNVpj5A+ACfdJatP+dJsmgOTYZF2TK/kYY1
IogAoIMRVC/GxML3FnRInZrNnu5yJ3D2
=qGUZ
-----END PGP SIGNATURE-----

pdfbox-0.8.0-incubating.jar
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEABECAAYFAkqvYuYACgkQpzBSnKNVpj6WrACeOPGFPUSXc5QNPYdGlzTAIELa
+qkAn0+j7pjkpBqm0l9FPh1U39ISnbse
=Ynjn
-----END PGP SIGNATURE-----

pdfbox-0.8.0-incubating-standalone.jar
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEABECAAYFAkqvYucACgkQpzBSnKNVpj66PgCeKGJecIfnrF6uAWgBSHZHbWlo
VhkAnRbnwLZCVxrN1WbP1Sis2epkghL6
=VsNH
-----END PGP SIGNATURE-----

pom.xml
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEABECAAYFAkqvYucACgkQpzBSnKNVpj5dEwCeIeM73AwnxcYlmEoYNnB+9aaZ
IDgAn11kJC0sEm+EsS6t+qrF3HRRw83x
=/0hL
-----END PGP SIGNATURE-----
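
For reviewers who also want to check the MD5 checksum quoted in the vote
mail, the following is a minimal Java sketch and not part of the release.
The class name Md5Check and its helper methods are hypothetical; only
java.security.MessageDigest and java.nio.file from the JDK are used. It
prints the digest in the same spaced, upper-case hex layout as the vote
mail and compares it with an expected value while ignoring whitespace and
case. Note that a matching MD5 only shows the download is intact; the PGP
signatures are what tie the artifacts to a signer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
class Md5Check {

    // Formats a digest as upper-case hex pairs separated by spaces,
    // with a double space after the 8th byte (the layout used in the vote mail).
    static String formatDigest(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < digest.length; i++) {
            if (i > 0) {
                sb.append(i == 8 ? "  " : " ");
            }
            sb.append(String.format("%02X", digest[i] & 0xFF));
        }
        return sb.toString();
    }

    // Strips all whitespace and upper-cases, so differently spaced
    // checksums can be compared.
    static String normalize(String checksum) {
        return checksum.replaceAll("\\s+", "").toUpperCase();
    }

    // Streams the file through MD5 so large jars need not fit in memory.
    static byte[] md5Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java Md5Check <file> <expected checksum>");
            return;
        }
        String actual = formatDigest(md5Of(Paths.get(args[0])));
        boolean matches = normalize(actual).equals(normalize(args[1]));
        System.out.println(actual);
        System.out.println(matches ? "checksum matches" : "checksum MISMATCH");
    }
}
```

It would be run against the downloaded pdfbox-0.8.0-incubating-src.jar,
passing the checksum from the vote mail as the second argument (quoted, so
the shell keeps it as one argument).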