Posted to users@pdfbox.apache.org by Daniel Earwicker <de...@fiscaltec.com.INVALID> on 2022/08/02 15:31:45 UTC

ExtractImages command test - images appear different

Hi, this project looks perfect for my needs - converting PDF pages into images for easy rendering elsewhere. This is very much my first try so apologies in advance if this is a stupid question, but in the docs at https://pdfbox.apache.org/2.0/commandline.html I can't see any options that might improve the output.

Here's a side-by-side comparison, with the ExtractImages output on the left and the PDF opened in Chrome on the right:

https://imgur.com/a/KgNAZQ2

The PDF is an example I got from: https://www.ets.org/Media/Tests/GRE/pdf/gre_research_validity_data.pdf

Just in case this is relevant, I ran it in a clean Debian container:

    docker run -it -v c:/Users/me:/external debian:bullseye-slim

    apt update
    apt install openjdk-17-jre -y
    apt install wget -y
    wget https://dlcdn.apache.org/pdfbox/2.0.26/pdfbox-app-2.0.26.jar

and then tested with:

    java -jar pdfbox-app-2.0.26.jar ExtractImages -prefix /external/extract-test /external/gre_research_validity_data.pdf

The screenshot is of the resulting extract-test-2.jpg file.

There's obviously some problem with the colours, and there's also a lot of extra stuff in the page margins that Chrome somehow knows it ought to hide. Is there any way to configure this extraction process so that the image looks the way Chrome displays it, and for this kind of accurate rendering to work for the majority of PDFs? (This is the first one I tried.) Thanks!
This email is from FISCAL Technologies Limited, a company registered in England and Wales with company number 4801836, whose registered office is at 448 Basingstoke Road, Reading, RG2 0LP, United Kingdom. This notice applies to this email and to any other email subsequently sent by anyone at FISCAL Technologies Limited and appearing in the same chain of email correspondence. References below to "this email" should be read accordingly. The contents of this email and any attachments (if any) are private and confidential. If you have received this message in error, please notify us immediately by returning it to the sender or call our switchboard on +44 (0) 845 680 1905 and remove it from your system, do not use, copy or disclose it. The opinions expressed within this communication are not necessarily those expressed by FISCAL Technologies Limited. Emails are not secure and may contain viruses and it is your responsibility to scan attachments (if any). The e-mail system of FISCAL Technologies Limited is subject to random monitoring. For information about how we use your personal data (including your rights) please see our privacy policy - https://www.fiscaltec.com/uk/general/privacy-policy/
Visit our website at www.fiscaltec.co.uk<http://www.fiscaltec.co.uk>

Re: ExtractImages command test - images appear different

Posted by Tilman Hausherr <TH...@t-online.de>.

On 04.08.2022 at 20:41, Tilman Hausherr wrote:
> It gets weirder: I get a wrong rendering when using the twelvemonkeys 
> library.

This is now fixed in

https://issues.apache.org/jira/browse/PDFBOX-5488

(but I still don't know whether that was your problem)

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: ExtractImages command test - images appear different

Posted by Tilman Hausherr <TH...@t-online.de>.

It gets weirder: I get a wrong rendering when using the twelvemonkeys 
library.


Tilman

Re: ExtractImages command test - images appear different

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

Could you upload the image that you got? Here's my first page:


Tilman

RE: ExtractImages command test - images appear different

Posted by Daniel Earwicker <de...@fiscaltec.com.INVALID>.

Cool, thanks - that has made it crop the pages as expected but the colour is still as before. I tried specifying each of the documented color depths with the -color option and none of them resulted in the expected/normal appearance of the test PDF. (Also how would I know which option to pass for a random PDF?)

Is there anything I can do to get an image automatically rendered the way it is intended to look, in the same way that Ghostscript can? (I just tried gs with the same test PDF, and it renders it essentially the same as the Chrome viewer.)
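
The exact gs command is not given in this thread. A typical Ghostscript invocation for this kind of comparison could look like the sketch below; the png16m device, the 150 dpi resolution and the output name are assumptions, not what was actually run.

```shell
# Sketch only: render every page of the test PDF to PNG with Ghostscript.
pdf=/external/gre_research_validity_data.pdf
out=/external/gs-test-%03d.png    # assumed name; %03d becomes the page number

render_with_gs() {
    # -o sets the output file and implies -dBATCH -dNOPAUSE;
    # png16m is the 24-bit RGB PNG device; -r150 renders at 150 dpi.
    gs -o "$out" -sDEVICE=png16m -r150 "$pdf"
}

# Only run when gs is installed and the PDF is present.
if command -v gs >/dev/null 2>&1 && [ -f "$pdf" ]; then
    render_with_gs
else
    echo "skipping: gs or $pdf not available"
fi
```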

Re: ExtractImages command test - images appear different

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.

Hi Daniel,

The command you are using extracts the images contained in the PDF; it
does not render the PDF pages into images.

Use PDFToImage instead: https://pdfbox.apache.org/2.0/commandline.html#pdftoimage
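
For the setup in the original post, a PDFToImage call might look like the sketch below. The -format, -dpi and -prefix options are listed for PDFToImage in the 2.0 command-line documentation; the 150 dpi value and the render-test prefix are assumptions chosen for illustration, and the guard is only there so the snippet does nothing when the jar or PDF is missing.

```shell
# Sketch only: render whole pages with PDFToImage instead of ExtractImages.
jar=pdfbox-app-2.0.26.jar
pdf=/external/gre_research_validity_data.pdf
prefix=/external/render-test      # assumed output prefix, not from the thread

render_pages() {
    # -format png : write PNG files
    # -dpi 150    : assumed resolution, adjust as needed
    # -prefix     : filename prefix for the generated page images
    java -jar "$jar" PDFToImage -format png -dpi 150 -prefix "$prefix" "$pdf"
}

# Only run when both inputs are present.
if [ -f "$jar" ] && [ -f "$pdf" ]; then
    render_pages
else
    echo "skipping: $jar or $pdf not found"
fi
```

Unlike ExtractImages, this produces one rendered image per page rather than one file per image embedded in the PDF.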

BR
Maruan

-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827
