
Posted to users@pdfbox.apache.org by Jason Lewis <ja...@dickson.st> on 2016/05/16 12:26:54 UTC

problem extracting text from PDFs created in Windows 10

Hi,

I'm having a problem using PDFBox to extract text from PDFs.

I have an application that prints to a PDF printer device in Windows.
The PDF printer device is actually cups-pdf on a Linux server.

Under Windows 7 I had the same problem extracting text from PDFs that
were generated in this way: PDFBox seemed unable to read them.
Eventually I solved this by turning off "Enable advanced printing
features" in the Windows printer driver settings. After that PDFBox was
able to extract the text perfectly.

In Windows 10, however, you can't turn this option off. From what I
gather, Windows 10 uses "type 4" printer drivers, and the option "Enable
advanced printing features" is ticked but greyed out, so you can't
untick it.

I have a test PDF that PDFBox can read fine, but if I print that PDF in
Windows to the CUPS PDF printer device, the resulting PDF is mangled in
some way that prevents PDFBox from parsing it.

Is there something I can do to make PDFBox understand the mangled PDF?

I've also noticed that I can't select text in the broken PDF. Maybe this
Windows driver somehow outlines all the text so it's no longer text but
vectors?

I'm using PDFBox like this:

java -jar pdfbox-app-2.0.1.jar ExtractText -encoding UTF-8 -console
-startPage 1 -endPage 1 test-pdf-broken.pdf


Link to working PDF:
https://www.dropbox.com/s/glcmhl7nkg8w45f/test-pdf-works.pdf?dl=0

Link to broken PDF:
https://www.dropbox.com/s/uriq36brougr4z1/test-pdf-broken.pdf?dl=0

Any suggestions on how I might fix this?

Thanks,

Jason
-- 
Jason Lewis
http://emacstragic.net

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
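
The ExtractText call quoted in the message above can also be run from a
small Java wrapper that reports when a page yields no text at all. This
is only a minimal sketch, not part of PDFBox: the class ExtractTextCheck
and its methods are hypothetical names, and it assumes that "java" is on
the PATH and that the pdfbox-app jar and the PDF are passed as arguments.
Blank output is only a hint that the page has no text layer (for
example, when the text has been turned into vectors or an image); it
does not prove it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the name and methods are illustrative only.
class ExtractTextCheck {

    // Builds the same ExtractText command line that is quoted in the message.
    static List<String> buildCommand(String jarPath, String pdfPath, int page) {
        String p = Integer.toString(page);
        return Arrays.asList(
                "java", "-jar", jarPath, "ExtractText",
                "-encoding", "UTF-8", "-console",
                "-startPage", p, "-endPage", p,
                pdfPath);
    }

    // True when the extracted output holds nothing but whitespace.
    static boolean isEffectivelyEmpty(String extracted) {
        return extracted == null || extracted.trim().isEmpty();
    }

    // Reads the whole stream and decodes it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: java ExtractTextCheck <pdfbox-app jar> <pdf file>");
            return;
        }
        Process process = new ProcessBuilder(buildCommand(args[0], args[1], 1))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass PDFBox messages through
                .start();
        String text = readAll(process.getInputStream());
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            System.out.println("ExtractText exited with code " + exitCode);
        } else if (isEffectivelyEmpty(text)) {
            System.out.println("No text on page 1; the page may contain only images or vectors.");
        } else {
            System.out.println(text);
        }
    }
}
```

Example use, with the file names from the message:
java ExtractTextCheck pdfbox-app-2.0.1.jar test-pdf-broken.pdf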

Re: problem extracting text from PDFs created in Windows 10

Posted by Jason Lewis <ja...@dickson.st>.

Thanks Tilman,

I managed to resolve this issue by installing the Xerox Global
Postscript Print Driver, which is a type 3 driver and allows "Enable
advanced printing features" to be unchecked.

I realise this was not a PDFBox issue, thanks for your help.

Jason

Tilman Hausherr wrote on 16/05/16 23:51:
> On 16.05.2016 at 15:23, Jason Lewis wrote:
>> Thanks for confirming this. That's what I suspected.
>>
>>> try this:
>>> http://techspeeder.com/2014/03/06/how-to-fix-printer-properties-that-are-grayed-out/
>>>
>>> http://www.networksteve.com/forum/topic.php/Administrator_cannot_change_printer_properties_on_%22Advanced%22_tab/?TopicId=57069&Posts=4
>>
>> Yes, for some reason, the Type 4 drivers show that option but it is
>> greyed out and cannot be unticked.
>
> Can you install a 3rd party PDF printer driver? e.g. PDFCreator or CIB
> PDF brewer?
>
> Tilman

-- 
Jason Lewis
http://emacstragic.net


Re: problem extracting text from PDFs created in Windows 10

Posted by Tilman Hausherr <TH...@t-online.de>.

On 16.05.2016 at 15:23, Jason Lewis wrote:
> Thanks for confirming this. That's what I suspected.
>
>> try this:
>> http://techspeeder.com/2014/03/06/how-to-fix-printer-properties-that-are-grayed-out/
>>
>> http://www.networksteve.com/forum/topic.php/Administrator_cannot_change_printer_properties_on_%22Advanced%22_tab/?TopicId=57069&Posts=4
>
> Yes, for some reason, the Type 4 drivers show that option but it is
> greyed out and cannot be unticked.


Can you install a 3rd party PDF printer driver? e.g. PDFCreator or CIB 
PDF brewer?

Tilman

Re: problem extracting text from PDFs created in Windows 10

Posted by Jason Lewis <ja...@dickson.st>.

Hi Tilman,

On 16/05/2016 10:54 PM, Tilman Hausherr wrote:

>> I have a test PDF that PDFBox can read fine, but if I print that PDF in
>> Windows to the CUPS PDF printer device, the resulting PDF is mangled in
>> some way that prevents PDFBox from parsing it.
> 
> Why would you do that? You already have a PDF. Or was it just to
> explain, i.e. you're really printing from that application of yours,
> with the same problem, but you don't want to show that output because it
> is confidential?

It's just to explain the problem. We have an application (that is
supplied to us as is and that I can't modify) that prints to a PDF
device in Windows as its way of making PDF reports.

Either way, I can extract text from my PDF or from a PDF generated by
the application on a version of Windows prior to Windows 10. I'd like to
do the same with PDFs generated under Windows 10.

I've been reading more on it, and it's definitely to do with the
Microsoft Type 4 drivers. I'll see if I can find a way to install a type
3 driver and test with that.

> 
>>
>> Is there something I can do to make PDFBox understand the mangled PDF?
> 
> No....
> 
>>
>> I've also noticed that I can't select text in the broken PDF. Maybe this
>> Windows driver somehow outlines all the text so it's no longer text but
>> vectors?
> 
> I had a look at the "printed" PDF with PDFDebugger. It has the text as a
> huge image, not as text.
> 

Thanks for confirming this. That's what I suspected.

> try this:
> http://techspeeder.com/2014/03/06/how-to-fix-printer-properties-that-are-grayed-out/
> 
> http://www.networksteve.com/forum/topic.php/Administrator_cannot_change_printer_properties_on_%22Advanced%22_tab/?TopicId=57069&Posts=4
> 
> 
Yes, for some reason, the Type 4 drivers show that option but it is
greyed out and cannot be unticked.

Thanks

Jason

-- 
Jason Lewis
http://emacstragic.net


Re: problem extracting text from PDFs created in Windows 10

Posted by Tilman Hausherr <TH...@t-online.de>.

On 16.05.2016 at 14:26, Jason Lewis wrote:
> Hi,
>
> I'm having a problem using PDFBox to extract text from PDFs.
>
> I have an application that prints to a PDF printer device in Windows.
> The PDF printer device is actually cups-pdf on a Linux server.
>
> Under Windows 7 I had the same problem extracting text from PDFs that
> were generated in this way: PDFBox seemed unable to read them.
> Eventually I solved this by turning off "Enable advanced printing
> features" in the Windows printer driver settings. After that PDFBox was
> able to extract the text perfectly.
>
> In Windows 10, however, you can't turn this option off. From what I
> gather, Windows 10 uses "type 4" printer drivers, and the option "Enable
> advanced printing features" is ticked but greyed out, so you can't
> untick it.
>
> I have a test PDF that PDFBox can read fine, but if I print that PDF in
> Windows to the CUPS PDF printer device, the resulting PDF is mangled in
> some way that prevents PDFBox from parsing it.

Why would you do that? You already have a PDF. Or was it just to 
explain, i.e. you're really printing from that application of yours, 
with the same problem, but you don't want to show that output because it 
is confidential?

>
> Is there something I can do to make PDFBox understand the mangled PDF?

No....

>
> I've also noticed that I can't select text in the broken PDF. Maybe this
> Windows driver somehow outlines all the text so it's no longer text but
> vectors?

I had a look at the "printed" PDF with PDFDebugger. It has the text as a
huge image, not as text.

try this:
http://techspeeder.com/2014/03/06/how-to-fix-printer-properties-that-are-grayed-out/
http://www.networksteve.com/forum/topic.php/Administrator_cannot_change_printer_properties_on_%22Advanced%22_tab/?TopicId=57069&Posts=4


Tilman

>
> I'm using PDFBox like this:
>
> java -jar pdfbox-app-2.0.1.jar ExtractText -encoding UTF-8 -console
> -startPage 1 -endPage 1 test-pdf-broken.pdf
>
>
> Link to working PDF:
> https://www.dropbox.com/s/glcmhl7nkg8w45f/test-pdf-works.pdf?dl=0
>
> link to broken PDF:
> https://www.dropbox.com/s/uriq36brougr4z1/test-pdf-broken.pdf?dl=0
>
> Any suggestions on how I might fix this?
>
> Thanks,
>
> Jason


