Posted to users@pdfbox.apache.org by Ian Smith <Ia...@gossinteractive.com> on 2010/03/11 10:57:40 UTC

Odd characters in extracted text

Hi Folks,

I have linked to a PDF (~2MB) that produces unprintable characters in
the extracted text output.  These characters seem to be associated with
the first two pages of the document.

http://www.yourphp.org.uk/media/pdf/g/4/Annual_Report_0809.pdf

I believe the problem is caused by at least one of the embedded fonts in
the document; my debugging has shown that the strange characters are
associated with Identity-H encoding and/or Type 1 (CID) fonts and (only
perhaps) also the Mistral Font (KWTOGC+Mistral?).  Fonts that display
correctly seem to be associated with the WinAnsi encoding.

I have not been able to debug further owing to the large number of
deeply nested PDF objects (I don't really know anything about PDF!).
Hope this is the right place to report this, if not then please let me
know.

Regards,

Ian Smith.



Free User Group in Bristol on 11th March. More info here www.gossinteractive.com/usergroupmar10 

Web design and Content Management. www.twitter.com/gossinteractive 
Registered Office: c/o Bishop Fleming, Cobourg House, Mayflower Street, Plymouth, PL1 1LG.  Company Registration No: 3553908 

This email contains proprietary information, some or all of which may be legally privileged. It is for the intended recipient only. If an addressing or transmission error has misdirected this email, please notify the author by replying to this email. If you are not the intended recipient you may not use, disclose, distribute, copy, print or rely on this email. 

Email transmission cannot be guaranteed to be secure or error free, as information may be intercepted, corrupted, lost, destroyed, arrive late or incomplete or contain viruses. This email and any files attached to it have been checked with virus detection software before transmission. You should nonetheless carry out your own virus check before opening any attachment. GOSS Interactive Ltd accepts no liability for any loss or damage that may be caused by software viruses.



RE: Odd characters in extracted text

Posted by Ian Smith <Ia...@gossinteractive.com>.
From: Andreas Lehmkuehler [mailto:andreas@lehmi.de] 
Sent: 11 March 2010 20:00
To: users@pdfbox.apache.org
Subject: Re: Odd characters in extracted text

The current trunk (version 921494) contains an improvement for
Identity-H encoded text. I've extracted the text with the latest version
from your pdf and got the following result:

Poole Housing Partnership Ltd
Annual Report 08 09
???????????????????????????????????
20889 PHP R&A V1pk.indd   1 14/1/10   10:43:40
Contents
1. Welcome
2. The year in pictures and numbers
6. How we spend your money
8. Financial inclusions
10. Residents' involvement
12. Tenancy support & disabled adaptations
14. Improving services
16. Customer insight
18. The environment
20. Leaseholders
???????????????????????????????????
20889 PHP R&A V1pk.indd   2 14/1/10   10:43:40
Welcome
They say time flies when you're busy, and the past year seems to have
flown by.
In June I had the pleasure, along with the Council portfolio holder of
being presented with an award for the services provided by PHP being
rated as excellent by the public sector watchdog, the Audit Commission.


Did you get the same or is it an improvement compared to your output?

		-----------------------

Very similar, except that the first set of bad characters (which
correspond to your ????s) has moved from the start of the output (in my
case) to just before the first footer (in your case).  I have confirmed
that my output has not changed using HEAD (rev 922281), so I need to
test with the precise revision you used to see whether, and why, the
position differs in my output.

I will try this on Monday and report back here.  Thank you for replying.

Regards, Ian.









RE: Odd characters in extracted text

Posted by Ian Smith <Ia...@gossinteractive.com>.
The current trunk (version 921494) contains an improvement for Identity-H encoded text. I've extracted the text with the latest version from your pdf and got the following result:

Poole Housing Partnership Ltd
Annual Report 08 09
???????????????????????????????????
20889 PHP R&A V1pk.indd   1 14/1/10   10:43:40
Contents
1. Welcome
2. The year in pictures and numbers
6. How we spend your money
8. Financial inclusions
10. Residents' involvement
12. Tenancy support & disabled adaptations
14. Improving services
16. Customer insight
18. The environment
20. Leaseholders
???????????????????????????????????
20889 PHP R&A V1pk.indd   2 14/1/10   10:43:40
Welcome
They say time flies when you're busy, and the past year seems to have flown by.
In June I had the pleasure, along with the Council portfolio holder of being presented with an award for the services provided by PHP being rated as excellent by the public sector watchdog, the Audit Commission.


Did you get the same or is it an improvement compared to your output?

BR
Andreas Lehmkühler

--------------------------------------------------

Hi Andreas,

Sorry about the delay in responding.  I have confirmed, using the same revision, that my output is subtly different from yours: my first set of strange characters is right at the beginning rather than just before the first footer.  This is the case with 1.0, rev 921494 and rev 922165.  However, the characters are still there.  Do you have any idea whether they represent missing information or whether they are just extra artifacts?

???????????????????????????????????
Poole Housing Partnership Ltd
Annual Report 08 09
20889 PHP R&A V1pk.indd   1 14/1/10   10:43:40
Contents
1. Welcome
2. The year in pictures and numbers
6. How we spend your money
8. Financial inclusions
10. Residents' involvement
12. Tenancy support & disabled adaptations
14. Improving services
16. Customer insight
18. The environment
20. Leaseholders
???????????????????????????????????
20889 PHP R&A V1pk.indd   2 14/1/10   10:43:40
Welcome

Etc. . . .
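
To compare where the odd characters land in two extraction runs, something like the following could report the line numbers that contain them. This is only a minimal sketch in plain Java and not part of PDFBox: the class name PlaceholderRuns and its method are hypothetical, and it assumes the odd characters appear in the output as runs of a single marker character such as '?'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class PlaceholderRuns {

    // Returns the 1-based numbers of the lines that contain at least
    // minRun consecutive occurrences of marker.
    static List<Integer> linesWithRun(String text, char marker, int minRun) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = text.split("\r?\n", -1);
        for (int n = 0; n < lines.length; n++) {
            int run = 0;
            for (int i = 0; i < lines[n].length(); i++) {
                // Extend the current run, or reset it on any other character.
                run = (lines[n].charAt(i) == marker) ? run + 1 : 0;
                if (run >= minRun) {
                    hits.add(n + 1);
                    break; // one hit per line is enough
                }
            }
        }
        return hits;
    }
}
```

Running this over the output of each revision and comparing the two lists would show whether the run has moved or merely stayed put.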

Regards, Ian.






Re: Odd characters in extracted text

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

Ian Smith wrote:
> Hi Folks,
> 
> I have linked to a PDF (~2MB) that produces unprintable characters in
> the extracted text output.  These characters seem to be associated with
> the first two pages of the document.
> 
> http://www.yourphp.org.uk/media/pdf/g/4/Annual_Report_0809.pdf
> 
What do you mean by unprintable?
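
One way to pin this down would be to list the Unicode code points of the suspicious characters in the extracted text. The following is only a minimal sketch in plain Java and not part of PDFBox: the class name CodePointDump and its methods are hypothetical, and the definition of "suspicious" (control, private-use, surrogate and unassigned code points, plus U+FFFD) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a PDFBox class.
final class CodePointDump {

    // Assumption: these categories are what "unprintable" means here.
    static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // ordinary whitespace in extracted text
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        switch (Character.getType(cp)) {
            case Character.CONTROL:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    // Returns one entry per suspicious code point, with its char offset.
    static List<String> describeSuspicious(String text) {
        List<String> report = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isSuspicious(cp)) {
                report.add(String.format(Locale.ROOT, "offset %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return report;
    }
}
```

Passing the extracted text to describeSuspicious and printing each entry would show whether the odd characters are control codes, private-use values or something else.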

> I believe the problem is caused by at least one of the embedded fonts in
> the document; my debugging has shown that the strange characters are
> associated with Identity-H encoding and/or Type 1 (CID) fonts and (only
> perhaps) also the Mistral Font (KWTOGC+Mistral?).  Fonts that display
> correctly seem to be associated with the WinAnsi encoding.
The current trunk (version 921494) contains an improvement for Identity-H
encoded text. I've extracted the text with the latest version from your pdf
and got the following result:

Poole Housing Partnership Ltd
Annual Report 08 09
???????????????????????????????????
20889 PHP R&A V1pk.indd   1 14/1/10   10:43:40
Contents
1. Welcome
2. The year in pictures and numbers
6. How we spend your money
8. Financial inclusions
10. Residents’ involvement
12. Tenancy support & disabled adaptations
14. Improving services
16. Customer insight
18. The environment
20. Leaseholders
???????????????????????????????????
20889 PHP R&A V1pk.indd   2 14/1/10   10:43:40
Welcome
They say time flies when you’re busy, and the past year seems to have flown by.
In June I had the pleasure, along with the Council portfolio
holder of being presented with an award for the services
provided by PHP being rated as excellent by the public
sector watchdog, the Audit Commission.


Did you get the same or is it an improvement compared to your output?

BR
Andreas Lehmkühler