Posted to users@pdfbox.apache.org by Daniel Sánchez González <ds...@gmail.com> on 2011/06/23 13:54:30 UTC

Text extraction results in strange characters

When I try to convert a PDF to text, the result contains strange
characters. If I copy some text from the PDF file and paste it into a text
editor, I get the same result.

What is wrong?

Thanks in advance.

Dani

Re: Text extraction results in strange characters

Posted by Ad...@swmc.com.
I've never done any OCR stuff, so I have no idea.  However, I'd just like 
to mention that one problem I foresee is that you won't know which 
way the text is facing.  For example, is it portrait or landscape?  Is it 
rotated by 90, 180, or 270 degrees?  I'm not sure how to solve this.  One 
solution would be to do the OCR 4 times (once at each rotation 0, 90, 180, 
270) and just take the "best" result (which would probably mean the 
largest amount of text).  This would use 4 times the CPU time, but I'm not 
sure what your requirements are.  Maybe it's very important to go fast, or 
maybe you don't care how long it takes as long as the results are the best 
possible.
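
A minimal Java sketch of the four-rotation idea above, for illustration only. 
The names RotationOcr, rotate, score and bestOfFourRotations are made up for 
this example, and the `ocr` parameter is a placeholder for whatever OCR engine 
ends up being used; no real OCR library is called. "Best" is taken to mean the 
most letters and digits, which is only one possible reading of "the largest 
amount of text".

```java
import java.awt.image.BufferedImage;
import java.util.function.Function;

// Hypothetical helper: tries an OCR function on all four right-angle rotations.
final class RotationOcr {

    // Rotates an image clockwise by quarterTurns * 90 degrees.
    static BufferedImage rotate(BufferedImage src, int quarterTurns) {
        int turns = ((quarterTurns % 4) + 4) % 4;
        int w = src.getWidth();
        int h = src.getHeight();
        boolean swap = (turns % 2) == 1; // odd turns swap width and height
        BufferedImage dst = new BufferedImage(
                swap ? h : w, swap ? w : h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int nx;
                int ny;
                switch (turns) {
                    case 1:  nx = h - 1 - y; ny = x;         break; // 90 clockwise
                    case 2:  nx = w - 1 - x; ny = h - 1 - y; break; // 180
                    case 3:  nx = y;         ny = w - 1 - x; break; // 270 clockwise
                    default: nx = x;         ny = y;                // unrotated
                }
                dst.setRGB(nx, ny, src.getRGB(x, y));
            }
        }
        return dst;
    }

    // Assumed quality measure: the number of letters and digits recognized.
    static int score(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Runs the supplied OCR function once per rotation and keeps the
    // highest-scoring text. This costs four OCR passes per page.
    static String bestOfFourRotations(BufferedImage page,
                                      Function<BufferedImage, String> ocr) {
        String best = "";
        int bestScore = -1;
        for (int turns = 0; turns < 4; turns++) {
            String text = ocr.apply(rotate(page, turns));
            if (text == null) {
                text = "";
            }
            int s = score(text);
            if (s > bestScore) {
                bestScore = s;
                best = text;
            }
        }
        return best;
    }
}
```

A character count is a crude measure, since an OCR engine may also return a 
lot of junk for a sideways page. If the engine reports a confidence value, 
that would probably be a better thing to compare.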

I've also heard that reducing the colors can help.  For example, instead 
of having greyscale, convert it to use 1-bit pixels (either black or 
white).  This will make sure all the edges are sharp and most OCR 
algorithms will work better that way.  Of course, this could backfire 
severely if the text is a light shade of grey (as the entire image would 
be converted to white), if the text is in a light color (yellow, light 
blue, etc.), or if the background is a dark color (green text on a black 
background, for instance).  Again here you could do analysis on the image 
to try to detect the right filters to run on the image (invert colors so 
you have dark text on a light background, color saturation, contrast, 
etc.) and you could run the same image through OCR with multiple different 
filters and take the best result.  It's just a matter of how creative you 
want to get, how much CPU power you have to work with, how much 
development time you have, and how important it is that the results are as 
close to perfect as possible.
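
As a rough illustration of the 1-bit conversion and the "invert if the 
background is dark" idea, here is a small Java sketch using only 
java.awt.image.BufferedImage. The class and method names (OcrPreprocess, luma, 
meanLuma, binarize) and the cut-off of 128 for "mostly dark" are assumptions 
made for this example, not part of PDFBox or any OCR library.

```java
import java.awt.image.BufferedImage;

// Hypothetical preprocessing helper for images that will be fed to OCR.
final class OcrPreprocess {

    // Brightness of one RGB pixel from 0 (black) to 255 (white),
    // using the common 0.299 / 0.587 / 0.114 channel weights.
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    // Average brightness over the whole image.
    static double meanLuma(BufferedImage img) {
        long sum = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                sum += luma(img.getRGB(x, y));
            }
        }
        return (double) sum / ((long) img.getWidth() * img.getHeight());
    }

    // Converts to a 1-bit black/white image. Pixels whose brightness is at
    // or above the threshold become white, the rest black. When invertIfDark
    // is set and the image is mostly dark (mean brightness below 128, an
    // assumed cut-off), the result is inverted so that light text on a dark
    // background comes out as dark text on a light background.
    static BufferedImage binarize(BufferedImage src, int threshold,
                                  boolean invertIfDark) {
        boolean invert = invertIfDark && meanLuma(src) < 128;
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                boolean white = luma(src.getRGB(x, y)) >= threshold;
                if (invert) {
                    white = !white;
                }
                out.setRGB(x, y, white ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

With a fixed threshold of 128, light grey text on white disappears, which is 
the backfire case described above. Passing a threshold close to the image's 
own mean brightness may keep such text, but that is only one of several 
filters worth trying.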

But like I said, I've never actually done any OCR myself, so maybe the OCR 
libraries out there already take some/most/all of this into account. There 
might be someone else on this list who has experience and can provide some 
advice.  If not, check with the developers of OCR libs; I'm sure they'll 
have many good suggestions :-)

---- 
Thanks,
Adam






- FHA 203b; 203k; HECM; VA; USDA; Conventional
- Warehouse Lines; FHA-Authorized Originators
- Lending and Servicing in over 45 States
www.swmc.com   -  www.simplehecmcalculator.com   Visit 
www.swmc.com/resources   for helpful links on Training, Webinars, Lender 
Alerts and Submitting Conditions
This email and any content within or attached hereto from Sun West Mortgage 
Company, Inc. is confidential and/or legally privileged. The information is 
intended only for the use of the individual or entity named on this email. 
If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution or taking any action in reliance on the 
contents of this email information is strictly prohibited, and that the 
documents should be returned to this office immediately by email. Receipt by 
anyone other than the intended recipient is not a waiver of any privilege. 
Please do not include your social security number, account number, or any 
other personal or financial information in the content of the email. Should 
you have any questions, please call (800) 453 7884. 





Re: Text extraction results in strange characters

Posted by Thomas Fischer <fi...@aon.at>.
Hi Daniel,

I don't know if it is working yet, but the EuDML project (www.eudml.eu) is working on a tool to do just that, see http://www.eudml.eu/first-year-demos.

All the best
Thomas


On 23.06.2011 at 18:46, Daniel Sánchez González wrote:

> Thank you very much for your explanation. I'll try to convert pdf to image and then to text via OCR. Which is the most accurate way to do this?


Re: Text extraction results in strange characters

Posted by Daniel Sánchez González <ds...@gmail.com>.
Thank you very much for your explanation. I'll try to convert the PDF to an 
image and then to text via OCR. What is the most accurate way to do this?






Re: Text extraction results in strange characters

Posted by Ad...@swmc.com.
Dani,
The type of font being used is probably embedded and mapped to images of 
the characters.  This works great for viewing the document, but if you 
don't have characters (ASCII or Unicode), you're not going to get 
reasonable results when copying and pasting.  If my theory is correct, 
you'll find that you will also be unable to copy & paste using Adobe 
Reader.  The only way to get the text out of a file like this would be to 
convert it to an image, and then try to use OCR (optical character 
recognition) to extract the text.  As you probably already know, OCR is 
not 100% accurate, but it'd be better than nothing.
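
Before falling back to OCR, it can be useful to check whether the extracted 
text is actually unusable. The Java sketch below is one possible heuristic, 
not something PDFBox provides: it measures the share of non-whitespace 
characters that are control characters, private-use or unassigned code points, 
unpaired surrogates, or the replacement character U+FFFD. The names 
ExtractionCheck, suspiciousRatio and looksGarbled are made up for this 
example, and the cut-off ratio is left to the caller.

```java
// Hypothetical check for text that came out of a PDF text extraction step.
final class ExtractionCheck {

    // Fraction of non-whitespace code points that are unlikely to appear
    // in real text. Returns 0.0 for empty or whitespace-only input.
    static double suspiciousRatio(String text) {
        int total = 0;
        int bad = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces, tabs and line breaks are not counted
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) {
                bad++;
            }
        }
        return total == 0 ? 0.0 : (double) bad / total;
    }

    // True when more than maxRatio of the text looks suspicious.
    static boolean looksGarbled(String text, double maxRatio) {
        return suspiciousRatio(text) > maxRatio;
    }
}
```

This will not catch every case. A font with a custom encoding can also 
produce ordinary-looking but wrong letters, and a character-class check like 
this one cannot detect that.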

Developers,
I suggest we add this to the FAQ on the website.  I've seen it come up a 
few times, and it's a very interesting explanation.

---- 
Thanks,
Adam


