Posted to users@pdfbox.apache.org by Daniel Sánchez González <ds...@gmail.com> on 2011/06/23 13:54:30 UTC
Text extraction results in strange characters
When I try to convert a PDF to text, the operation results in strange
characters. If I copy some text from the PDF file and paste it into a text editor,
I get the same result.
What is wrong?
Thanks in advance.
Dani
Re: Text extraction results in strange characters
Posted by Ad...@swmc.com.
I've never done any OCR stuff, so I have no idea. However, I'd just like
to mention that one problem that I foresee is that you won't know which
way the text is facing. For example, is it portrait or landscape? Is it
rotated at 90, 180, 270 degrees? I'm not sure how to solve this. One
solution would be to do the OCR 4 times (once at each rotation 0, 90, 180,
270) and just take the "best" result (which would probably mean the
largest amount of text). This would use 4 times the CPU time, but I'm not
sure what your requirements are. Maybe it's very important to go fast, or
maybe you don't care how long it takes as long as the results are the best
possible.
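The "OCR four times and keep the best" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the `ocr` argument is a hypothetical stand-in for whatever OCR engine is used (no real OCR library is called here), and "best" is taken to mean the result with the most letters and digits, which is one possible reading of "largest amount of text". The class and method names are made up for this example.

```java
import java.awt.image.BufferedImage;
import java.util.function.Function;

final class RotationOcr {

    /** Returns a copy of src rotated clockwise by quarterTurns * 90 degrees. */
    static BufferedImage rotate(BufferedImage src, int quarterTurns) {
        int turns = ((quarterTurns % 4) + 4) % 4;
        int w = src.getWidth();
        int h = src.getHeight();
        boolean swap = (turns % 2 == 1); // 90 and 270 swap width and height
        BufferedImage dst = new BufferedImage(
                swap ? h : w, swap ? w : h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int nx;
                int ny;
                switch (turns) {
                    case 1:  nx = h - 1 - y; ny = x;         break; // 90 clockwise
                    case 2:  nx = w - 1 - x; ny = h - 1 - y; break; // 180
                    case 3:  nx = y;         ny = w - 1 - x; break; // 270 clockwise
                    default: nx = x;         ny = y;                // 0
                }
                dst.setRGB(nx, ny, src.getRGB(x, y));
            }
        }
        return dst;
    }

    /** Assumed quality measure: the number of letters and digits recognized. */
    static int score(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    /**
     * Runs the supplied OCR function on the page at 0, 90, 180 and 270 degrees
     * and returns the highest-scoring text. On a tie the earlier rotation wins.
     */
    static String bestOfFourRotations(BufferedImage page,
                                      Function<BufferedImage, String> ocr) {
        String best = "";
        int bestScore = -1;
        for (int turns = 0; turns < 4; turns++) {
            String text = ocr.apply(rotate(page, turns));
            if (text == null) {
                text = "";
            }
            int s = score(text);
            if (s > bestScore) {
                bestScore = s;
                best = text;
            }
        }
        return best;
    }
}
```

As noted above, this costs four OCR passes per page. A character count is also a crude measure, since an engine may produce plenty of garbage letters on a wrongly rotated page, so a dictionary check or an engine-reported confidence value might work better if one is available.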
I've also heard that reducing the colors can help. For example, instead
of having greyscale, convert it to use 1-bit pixels (either black or
white). This will make sure all the edges are sharp and most OCR
algorithms will work better that way. Of course, this could backfire
severely if the text is a light shade of grey (as the entire image would
be converted to white), if the text is in a light color (yellow, light
blue, etc.), or if the background is a dark color (green text on a black
background, for instance). Again here you could do analysis on the image
to try to detect the right filters to run on the image (invert colors so
you have dark text on a light background, color saturation, contrast,
etc.) and you could run the same image through OCR with multiple different
filters and take the best result. It's just a matter of how creative you
want to get, how much CPU power you have to work with, how much
development time you have, and how important it is that the results are as
close to perfect as possible.
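To make the 1-bit conversion and its failure mode concrete, here is a minimal sketch using only the standard `java.awt.image.BufferedImage` class. The fixed threshold, the `invert` flag and the luminance weights are assumptions chosen for illustration, not recommendations from any OCR library, and the class name is made up.

```java
import java.awt.image.BufferedImage;

final class Binarizer {

    /** Approximate perceived brightness (0..255) of a packed RGB pixel. */
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /**
     * Converts an image to 1-bit black and white. Pixels darker than the
     * threshold become black and the rest become white. With invert set,
     * the result is flipped, which suits light text on a dark background.
     */
    static BufferedImage toBlackAndWhite(BufferedImage src, int threshold, boolean invert) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                boolean dark = luminance(src.getRGB(x, y)) < threshold;
                if (invert) {
                    dark = !dark;
                }
                dst.setRGB(x, y, dark ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return dst;
    }
}
```

With a threshold of 128, light grey text (for example RGB 200,200,200) on a white page comes out entirely white, which is exactly the backfire described above. Raising the threshold, or inverting for dark backgrounds, are the kind of filter variations one could try before running OCR.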
But like I said, I've never actually done any OCR myself, so maybe the OCR
libraries out there already take some/most/all of this into account. There
might be someone else on this list who has experience and can provide some
advice. If not, check with the developers of OCR libs; I'm sure they'll
have many good suggestions :-)
----
Thanks,
Adam
From:
Daniel Sánchez González <ds...@gmail.com>
To:
<us...@pdfbox.apache.org>
Date:
06/23/2011 09:47
Subject:
Re: Text extraction results in strange characters
Thank you very much for your explanation. I'll try to convert pdf to image
and then to text via OCR. Which is the most accurate way to do this?
----- Original Message -----
From: <Ad...@swmc.com>
To: <us...@pdfbox.apache.org>
Cc: <us...@pdfbox.apache.org>
Sent: Thursday, June 23, 2011 6:12 PM
Subject: Re: Text extraction results in strange characters
Dani,
The type of font being used is probably embedded and mapped to images of
the characters. This works great for viewing the document, but if you
don't have characters (ASCII or Unicode), you're not going to get
reasonable results when copying and pasting. If my theory is correct,
you'll find that you will also be unable to copy & paste using Adobe
Reader. The only way to get the text out of a file like this would be to
convert it to an image, and then try to use OCR (optical character
recognition) to extract the text. As you probably already know, OCR is
not 100% accurate, but it'd be better than nothing.
Developers,
I suggest we add this to the FAQ on the website. I've seen it come up a
few times, and it's a very interesting explanation.
----
Thanks,
Adam
From:
Daniel Sánchez González <ds...@gmail.com>
To:
users@pdfbox.apache.org
Date:
06/23/2011 04:55
Subject:
Text extraction results in strange characters
When I try to convert a PDF to text the operation results in strange
characters. If I copy some text from PDF file and paste it in a text editor,
I've got the same result.
What is wrong?
Thanks in advance.
Dani
- FHA 203b; 203k; HECM; VA; USDA; Conventional
- Warehouse Lines; FHA-Authorized Originators
- Lending and Servicing in over 45 States
www.swmc.com - www.simplehecmcalculator.com Visit
www.swmc.com/resources for helpful links on Training, Webinars, Lender
Alerts and Submitting Conditions
This email and any content within or attached hereto from Sun West Mortgage
Company, Inc. is confidential and/or legally privileged. The information is
intended only for the use of the individual or entity named on this email.
If you are not the intended recipient, you are hereby notified that any
disclosure, copying, distribution or taking any action in reliance on the
contents of this email information is strictly prohibited, and that the
documents should be returned to this office immediately by email. Receipt by
anyone other than the intended recipient is not a waiver of any privilege.
Please do not include your social security number, account number, or any
other personal or financial information in the content of the email. Should
you have any questions, please call (800) 453 7884.
Re: Text extraction results in strange characters
Posted by Thomas Fischer <fi...@aon.at>.
Hi Daniel,
I don't know if it is working yet, but the EuDML project (www.eudml.eu) is working on a tool to do just that, see http://www.eudml.eu/first-year-demos.
All the best
Thomas
Am 23.06.2011 um 18:46 schrieb Daniel Sánchez González:
> Thank you very much for your explanation. I'll try to convert pdf to image and then to text via OCR. Which is the most accurate way to do this?
Re: Text extraction results in strange characters
Posted by Daniel Sánchez González <ds...@gmail.com>.
Thank you very much for your explanation. I'll try to convert pdf to image
and then to text via OCR. Which is the most accurate way to do this?
Re: Text extraction results in strange characters
Posted by Ad...@swmc.com.
Dani,
The type of font being used is probably embedded and mapped to images of
the characters. This works great for viewing the document, but if you
don't have characters (ASCII or Unicode), you're not going to get
reasonable results when copying and pasting. If my theory is correct,
you'll find that you will also be unable to copy & paste using Adobe
Reader. The only way to get the text out of a file like this would be to
convert it to an image, and then try to use OCR (optical character
recognition) to extract the text. As you probably already know, OCR is
not 100% accurate, but it'd be better than nothing.
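One way to decide automatically whether a file falls into this category is to look at the extracted text itself before reaching for OCR. The sketch below is a rough heuristic, not a PDFBox API: it assumes the text has already been extracted into a `String` by whatever means, and it assumes that "strange characters" show up as control characters, private-use code points, unassigned code points or the U+FFFD replacement character. Fonts with a wrong mapping can also produce ordinary-looking but incorrect letters, which this check will not catch. The class name, method names and threshold are made up for illustration.

```java
final class TextSanity {

    /** True for code points that rarely appear in genuine extracted text. */
    static boolean isSuspicious(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(codePoint);
        if (type == Character.PRIVATE_USE || type == Character.UNASSIGNED) {
            return true;
        }
        // Control characters other than whitespace such as tab or newline.
        return Character.isISOControl(codePoint) && !Character.isWhitespace(codePoint);
    }

    /** Fraction of code points in the text that look suspicious (0.0 for empty text). */
    static double suspiciousRatio(String text) {
        int total = 0;
        int bad = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                bad++;
            }
        }
        return total == 0 ? 0.0 : (double) bad / total;
    }

    /** Suggests falling back to OCR when too much of the text looks like garbage. */
    static boolean needsOcrFallback(String text, double maxRatio) {
        return suspiciousRatio(text) > maxRatio;
    }
}
```

For example, `TextSanity.needsOcrFallback(extractedText, 0.2)` would flag a page where more than a fifth of the characters are suspicious. Where the threshold should sit depends on the documents involved, and an empty result is not flagged here even though a page with no extractable text at all may also need OCR.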
Developers,
I suggest we add this to the FAQ on the website. I've seen it come up a
few times, and it's a very interesting explanation.
----
Thanks,
Adam