Posted to users@pdfbox.apache.org by Ralph Cook <rc...@pobox.com> on 2022/01/23 18:02:08 UTC
Problem with text extraction
I am using PDFBox's PDFTextStripper.getText() for a particular kind of
PDF file generated by a government agency, and the text I'm getting does
not match that displayed by Acrobat Reader for the same files. The
getText() calls occasionally get characters Reader does not display, and
in one case getText() gets an "O" instead of the "U" displayed by
Reader. I would like to know if there's some way I can get the same text as
Reader displays.
The text from Reader is "correct", i.e., it is (clearly) the text
intended by the program(s) generating the files. The extracted text
contains typos and misspelled words.
Unfortunately, I cannot share any of the PDF files. They contain
confidential information.
The rest of this email relates various things I have tried, mostly to
understand the problem better.
I copied the text within Reader, just using control-A / control-C, then
pasted the text into a text editor. The text pasted this way matches the
extracted text, not the Reader-displayed text (the copied/pasted text
does not have the line breaks that getText() gives). With my newfound
(very limited) knowledge of how PDFs are constructed, this made me
wonder if some of the content displayed by Reader is somewhere other
than the Tj streams in the document.
I've downloaded and attempted to extract information with various tools
-- mupdf, qpdf, and XpdfReader, so far. I've found it difficult to
figure out how to use them, mostly because their help text assumes you
know things about PDF that I'm still trying to learn. I have not yet
managed, with any of them, to get an uncompressed text document that
shows the PDF commands and their arguments in readable form. I thought
if I could do that I might at least figure out the location of the
information that is displayed by Reader but not extracted by PDFBox. I
haven't gotten much useful out of them yet.
I downloaded PDFBox source and stepped through code to follow how
getText() works. I ran across the LegacyPDFStreamEngine class comments
indicating that it is only to be used for PDFTextStripper. At least
sometimes, a word from the file is passed to
PDFTextStripper.showText(byte[] string) as a byte array of PDF letter
codes, and then showGlyph() is called on each one. Oddly, the spacing
for each glyph is not a constant, which I expected it to be for a
fixed-width font, but if it's only used for extraction, I guess that
doesn't matter.
I put a trace statement on PDFTextStripper.processTextPosition(); for
every character on page 6 of a particular document, it displayed the
page number, character string, the flag indicating whether the character
is to be shown (always true), and the X and Y position of the character.
I put the result into a spreadsheet and sorted it by Y then by X, to see
if the Reader-displayed characters showed up out of sequence. None of
them do.
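For anyone who wants to repeat the Y-then-X sort without a spreadsheet, here is a minimal sketch in Java. TracedGlyph and GlyphTraceSorter are hypothetical names made up for this illustration, not PDFBox classes; the values are assumed to be the ones printed by the trace statement, and the sketch assumes glyphs on the same line report exactly the same Y.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one traced character; not a PDFBox class.
final class TracedGlyph {
    private final int page;
    private final String text;
    private final double x;
    private final double y;

    TracedGlyph(int page, String text, double x, double y) {
        this.page = page;
        this.text = text;
        this.x = x;
        this.y = y;
    }

    int page() { return page; }
    String text() { return text; }
    double x() { return x; }
    double y() { return y; }
}

// Hypothetical helper that orders traced characters the way the spreadsheet did.
final class GlyphTraceSorter {
    // Returns a sorted copy: by Y first, then by X within the same Y.
    static List<TracedGlyph> sortByYThenX(List<TracedGlyph> glyphs) {
        List<TracedGlyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(TracedGlyph::y)
                .thenComparingDouble(TracedGlyph::x));
        return sorted;
    }

    // Concatenates the character strings in list order.
    static String joinText(List<TracedGlyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (TracedGlyph g : glyphs) {
            sb.append(g.text());
        }
        return sb.toString();
    }
}
```

Comparing joinText() of the sorted list against the text Reader displays would show whether any displayed character appears out of sequence in the trace.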
In the case of "O" instead of "U" -- part of the page header on the
Reader displayed printout has a line with "CUST ID" on it; for pages 2-5
of this file, the extracted text shows "CUST ID" correctly, but "COST
ID" on the 6th page.
Here's the Reader version of those lines:
And the extracted text of the same lines.
The "CUST ID" is part of a page header; on pages 2-5, "CUST ID" is
displayed and extracted correctly. This is the only case I've noticed so
far where there's a seeming change in a character, as opposed to extra
characters.
Here are some other redacted lines from the Reader display of this report:
And here is the extracted text from the same part of the file:
I included these images inline; I also attached them, since I don't know
what facilities people have to read inline attachments.
The similarity of the errors on these lines -- that all three of the
error lines had dates in February in the second position on the line and
all had the same error -- must mean something, but I don't know what.
I've got other information, but I don't know how much of it (or of what
I've provided) is helpful.
I do not expect anyone to 'solve the problem' based on this information.
But I was hoping to get pointers to ways I could attempt to get the same
text that Acrobat Reader displays, hopefully using PDFBox, but I'll
change libraries or methods if I need to.
rc
Re: Problem with text extraction
Posted by Kevin Day <ke...@trumpetinc.com>.
This sounds a lot like OCR.
If you zoom in on one of the problem words, is it pixelated? If so, then
this is an image with an invisible layer of OCRed text on top of it.
You can also check this by selecting and copying the text in Acrobat Reader,
then pasting into a text editor. If the text is the same as what
PDFTextStripper is giving you, then it's an OCR accuracy thing, and there is
nothing you can do about it (short of getting better OCR - but even the
best OCR is 99.9% accurate - which sounds good, until you count up the
number of words in a document and realize that 0.1% of them is still quite
a few errors).
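To put a rough number on that, here is a small Java sketch of the arithmetic. OcrErrorEstimate is a hypothetical name, and the 99.9% figure is only the ballpark mentioned above, not a measured value for any particular OCR engine.

```java
// Hypothetical helper: expected number of misrecognized units (words or
// characters) for a given unit count and per-unit accuracy.
final class OcrErrorEstimate {
    static double expectedErrors(long unitCount, double accuracy) {
        if (unitCount < 0) {
            throw new IllegalArgumentException("unitCount must be >= 0");
        }
        if (accuracy < 0.0 || accuracy > 1.0) {
            throw new IllegalArgumentException("accuracy must be in [0, 1]");
        }
        // Error rate is the complement of accuracy.
        return unitCount * (1.0 - accuracy);
    }
}
```

For example, expectedErrors(50000, 0.999) comes out at about 50 for an assumed 50,000-word document.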
Re: Problem with text extraction
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
Your screenshots didn't get through. There are so many things that can
go wrong with PDF, so it's difficult to tell without the file.
"then pasted the text into a text editor. The text pasted this way
matches the extracted text"
Then it means PDFBox is correct. It's possible that the unicode text for
a glyph is a wrong one. Sometimes it is intended to make text extraction
difficult. It could also be a crappy OCR.
"I have not yet managed, with any of them, to get an uncompressed text
document that shows the PDF commands and their arguments in readable form"
Try PDFBox PDFDebugger!
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Problem with text extraction
Posted by John Lussmyer <Co...@CasaDelGato.com>.
On Sun Jan 23 10:02:08 PST 2022 rcook@pobox.com said:
>I am using PDFBox's PDFTextStripper.getText() for a particular kind of
>PDF file generated by a government agency, and the text I'm getting does
>not match that displayed by Acrobat Reader for the same files. The
>getText() calls occasionally get characters Reader does not display, and
>in one case getText() gets an "O" instead of the "U" displayed by
>Reader. I would like to know if there's some way I can get same text as
>Reader displays.
Have you checked for embedded fonts in the PDF? It's quite possible to have fonts where the code for "A" is NOT the same as the ASCII "A".
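A minimal Java sketch of the idea, under stated assumptions: GlyphCodeDecoder and the code tables below are hypothetical and are not how PDFBox represents fonts. The point is only that the bytes in the content stream are font-specific codes, and the extracted text depends on a separate code-to-Unicode table, while the viewer draws the glyph shape selected by the code.

```java
import java.util.Map;

// Hypothetical decoder: turns one-byte character codes into text using a
// code-to-Unicode table. Codes missing from the table become "?".
final class GlyphCodeDecoder {
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            // Mask to treat the byte as an unsigned code 0..255.
            String mapped = toUnicode.get(b & 0xFF);
            sb.append(mapped != null ? mapped : "?");
        }
        return sb.toString();
    }
}
```

With a made-up table 1=C, 2=U, 3=S, 4=T, the codes {1, 2, 3, 4} decode to "CUST". If the table instead says 2=O, the same codes decode to "COST", even though a viewer would still draw the glyph for code 2 exactly as before.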