Posted to users@pdfbox.apache.org by Ralph Cook <rc...@pobox.com> on 2022/01/23 18:02:08 UTC

Problem with text extraction

I am using PDFBox's PDFTextStripper.getText() for a particular kind of 
PDF file generated by a government agency, and the text I'm getting does 
not match that displayed by Acrobat Reader for the same files. The 
getText() calls occasionally get characters Reader does not display, and 
in one case getText() gets an "O" instead of the "U" displayed by 
Reader. I would like to know if there's some way I can get the same text
as Reader displays.

The text from Reader is "correct", i.e., it is (clearly) the text 
intended by the program(s) generating the files. The extracted text 
contains typos and misspelled words.

Unfortunately, I cannot share any of the PDF files. They contain 
confidential information.

The rest of this email relates various things I have tried, mostly to 
understand the problem better.

I copied the text within Reader, just using control-A / control-C, then 
pasted the text into a text editor. The text pasted this way matches the 
extracted text, not the Reader-displayed text (the copied/pasted text 
does not have the line breaks that getText() gives). With my newfound 
(very limited) knowledge of how PDFs are constructed, this made me 
wonder if some of the content displayed by Reader is somewhere other 
than the Tj streams in the document.
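
For anyone repeating the copy/paste comparison, here is a minimal sketch in plain Java (no PDFBox needed) that compares two texts while ignoring line-break and spacing differences. The class name, method names, and sample strings are hypothetical; real input would be the text pasted from Reader and the string returned by getText().

```java
class TextCompareSketch {
    /** Collapse every run of whitespace (including line breaks) to one space and trim the ends. */
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Index of the first differing character after normalization, or -1 if the texts match. */
    static int firstDifference(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            if (x.charAt(i) != y.charAt(i)) {
                return i;
            }
        }
        // Same prefix: either identical, or one text is longer than the other.
        return x.length() == y.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Made-up sample strings, not taken from the real documents.
        String pasted = "CUST ID 12345 NAME";
        String extracted = "CUST ID\n12345   NAME\n";
        System.out.println(firstDifference(pasted, extracted));
    }
}
```

A result of -1 means the two texts differ only in whitespace, which would support the conclusion that copy/paste and extraction read the same underlying text.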

I've downloaded and attempted to extract information with various tools 
-- mupdf, qpdf, and XpdfReader, so far. I've found it difficult to 
figure out how to use them, mostly because their help text assumes you 
know things about PDF that I'm still trying to learn. I have not yet 
managed, with any of them, to get an uncompressed text document that 
shows the PDF commands and their arguments in readable form. I thought 
if I could do that I might at least figure out the location of the 
information that is displayed by Reader but not extracted by PDFBox. I 
haven't gotten much useful out of them yet.

I downloaded PDFBox source and stepped through code to follow how 
getText() works. I ran across the LegacyPDFStreamEngine class comments 
indicating that it is only to be used for PDFTextStripper. At least 
sometimes, a word from the file is passed to 
PDFTextStripper.showText(byte[] string) as a byte array of PDF letter 
codes, and then showGlyph() is called on each one. Oddly, the spacing 
for each glyph is not a constant, which I expected it to be for a 
fixed-width font, but if it's only used for extraction, I guess that 
doesn't matter.

I put a trace statement on PDFTextStripper.processTextPosition(); for 
every character on page 6 of a particular document, it displayed the 
page number, character string, the flag indicating whether the character 
is to be shown (always true), and the X and Y position of the character. 
I put the result into a spreadsheet and sorted it by Y then by X, to see 
if the Reader-displayed characters showed up out of sequence. None of 
them do.
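
The sort step described above can also be done without a spreadsheet. This is only a sketch in plain Java: the TraceRow class and the sample coordinates are hypothetical stand-ins for the trace output, and nothing here calls PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TraceSortSketch {
    /** One traced character: a hypothetical stand-in for a line of trace output. */
    static final class TraceRow {
        final String text;
        final float x;
        final float y;

        TraceRow(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Return a copy of the rows ordered by Y, then by X within the same Y. */
    static List<TraceRow> sortByYThenX(List<TraceRow> rows) {
        List<TraceRow> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.<TraceRow>comparingDouble(r -> r.y)
                .thenComparingDouble(r -> r.x));
        return sorted;
    }

    /** Concatenate the text of the rows in their current order. */
    static String join(List<TraceRow> rows) {
        StringBuilder out = new StringBuilder();
        for (TraceRow row : rows) {
            out.append(row.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up rows in scrambled order; real rows would come from the trace.
        List<TraceRow> rows = new ArrayList<>();
        rows.add(new TraceRow("S", 30f, 100f));
        rows.add(new TraceRow("C", 10f, 100f));
        rows.add(new TraceRow("T", 40f, 100f));
        rows.add(new TraceRow("U", 20f, 100f));
        System.out.println(join(sortByYThenX(rows)));
    }
}
```

Characters on one visual line may have slightly different Y values, so a real comparison might need to round Y before sorting; the sketch does not attempt that.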

In the case of "O" instead of "U" -- part of the page header on the 
Reader displayed printout has a line with "CUST ID" on it; for pages 2-5 
of this file, the extracted text shows "CUST ID" correctly, but "COST 
ID" on the 6th page.

Here's the Reader version of those lines:



And the extracted text of the same lines.



The "CUST ID" is part of a page header; on pages 2-5, "CUST ID" is 
displayed and extracted correctly. This is the only case I've noticed so 
far where there's a seeming change in a character, as opposed to extra 
characters.

Here are some other redacted lines from the Reader display of this report:



And here is the extracted text from the same part of the file:



I included these images inline; I also attached them, since I don't know 
what facilities people have to read inline attachments.

The similarity of the errors on these lines -- that all three of the 
error lines had dates in February in the second position on the line and 
all had the same error -- must mean something, but I don't know what.

I've got other information, but I don't know how much of it (or of what 
I've provided) is helpful.

I do not expect anyone to 'solve the problem' based on this information. 
But I was hoping to get pointers to ways I could attempt to get the same 
text that Acrobat Reader displays, hopefully using PDFBox, but I'll 
change libraries or methods if I need to.

rc

Re: Problem with text extraction

Posted by Kevin Day <ke...@trumpetinc.com>.
This sounds a lot like OCR.

If you zoom in on one of the problem words, is it pixelated? If so, then
this is an image with an invisible layer of OCRed text on top of it.

You can also check this by selecting and copying the text in Acrobat Reader,
then pasting into a text editor. If the text is the same as what
PDFTextStripper is giving you, then it's an OCR accuracy thing, and there is
nothing you can do about it (short of getting better OCR - but even the
best OCR is 99.9% accurate - which sounds good, until you count up the
number of words in a document and realize that 0.1% of them is still quite
a few errors).
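
To put rough numbers on that, here is a small arithmetic sketch in plain Java. The 99.9% figure is the one quoted above; the word counts are made-up sample sizes, and the calculation assumes every word has the same independent chance of being misread, which real OCR does not guarantee.

```java
class OcrErrorEstimate {
    /** Expected number of misread words at a fixed per-word accuracy (rounded to a whole word). */
    static long expectedErrors(long wordCount, double accuracy) {
        return Math.round(wordCount * (1.0 - accuracy));
    }

    public static void main(String[] args) {
        double accuracy = 0.999; // figure quoted in the message, not a measurement
        long[] sampleWordCounts = {2_000, 20_000, 1_000_000}; // hypothetical document sizes
        for (long words : sampleWordCounts) {
            System.out.println(words + " words -> about "
                    + expectedErrors(words, accuracy) + " misread words");
        }
    }
}
```

At that rate a 20,000-word report would carry roughly 20 misread words, which fits the picture of occasional typos scattered through otherwise correct text.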


Re: Problem with text extraction

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

Your screenshots didn't get through. There are so many things that can 
go wrong with PDF, so it's difficult to tell without the file.

"then pasted the text into a text editor. The text pasted this way 
matches the extracted text"

Then it means PDFBox is correct. It's possible that the Unicode text 
assigned to a glyph is wrong. Sometimes that is done on purpose to make 
text extraction difficult. It could also be crappy OCR.
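
As an illustration of how a wrong Unicode assignment could produce "COST" where the page shows "CUST", here is a toy model in plain Java. It is not PDFBox code and not a diagnosis of the actual files: the character codes, both tables, and all names are hypothetical. The idea is that a viewer draws the glyph shape for each code, while copy/paste and text extraction look the same code up in a separate code-to-Unicode table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GlyphMappingSketch {
    /** Turn character codes into text by looking each code up in the given table. */
    static String decode(int[] codes, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            // Unknown codes become "?" so gaps in the table stay visible.
            out.append(table.getOrDefault(code, "?"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical subset font: the codes are arbitrary small numbers, not ASCII.
        Map<Integer, String> glyphShapes = new LinkedHashMap<>();
        glyphShapes.put(1, "C");
        glyphShapes.put(2, "U");
        glyphShapes.put(3, "S");
        glyphShapes.put(4, "T");

        // Hypothetical faulty code-to-Unicode table: code 2 is labelled "O".
        Map<Integer, String> unicodeTable = new LinkedHashMap<>(glyphShapes);
        unicodeTable.put(2, "O");

        int[] codes = {1, 2, 3, 4};
        System.out.println("drawn on screen: " + decode(codes, glyphShapes));
        System.out.println("copied or extracted: " + decode(codes, unicodeTable));
    }
}
```

In this toy model the same four codes come out as CUST when drawn and as COST when copied or extracted, so the display and the extracted text can disagree even though the extractor reads the file correctly.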

"I have not yet managed, with any of them, to get an uncompressed text 
document that shows the PDF commands and their arguments in readable form"

Try PDFBox PDFDebugger!

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Problem with text extraction

Posted by John Lussmyer <Co...@CasaDelGato.com>.
On Sun Jan 23 10:02:08 PST 2022 rcook@pobox.com said:
>I am using PDFBox's PDFTextStripper.getText() for a particular kind of
>PDF file generated by a government agency, and the text I'm getting does
>not match that displayed by Acrobat Reader for the same files. The
>getText() calls occasionally get characters Reader does not display, and
>in one case getText() gets an "O" instead of the "U" displayed by
>Reader. I would like to know if there's some way I can get same text as
>Reader displays.

Have you checked for embedded fonts in the PDF?  It's quite possible to have fonts where the code for "A" is NOT the same as the ASCII "A".


--

Worlds only All Electric F-250 truck! http://john.casadelgato.com/Electric-Vehicles/1995-Ford-F-250