Posted to users@pdfbox.apache.org by JZ Q <cc...@gmail.com> on 2018/11/08 14:54:37 UTC

PDFTextStripper() does not extract text correctly

Hi everyone,

I used the following code (library version 2.0.12) to extract text from a
PDF file. The digit "3" is occasionally extracted as "6"; for
example, E4283211 becomes E4286211.

Is this normal? Does the code use OCR? Thanks.


import org.apache.pdfbox.text.PDFTextStripper;

PDFTextStripper pdfStripper = new PDFTextStripper();
// Restrict extraction to page i (page numbers are 1-based).
pdfStripper.setStartPage(i);
pdfStripper.setEndPage(i);

String text = pdfStripper.getText(pdDoc);
String[] docxLines = text.split(System.lineSeparator());
for (String line : docxLines) {
    // ...
}

-- 
Best Wishes,
Jason

Re: PDFTextStripper() does not extract text correctly

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.11.2018 at 15:54, JZ Q wrote:
> Hi everyone,
>
> I used the following code (library version 2.0.12) to extract text from a
> PDF file. The digit "3" is occasionally extracted as "6"; for
> example, E4283211 becomes E4286211.
>
> Is this normal? Does the code use OCR? Thanks.

No, PDFBox doesn't have OCR (but Tika has it as an option).

It could be that your PDF is a scanned image with an invisible OCR text
layer. Please upload your PDF somewhere and post a link (don't attach it).

Tilman
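
When reporting a substitution like the one in this thread (E4283211 read
as E4286211), it can help to state exactly which characters changed. The
following is a minimal sketch using only the Java standard library; the
class name ExtractionDiff and the method diff are hypothetical helpers, not
part of PDFBox, and the sketch assumes the expected value is known from
looking at the rendered page.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractionDiff {

    /**
     * Lists the positions at which two equal-length strings differ,
     * formatted as "index: expected -> actual" (index is 0-based).
     * Hypothetical helper for illustration; not a PDFBox API.
     */
    public static List<String> diff(String expected, String actual) {
        if (expected.length() != actual.length()) {
            throw new IllegalArgumentException("strings must have the same length");
        }
        List<String> mismatches = new ArrayList<>();
        for (int i = 0; i < expected.length(); i++) {
            char e = expected.charAt(i);
            char a = actual.charAt(i);
            if (e != a) {
                mismatches.add(i + ": " + e + " -> " + a);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Values from the report: what the page shows vs. what getText() returned.
        System.out.println(diff("E4283211", "E4286211")); // prints [4: 3 -> 6]
    }
}
```

If the same substitutions (such as 3 -> 6) recur across many values, that
would be consistent with recognition errors stored in the file's text layer
rather than with a fault in the extraction code, though only inspecting the
PDF itself can confirm this.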
