Posted to users@pdfbox.apache.org by Tiago Santos <ti...@yahoo.com.br.INVALID> on 2019/02/13 16:44:41 UTC
Support for extract text
Hi fellows,
I need help with a problem extracting text from my PDF file.
My code looks like this:
private static void pdf3() {
    PDDocument document = null;
    PDFTextStripperByArea stripper = null;
    try {
        File fl = new File("c:\\java\\energy.pdf"); //new File("c:\\java\\Tigas Piloto.pdf");
        document = PDDocument.load(fl);
        stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        Rectangle rect = new Rectangle( 10, 280, 275, 60 );
        stripper.addRegion( "class1", rect );
        PDPage firstPage = document.getPage(0);
        stripper.extractRegions( firstPage );
        System.out.println( "Text in the area:" + rect );
        System.out.println( stripper.getTextForRegion( "class1" ) );
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
And the result is below:
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+101 (101) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+84 (84) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+64 (64) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+1 (1) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+10 (10) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+42 (42) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+41 (41) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+8 (8) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+31 (31) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+5 (5) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+21 (21) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+81 (81) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+85 (85) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+11 (11) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+94 (94) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+90 (90) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+93 (93) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+100 (100) in font AllAndNone2
fev 13, 2019 2:09:35 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
ADVERTÊNCIA: No Unicode mapping for CID+102 (102) in font AllAndNone2
Thanks
____________________________________
Tiago Santos
e-mail: tiagosantos12@yahoo.com.br
Re: Support for extract text
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
"No Unicode mapping" means just that - it is missing so you won't have
any text for these glyphs. A glyph is just the visualization of a
character. You'll have to use OCR.
See also: https://pdfbox.apache.org/2.0/faq.html#text-extraction
Feel free to upload your file somewhere, I'll have a look at it. (But
read the FAQ first)
Tilman
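To illustrate the point above about missing Unicode mappings, here is a minimal sketch in plain Java. It is a toy model, not PDFBox code: the class name ToUnicodeSketch, the decode method, and the sample codes and table contents are all made up for illustration. It assumes a font's mapping can be pictured as a table from character code to Unicode text, and that a code with no entry contributes nothing to the extracted text.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {

    // Toy stand-in (hypothetical, not the PDFBox API) for text extraction:
    // look up each character code in a code -> Unicode table.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                // No mapping: the glyph can still be drawn, but there is
                // no character to emit, so it is skipped with a warning.
                System.err.println("No Unicode mapping for CID+" + code);
                continue;
            }
            text.append(unicode);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up table: codes 1 and 2 map to "H" and "i".
        Map<Integer, String> complete = new HashMap<>();
        complete.put(1, "H");
        complete.put(2, "i");

        int[] codes = {1, 2};

        // With the table, the codes decode to text.
        System.out.println("[" + decode(codes, complete) + "]");

        // With an empty table, every code is skipped and the result is empty.
        System.out.println("[" + decode(codes, new HashMap<>()) + "]");
    }
}
```

With the empty table the second call returns an empty string, which is the situation the warnings in the log describe: the page still shows glyphs, but there is nothing to turn them back into characters, hence the suggestion to use OCR.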