Posted to users@pdfbox.apache.org by Miran Damjanovic <MI...@statoil.com> on 2013/04/10 10:56:43 UTC
PDFBox Parsing problem - EOF
Hello,
I have been using PDFBox to extract text from PDFs and validate some of it. Recently I have had
problems parsing the PDFs; more precisely, I get a java.io.IOException. I use the following code
to get the text from a PDF:
public String getTextFromPDF(URL url, int readTimeout, int connectTimeout) throws IOException {
    //open connection
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
        //set caching to false
        conn.setUseCaches( false );
        //set read timeout
        conn.setReadTimeout( readTimeout );
        //set connect timeout
        conn.setConnectTimeout( connectTimeout );
        //get input stream from connection
        InputStream fileToParse = conn.getInputStream();
        System.out.println( fileToParse.toString());
        //parser object
        PDFParser parser = new PDFParser(fileToParse, null, true);
        //do parse
        parser.parse();
        //get document
        PDDocument pdoc = parser.getPDDocument();
        try {
            //get stripper object
            PDFTextStripper stripper = new PDFTextStripper();
            //get text and return content
            return stripper.getText( pdoc );
        } finally {
            //close doc, even if text extraction fails
            pdoc.close();
        }
    } finally {
        //disconnect, even if parsing fails
        conn.disconnect();
    }
}
The error message I get is this (line 51 is where I call parser.parse() above):
[Image attachment image001.png containing the error message; not preserved in the plain-text archive.]
I appreciate any tips and help you can provide. Many thanks in advance,
Miran Damjanovic
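
Since the exception is raised in parser.parse() while the parser reads straight from the HTTP stream, one thing that may be worth ruling out is an incomplete or non-PDF response. The error message itself is not preserved here, so the following is not a diagnosis, only a rough stdlib-only sketch of such a check. The class and method names (PdfDownloadCheck, readFully, looksLikeCompletePdf) are hypothetical, and the 1024-byte search window is an assumption chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PdfDownloadCheck {

    // Reads the whole stream into memory so the bytes can be inspected
    // (and re-read) before they are handed to a PDF parser.
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Rough heuristic only: a "%PDF-" header near the start of the data
    // and a "%%EOF" marker near the end. It does not validate the PDF.
    public static boolean looksLikeCompletePdf(byte[] data) {
        int window = Math.min(data.length, 1024); // assumed search window
        String head = new String(data, 0, window, StandardCharsets.ISO_8859_1);
        String tail = new String(data, data.length - window, window, StandardCharsets.ISO_8859_1);
        return head.contains("%PDF-") && tail.contains("%%EOF");
    }
}
```

In the method above, this could be used as byte[] data = PdfDownloadCheck.readFully(conn.getInputStream()); followed by passing new ByteArrayInputStream(data) to the parser once looksLikeCompletePdf(data) returns true. If the check fails, printing conn.getResponseCode() and the first few bytes of data may show what the server actually sent.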
Re: PDFBox Parsing problem - EOF
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,
Unfortunately the error message didn't make it through. Could you write it out as text and resend?
You can also simplify some of your code by using the .load or .loadNonSeq method of PDDocument.
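
A minimal sketch of what the .load suggestion could look like, assuming a PDFBox 1.x release (where PDFTextStripper lives in org.apache.pdfbox.util) is on the classpath. The class name PdfTextExample and the method extractText are hypothetical; only PDDocument.load(InputStream), PDFTextStripper.getText(PDDocument) and PDDocument.close() are PDFBox calls.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.x package (assumed version)

public class PdfTextExample {

    // Hypothetical helper: loads a PDF from a stream and returns its text.
    public static String extractText(InputStream in) throws IOException {
        // PDDocument.load replaces the explicit PDFParser construction,
        // parse() call and getPDDocument() call.
        PDDocument doc = PDDocument.load(in);
        try {
            return new PDFTextStripper().getText(doc);
        } finally {
            // Release the document's resources even if extraction fails.
            doc.close();
        }
    }
}
```

With this helper, the body of getTextFromPDF would reduce to configuring the connection and returning PdfTextExample.extractText(conn.getInputStream()).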
BR
Maruan Sahyoun