Posted to users@pdfbox.apache.org by Miran Damjanovic <MI...@statoil.com> on 2013/04/10 10:56:43 UTC

PDFBox Parsing problem - EOF

Hello,

I have been using PDFBox to extract text from PDFs and validate some of it. Recently I have had
problems parsing the PDFs; more precisely, I get a java.io.IOException. I use the following code
to get the text from a PDF:
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
// PDFBox 1.x package names
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public String getTextFromPDF(URL url, int readTimeout, int connectTimeout) throws IOException {
      //open connection
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      PDDocument pdoc = null;
      try {
            //set caching to false
            conn.setUseCaches( false );

            //set read timeout
            conn.setReadTimeout( readTimeout );

            //set connect timeout
            conn.setConnectTimeout( connectTimeout );

            //get input stream from connection
            InputStream fileToParse = conn.getInputStream();

            System.out.println( fileToParse.toString());

            //parser object
            PDFParser parser = new PDFParser(fileToParse, null, true);

            //do parse
            parser.parse();

            //get document
            pdoc = parser.getPDDocument();

            //get stripper object
            PDFTextStripper stripper = new PDFTextStripper();

            //get text and return content
            return stripper.getText( pdoc );
      } finally {
            //close doc and disconnect, even if parsing fails
            try {
                  if (pdoc != null) {
                        pdoc.close();
                  }
            } finally {
                  conn.disconnect();
            }
      }
}
The error message I get is this (line 51 is where I call parser.parse() above):
[Inline image of the error message; not preserved in this plain-text archive]

I appreciate any tips and help you can provide. Many thanks in advance,
Miran Damjanovic



Re: PDFBox Parsing problem - EOF

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

Unfortunately, the error message didn't make it through. Could you copy it as text and resend?

You can also simplify some of your code by using the PDDocument.load or PDDocument.loadNonSeq method.
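
Until the actual error text is available, one thing that may be worth ruling out is a truncated or non-PDF HTTP response. The following is only a rough sketch under that assumption, not a diagnosis: it reads the whole response into memory and checks for the "%PDF-" header and the "%%EOF" trailer before anything is handed to PDFBox. The class and method names (PdfDownloadCheck, readFully, hasPdfHeader, hasEofMarker) are hypothetical, and the 1024-byte search window is an assumption borrowed from common reader behaviour. The PDFBox calls appear only in a comment so that the block compiles with the JDK alone.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class PdfDownloadCheck {

    /** Reads the stream to its end and returns everything that was received. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** True if the data starts with the "%PDF-" header. */
    static boolean hasPdfHeader(byte[] data) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if "%%EOF" occurs within the last 1024 bytes (assumed window). */
    static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        String tail = new String(data, from, data.length - from,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    // Possible use with PDFBox 1.8.x (assumed to be on the classpath):
    //
    //   byte[] data = PdfDownloadCheck.readFully(conn.getInputStream());
    //   if (PdfDownloadCheck.hasPdfHeader(data) && PdfDownloadCheck.hasEofMarker(data)) {
    //       PDDocument doc = PDDocument.load(new ByteArrayInputStream(data));
    //       try {
    //           text = new PDFTextStripper().getText(doc);
    //       } finally {
    //           doc.close();
    //       }
    //   }
}
```

If either check fails, the download itself (an error page, a timeout, a proxy) is a more likely culprit than the parser. If both pass and parsing still fails, the exact exception text becomes even more useful.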

BR
Maruan Sahyoun
