Posted to users@pdfbox.apache.org by Golovko Anna <an...@yandex.ru> on 2015/03/26 09:11:18 UTC

problem with parsing pdf with PDFBox

Hello!

My name is Anna Yakubenko. I'm a Java developer, and I currently support an application that converts PDF to text with PDFBox and then stores the data in an XML file as output. Previously, every PDF file was parsed by PDFBox properly, but now I have received a PDF file that is parsed in a way I did not expect. It seems that the customer added a new layer with a picture, a header and a footer to the PDF. Now PDFBox extracts only the header and footer text from every page and misses the important information in the middle of the page.

I use the following source code to call the PDFBox API:

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintStream;
import java.io.PrintWriter;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;

public class PDFTextParser
{
  PDFParser parser;
  String parsedText;
  PDFTextStripper pdfStripper;
  PDDocument pdDoc;
  COSDocument cosDoc;
  PDDocumentInformation pdDocInfo;
  
  String pdftoText(String fileName)
  {
    System.out.println("Parsing text from PDF file " + fileName + "....");
    File f = new File(fileName);
    if (!f.isFile())
    {
      System.out.println("File " + fileName + " does not exist.");
      return null;
    }
    try
    {
      System.out.println("Now defining the parser: new PDFParser ");
      this.parser = new PDFParser(new FileInputStream(f));
    }
    catch (Exception e)
    {
      System.out.println("Unable to open PDF Parser.");
      return null;
    }
    try
    {
      System.out.println("Now working with the parser:  ");
      this.parser.parse();
      this.cosDoc = this.parser.getDocument();
      this.pdfStripper = new PDFTextStripper();
      this.pdDoc = new PDDocument(this.cosDoc);
      this.parsedText = this.pdfStripper.getText(this.pdDoc);
    }
    catch (Exception e)
    {
      System.out.println("An exception occurred in parsing the PDF Document.");
      e.printStackTrace();
      try
      {
        if (this.cosDoc != null) {
          this.cosDoc.close();
        }
        if (this.pdDoc != null) {
          this.pdDoc.close();
        }
      }
      catch (Exception e1)
      {
        // Report the exception thrown while closing, not the outer one again.
        e1.printStackTrace();
      }
      return null;
    }
    System.out.println("Done.");
    return this.parsedText;
  }
  
  void writeTexttoFile(String pdfText, String fileName)
  {
    System.out.println("\nWriting PDF text to output text file " + fileName + "....");
    try
    {
      PrintWriter pw = new PrintWriter(fileName);
      pw.print(pdfText);
      pw.close();
    }
    catch (Exception e)
    {
      System.out.println("An exception occurred in writing the pdf text to file.");
      e.printStackTrace();
    }
    System.out.println("Done.");
  }
  
  public static void main(String[] args)
  {
    if (args.length != 2)
    {
      System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
      System.exit(1);
    }
    System.out.println(" MAIN: Start, both files have been passed ");
    System.out.println(" MAIN:  PDF file (arg 0) : " + args[0]);
    System.out.println(" MAIN:  Text file (arg 1) : " + args[1]);
    PDFTextParser pdfTextParserObj = new PDFTextParser();
    String pdfToText = pdfTextParserObj.pdftoText(args[0]);
    if (pdfToText == null)
    {
      System.out.println("PDF to Text Conversion failed.");
    }
    else
    {
      System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
      pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
    }
  }
}


Could you please advise me how I can extract all the information from the PDF file, or at least the data from the middle of the page? I don't really need the text in the header and footer.

I can send my PDF and TXT files if needed.

Many thanks in advance!

Best regards,
Anna Yakubenko

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: problem with parsing pdf with PDFBox

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hello Anna,

> On 26.03.2015 at 09:11, Golovko Anna <an...@yandex.ru> wrote:
> 
> Hello!
> 
> My name is Anna Yakubenko. I'm a Java developer, and I currently support an application that converts PDF to text with PDFBox and then stores the data in an XML file as output. Previously, every PDF file was parsed by PDFBox properly, but now I have received a PDF file that is parsed in a way I did not expect. It seems that the customer added a new layer with a picture, a header and a footer to the PDF. Now PDFBox extracts only the header and footer text from every page and misses the important information in the middle of the page.
> 
> I use the following source code to call the PDFBox API:
> 

You could simplify your code a lot by doing something like the following (I haven't tested it, so there might be typos). The typical way to parse a PDF document is to call PDDocument.load, which does the rest in the background for you and already returns the PDDocument you need for the PDFTextStripper.

    void pdftoText(String pdfFile, String outputFile)
    {

        System.out.println("Parsing text from PDF file " + pdfFile + "....");
        File f = new File(pdfFile);
        if (!f.isFile())
        {
            System.out.println("File " + pdfFile + " does not exist.");
            return; // nothing to parse, so do not fall through to PDDocument.load
        }
        
        PDDocument pdDoc = null;
        Writer output = null;
        try
        {
            pdDoc = PDDocument.load(f);
            output = new OutputStreamWriter( new FileOutputStream( outputFile ));
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.writeText(pdDoc, output);
        }
        catch (IOException e)
        {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
        }
        finally
        {
            IOUtils.closeQuietly(pdDoc);
            IOUtils.closeQuietly(output);
        }

        System.out.println("Done.");
    }

In addition, there is already a command-line app, ExtractText, which does that for you.
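
On the output side, here is a minimal sketch of writing already-extracted text with try-with-resources and an explicit UTF-8 encoding. It assumes Java 7 or later, and the class and method names are made up for illustration (they are not part of PDFBox). The writer is closed even if writing fails, and characters such as umlauts no longer depend on the platform default charset:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextFileWriterSketch {

    // Hypothetical helper, not a PDFBox API: writes the given text to the
    // target file as UTF-8. try-with-resources closes the writer on both
    // the success path and the error path.
    static void writeText(String text, Path target) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // The output file name is an assumption for this demo.
        Path target = Paths.get(args.length > 0 ? args[0] : "output.txt");
        writeText("Text extracted from the PDF would go here.\n", target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

The same try-with-resources pattern should also work for any other Closeable involved, which would remove the need for a finally block.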



> 
> Could you please advise me how I can extract all the information from the PDF file, or at least the data from the middle of the page? I don't really need the text in the header and footer.
> 
> I can send my PDF and TXT files if needed.
> 

With regard to the PDF, could you upload it to a public location so we can give it a try?
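
On skipping the header and footer: one possible approach, given here only as an untested sketch, is to restrict extraction to a rectangle that covers just the body of the page. The helper below is hypothetical and uses only java.awt.geom. The page size and the 70 pt and 60 pt band heights are assumed values that would have to be measured on the real PDF. PDFBox's PDFTextStripperByArea accepts such a rectangle through addRegion. Note that this only filters text the stripper can already see, so it would not by itself recover body text that is missing from the output.

```java
import java.awt.geom.Rectangle2D;

class BodyRegionSketch {

    // Hypothetical helper: returns the part of a page that remains after
    // cutting a header band off the top and a footer band off the bottom.
    // Coordinates use a top-left origin with y growing downward.
    static Rectangle2D bodyRegion(double pageWidth, double pageHeight,
                                  double headerHeight, double footerHeight) {
        double bodyHeight = pageHeight - headerHeight - footerHeight;
        if (bodyHeight <= 0) {
            throw new IllegalArgumentException("header and footer cover the whole page");
        }
        return new Rectangle2D.Double(0, headerHeight, pageWidth, bodyHeight);
    }

    public static void main(String[] args) {
        // Assumed values: an A4 page in PDF points (595 x 842),
        // a 70 pt header band and a 60 pt footer band.
        Rectangle2D body = bodyRegion(595, 842, 70, 60);
        System.out.println("Body region: " + body);

        // With PDFBox on the classpath, the rectangle could then be used
        // roughly like this (not compiled here):
        //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        //   stripper.addRegion("body", body);
        //   stripper.extractRegions(page);   // page is a PDPage
        //   String text = stripper.getTextForRegion("body");
    }
}
```

A fixed rectangle is the simplest choice. It assumes the header and footer sit at the same position on every page, which should be checked against the actual file.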

BR
Maruan

