Posted to users@pdfbox.apache.org by Nitin Shukla <Ni...@mindtree.com> on 2009/11/19 12:24:25 UTC

Text extraction - Any tutorials?

Hello,

I want to extract text, together with details such as its location and font, from PDF files, and I am looking for PDF libraries that can help me do this. I came across PDFBox today and would like to evaluate it.

I am looking for a quick tutorial that can help me get started with using the PDFBox library to extract text from a PDF along with its font information, text location, etc. Can anyone point me to a tutorial that shows how to use the PDFBox APIs for this?


I tried running the command-line utility that is bundled with the PDFBox jar to extract text, as follows.

$ java -cp log4j-1.2.15.jar;pdfbox-0.8.0-incubating.jar org/apache/pdfbox/ExtractText "D:\Test Lab\Murex Sample Reports\INVOICE00009.pdf" INVOICE00009.txt

However, running the above command threw the following error.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory

I don't see org/apache/commons/logging/LogFactory in either pdfbox-0.8.0-incubating.jar or log4j-1.2.15.jar. Can someone point out what I am doing wrong? Am I missing something?

Thanks n Regards,
Nitin


________________________________
http://www.mindtree.com/email/disclaimer.html

RE: Text extraction - Any tutorials?

Posted by Nitin Shukla <Ni...@mindtree.com>.

Thanks Patrick for the response.

I was able to extract the text from the PDF.

Apart from extracting the text, is it possible to get the font information and the location or position of the text in the PDF layout using PDFBox? Any pointers are appreciated.

Thanks.

Regards,
Nitin


-----Original Message-----
From: Patrick Herber [mailto:patrick.herber@gmail.com] 
Sent: Thursday, November 19, 2009 5:00 PM
To: users@pdfbox.apache.org
Subject: Re: Text extraction - Any tutorials?

Hello,

You should also add the commons-logging-1.1.1.jar file to your classpath.

To extract text from a PDF file (given as an InputStream) I'm using the
following method (perhaps it is not the best one):

 
    private String parsePdfFile(InputStream stream) throws Exception {
        StringWriter output = new StringWriter(4096);
        PDDocument document = null;
        try {
            document = PDDocument.load(stream);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (Throwable e) {
                    log.warn("Could not parse PDF File since the document is encrypted");
                    return "";
                }
            }
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(Integer.MAX_VALUE);
            stripper.writeText(document, output);
            return output.toString();
        } catch (EOFException eofe) {
            log.warn("EOF Exception parsing PDF Document");
            return "";
        } catch (Exception e) {
            log.info("Exception parsing PDF document", e);
            return "";
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (Exception e) {
                    /* ignore */
                }
            }
        }
    }

Regards,
Patrick


Re: Text extraction - Any tutorials?

Posted by Stephen Haggai <st...@gmail.com>.

Thanks

On Fri, Nov 20, 2009 at 6:11 PM, Patrick Herber
<pa...@gmail.com> wrote:
> Hello
>
> See perhaps this answer to a similar question:
>
> http://www.mail-archive.com/pdfbox-dev@incubator.apache.org/msg01812.html
>
> Regards,
> Patrick
>

Re: Text extraction - Any tutorials?

Posted by Patrick Herber <pa...@gmail.com>.

Hello

See perhaps this answer to a similar question:

http://www.mail-archive.com/pdfbox-dev@incubator.apache.org/msg01812.html
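
The messages in question are logged at INFO level, and their layout (timestamp, class and method on one line, then the level) looks like the default java.util.logging format. Assuming that PDFBox's log calls end up in java.util.logging in this setup (commons-logging normally falls back to it when no other backend is configured), raising the level of the org.apache.pdfbox logger should hide them without affecting the extraction. A minimal sketch; the class name QuietPdfBoxLogging and the method quiet() are made up for illustration:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of PDFBox.
final class QuietPdfBoxLogging {

    // Keep a strong reference: java.util.logging may otherwise
    // garbage-collect the logger and forget the level set on it.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    private QuietPdfBoxLogging() {
    }

    // After this call, loggers under org.apache.pdfbox that do not set
    // a level of their own report only WARNING and above.
    static void quiet() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        quiet();
        Logger engine = Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        System.out.println("INFO enabled: " + engine.isLoggable(Level.INFO));
        System.out.println("WARNING enabled: " + engine.isLoggable(Level.WARNING));
    }
}
```

Calling QuietPdfBoxLogging.quiet() once, before PDDocument.load(...), should be enough. If a different logging backend such as log4j is actually in use, the level has to be changed in that backend's configuration instead.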

Regards,
Patrick


Re: Text extraction - Any tutorials?

Posted by Stephen Haggai <st...@gmail.com>.

Hello,

I too tried to extract text from a PDF file but I keep getting these
errors though the text seems to be fully extracted (not verified
though).

My code:

import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;

public class PDFTest {

    public static void main(String[] args) {
        PDDocument pd = null;
        BufferedWriter wr = null;
        try {
            File input = new File("C:\\invoice.pdf");
            File output = new File("C:\\SampleText.txt");
            pd = PDDocument.load(input);
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            //pd.save("new.pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            //String text = stripper.getText(pd);
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr);
            //System.out.println(text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the writer so that buffered text is flushed to the file,
            // then close the document.
            try { if (wr != null) wr.close(); } catch (IOException e) { /* ignore */ }
            try { if (pd != null) pd.close(); } catch (IOException e) { /* ignore */ }
        }
    }
}

The "error" or message that I get is:

--------------------Configuration: <Default>--------------------
5
false
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: g
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: rg
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: RG
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: n
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: re
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: W
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BI
20/11/2009 2:17:24 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
20/11/2009 2:17:26 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: m
20/11/2009 2:17:26 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: l
20/11/2009 2:17:26 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: h
20/11/2009 2:17:26 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: S

Process completed.

Is my code wrong somewhere?

Thanks,
Stephen

Re: Text extraction - Any tutorials?

Posted by Patrick Herber <pa...@gmail.com>.

Hello,

You should also add the commons-logging-1.1.1.jar file to your classpath.
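
The NoClassDefFoundError names org/apache/commons/logging/LogFactory, which is a commons-logging class. One way to confirm that the classpath is complete is to ask the JVM whether it can load that class. A minimal sketch; ClasspathCheck and isLoadable are names made up for illustration:

```java
// Hypothetical helper, not part of PDFBox or commons-logging.
final class ClasspathCheck {

    private ClasspathCheck() {
    }

    // Returns true if the named class can be found on the current classpath.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found, but something it depends on is missing or broken.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.commons.logging.LogFactory";
        System.out.println(name + " loadable: " + isLoadable(name));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Run it with the same -cp value that is passed to ExtractText (plus the directory containing ClasspathCheck.class); if it reports that the class is not loadable, the jar that provides it is still missing from that classpath.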

To extract text from a PDF file (given as an InputStream) I'm using the
following method (perhaps it is not the best one):

 
    private String parsePdfFile(InputStream stream) throws Exception {
        StringWriter output = new StringWriter(4096);
        PDDocument document = null;
        try {
            document = PDDocument.load(stream);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (Throwable e) {
                    log.warn("Could not parse PDF File since the document is encrypted");
                    return "";
                }
            }
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(Integer.MAX_VALUE);
            stripper.writeText(document, output);
            return output.toString();
        } catch (EOFException eofe) {
            log.warn("EOF Exception parsing PDF Document");
            return "";
        } catch (Exception e) {
            log.info("Exception parsing PDF document", e);
            return "";
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (Exception e) {
                    /* ignore */
                }
            }
        }
    }

Regards,
Patrick


Re: Text extraction - Any tutorials?

Posted by "Hesham G." <he...@gmail.com>.

There are a number of examples bundled with PDFBox.
Please check this:
http://www.java2s.com/Open-Source/Java-Document/PDF/PDFBox-0.7.3/org.pdfbox.examples.pdmodel.htm

Best regards,
Hesham