Posted to java-user@lucene.apache.org by Miroslaw Milewski <mi...@redshrimp.com> on 2004/07/28 23:23:47 UTC

pdfbox performance.

  Hi,

  I have a serious performance problem while extracting text from pdf.

  Here is the code (w/o try/catch blocks):

  File file = new File("test.pdf");
  FileInputStream reader = new FileInputStream(file);

  PDFParser parser = new PDFParser(reader);
  parser.parse();
  PDDocument pdDoc = parser.getPDDocument();

  PDFTextStripper stripper = new PDFTextStripper();
  String pdftext = stripper.getText(pdDoc);

  pdDoc.close();

  Now, the whole process takes:
  - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
  - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
  - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
  - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)

  Now, I can't really get the point here. Is this performance standard 
for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code, 
or maybe the pdf docs (text only, the last one with some UML diags)?

  I am writing a knowledge base system at the moment, and planned to do 
real-time text extraction and indexing (using Lucene). But this is not 
realistic, considering the extraction time.
  Then maybe it is a better idea to run the extraction and indexing once 
every 24 h, processing all the documents added during that period.
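
The batch idea above could be sketched roughly as follows. This is only a
minimal illustration, not code from the thread: BatchIndexer, documentAdded,
runBatch and the extractAndIndex callback are hypothetical names, and the
callback stands in for the actual PDFBox extraction and Lucene indexing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: queues added documents and handles them in batches. */
final class BatchIndexer {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    // Stand-in for the real work (PDFBox extraction plus Lucene indexing).
    private final Consumer<String> extractAndIndex;

    BatchIndexer(Consumer<String> extractAndIndex) {
        this.extractAndIndex = extractAndIndex;
    }

    /** Called when a document is added; only records the path and returns. */
    void documentAdded(String path) {
        pending.add(path);
    }

    /** Processes everything queued so far; returns the number of documents handled. */
    int runBatch() {
        int processed = 0;
        for (String path = pending.poll(); path != null; path = pending.poll()) {
            extractAndIndex.accept(path);
            processed++;
        }
        return processed;
    }

    /** Schedules runBatch() at a fixed period, for example every 24 hours. */
    void schedule(ScheduledExecutorService scheduler, long period, TimeUnit unit) {
        // Note: if a run throws, the scheduler suppresses all later runs.
        scheduler.scheduleAtFixedRate(() -> runBatch(), period, period, unit);
    }

    public static void main(String[] args) {
        List<String> indexed = new ArrayList<>();
        BatchIndexer indexer = new BatchIndexer(indexed::add);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        indexer.schedule(scheduler, 24, TimeUnit.HOURS);

        indexer.documentAdded("a.pdf");
        indexer.documentAdded("b.pdf");

        // For the demo, run one batch by hand instead of waiting a day.
        System.out.println("processed: " + indexer.runBatch());
        System.out.println("indexed:   " + indexed);
        scheduler.shutdownNow();
    }
}
```

Adding a document stays cheap because it only touches the queue; the slow
extraction happens whenever the scheduler fires.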

  TIA for any comments/suggestions.

-- 
	Miroslaw Milewski


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: pdfbox performance.

Posted by Tatu Saloranta <ta...@hypermall.net>.

On Wednesday 28 July 2004 15:44, Paul Smith wrote:
> The first thing that I would do is wrap the FileInputStream with a
> BufferedInputStream.
....
> You get a significant boost reading in from a buffer, particularly as the
> size of the file grows.

Benchmarking is good; whether there's any significant performance difference 
depends on how the app reads data from the stream. Most high-performance apps 
read straight into a local buffer, in which case BufferedInputStream adds 
nothing but the overhead of an extra buffer... :-)
You get a nice performance improvement only if most reads are done with the 
single-byte read method (read()), though, so there's always a chance, I 
guess.
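
To make that point concrete, here is a small sketch that counts how many
read calls actually reach the underlying stream, with and without a
BufferedInputStream, for byte-wise and for bulk reads. It uses an in-memory
stream instead of a file, and the class names (CountingInputStream,
BufferDemo) are made up for this illustration. It says nothing about how
PDFBox itself reads its input.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Counts how many read calls reach the wrapped (underlying) stream. */
final class CountingInputStream extends FilterInputStream {
    int calls = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }
}

final class BufferDemo {
    /** Drains 64 kB of data and returns the number of calls on the underlying stream. */
    static int underlyingCalls(boolean buffered, boolean byteWise) {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[64 * 1024]));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        try {
            if (byteWise) {
                // One byte per call, as a naive parser might do.
                while (in.read() != -1) { }
            } else {
                // Bulk reads into a local 8 kB buffer.
                byte[] buf = new byte[8192];
                while (in.read(buf, 0, buf.length) != -1) { }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        System.out.println("byte-wise, no buffer: " + underlyingCalls(false, true));
        System.out.println("byte-wise, buffered:  " + underlyingCalls(true, true));
        System.out.println("bulk, no buffer:      " + underlyingCalls(false, false));
        System.out.println("bulk, buffered:       " + underlyingCalls(true, false));
    }
}
```

Only the byte-wise, unbuffered case should show a call count in the tens of
thousands; the other three stay at a handful of calls, which is why wrapping
the stream helps only when the reader uses read() heavily.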

-+ Tatu +-

> Try that first, and then rebenchmark.
> Cheers
> Paul Smith

Re: pdfbox performance.

Posted by Miroslaw Milewski <mi...@redshrimp.com>.

Ben Litchfield wrote:

  > Different PDFs will exhibit different extraction speeds because of the way
  > that PDF documents are structured.

  Yes, I am aware of that - this is the reason I picked pdfs containing 
only text, arranged in one column. Anyway, there probably are lots of 
different factors to consider, so the whole benchmark thing was greatly 
simplified.
  All I actually wanted to find out is whether the speed of extraction I 
encountered is 'standard' considering the system, the API version and my 
code. But then, considering the PDF structure and other factors, there 
may be no definitive answer.

  > I assume you are using the latest version 0.6.6, could you give 0.6.5 a
  > try and see if you notice faster speeds.

  Oh, yes, I forgot to specify the version. It is 0.6.6. I'll give the 
previous one a try.

  thx,
-- 
	Miroslaw Milewski

Re: pdfbox performance.

Posted by Ben Litchfield <be...@csh.rit.edu>.


Different PDFs will exhibit different extraction speeds because of the way
that PDF documents are structured.

I assume you are using the latest version 0.6.6, could you give 0.6.5 a
try and see if you notice faster speeds.

Ben

On Thu, 29 Jul 2004, Miroslaw Milewski wrote:

> [...]

Re: pdfbox performance.

Posted by Miroslaw Milewski <mi...@redshrimp.com>.

Paul Smith wrote:

  > The first thing that I would do is wrap the FileInputStream with a
  > BufferedInputStream.
  > Change:
  > > FileInputStream reader = new FileInputStream(file);
  > To:
  > InputStream reader = new BufferedInputStream(new
  > FileInputStream(file));
  > You get a significant boost reading in from a buffer, particularly as
  > the size of the file grows. Try that first, and then rebenchmark.

  I tested both, here is the code:

// Imports for this snippet; the PDFBox classes (PDFParser, PDDocument,
// PDFTextStripper) have to be imported as well.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

File file = new File("test.pdf");
InputStream reader = null;

for(int i=1; i<=6; i++) {

   // reference point for all timings of this iteration
   long start = System.currentTimeMillis();
   long step01 = System.currentTimeMillis();
   String stream = null;

   // even iterations use the buffered stream, odd ones the plain stream
   if(i%2 == 0) {
     reader = new BufferedInputStream(new FileInputStream(file));
     stream = "buffered";
   }
   else {
     reader = new FileInputStream(file);
     stream = "no buffer";
   }

   PDFParser parser = null;
   PDDocument pdDoc = null;

   // step 1 -> 2: parsing
   parser = new PDFParser(reader);
   parser.parse();
   pdDoc = parser.getPDDocument();

   long step02 = System.currentTimeMillis();

   // step 2 -> 3: text extraction
   PDFTextStripper stripper = new PDFTextStripper();
   String pdftext = stripper.getText(pdDoc);

   long step03 = System.currentTimeMillis();

   pdDoc.close();

   long end = System.currentTimeMillis();

   System.out.println("iteration: " + i + " - " + stream);
   System.out.println("start: " + start);
   System.out.println("step01: " + (step01-start));
   System.out.println("step02: " + (step02-start));
   System.out.println("step03: " + (step03-start));
   System.out.println("end: " + (end-start));
}

  And below are the benchmarks for buffered and unbuffered readers (all 
times in ms, measured from the start of each iteration). The difference 
is not stunning. It seems to get better with time, but this is probably 
due to some VM optimisation. And I'll extract the text only once :-).

file: 9kB, text only;

iteration: 1 - no buffer
step01: 0; step02: 1492; step03: 13850; end: 13880

iteration: 2 - buffered
step01: 0; step02: 912; step03: 10245; end: 10265

iteration: 3 - no buffer
step01: 0; step02: 951; step03: 9924; end: 9944

iteration: 4 - buffered
step01: 0; step02: 842; step03: 10075; end: 10105

iteration: 5 - no buffer
step01: 0; step02: 831; step03: 9934; end: 9954

iteration: 6 - buffered
step01: 0; step02: 932; step03: 9944; end: 9965


file: 74 kB; text only

iteration: 1 - no buffer
step01: 0; step02: 4918; step03: 33959; end: 33989

iteration: 2 - buffered
step01: 0; step02: 4367; step03: 32367; end: 32407

iteration: 3 - no buffer
step01: 0; step02: 4306; step03: 30995; end: 31025

iteration: 4 - buffered
step01: 0; step02: 4296; step03: 30734; end: 30764

iteration: 5 - no buffer
step01: 0; step02: 4266; step03: 30754; end: 30784

iteration: 6 - buffered
step01: 0; step02: 4256; step03: 30634; end: 30664


file: 270 kB, text only

iteration: 1 - no buffer
step01: 0; step02: 30634; step03: 142225; end: 142265

iteration: 2 - buffered
step01: 0; step02: 29893; step03: 135354; end: 135394

iteration: 3 - no buffer
step01: 0; step02: 29553; step03: 134654; end: 134694

iteration: 4 - buffered
step01: 0; step02: 29613; step03: 134944; end: 134984

iteration: 5 - no buffer
step01: 0; step02: 29543; step03: 139070; end: 139110

iteration: 6 - buffered
step01: 0; step02: 32427; step03: 150457; end: 150487
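
For a quick summary of these figures, the sketch below averages the total
("end") times per file for the unbuffered and the buffered runs and prints
the relative gain. The numbers are copied from the output above; the class
and method names are made up for this example. With only three runs per
variant, and the very first run including VM warm-up, the result is a rough
indication at best: the 9 kB file comes out around 10% faster with
buffering, the 74 kB file around 2% faster, and the 270 kB file around 1%
slower.

```java
final class BenchmarkSummary {
    // Total times ("end", in ms) copied from the benchmark output above.
    static final long[] KB9_PLAIN = {13880, 9944, 9954};
    static final long[] KB9_BUFFERED = {10265, 10105, 9965};
    static final long[] KB74_PLAIN = {33989, 31025, 30784};
    static final long[] KB74_BUFFERED = {32407, 30764, 30664};
    static final long[] KB270_PLAIN = {142265, 134694, 139110};
    static final long[] KB270_BUFFERED = {135394, 134984, 150487};

    /** Arithmetic mean of the given timings. */
    static double mean(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    /** Percentage by which the buffered runs are faster (negative means slower). */
    static double gainPercent(long[] unbuffered, long[] buffered) {
        double u = mean(unbuffered);
        double b = mean(buffered);
        return 100.0 * (u - b) / u;
    }

    public static void main(String[] args) {
        System.out.printf("9 kB:   %.1f%%%n", gainPercent(KB9_PLAIN, KB9_BUFFERED));
        System.out.printf("74 kB:  %.1f%%%n", gainPercent(KB74_PLAIN, KB74_BUFFERED));
        System.out.printf("270 kB: %.1f%%%n", gainPercent(KB270_PLAIN, KB270_BUFFERED));
    }
}
```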

  Anyway, I suppose I made a wrong assumption while designing my app. I 
don't think I can get a performance boost of 90% or so. Thus the 
documents (at least the .pdfs) won't be extracted and indexed at the 
time of adding them to the knowledge base.
  Since I also have a db involved, I can keep the basic data there, and 
extract and index in the meantime - most likely using a different thread.
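
A minimal sketch of that "different thread" idea, assuming a single worker
thread is enough: the call that registers a document returns immediately,
and the slow extraction runs in the background. BackgroundExtractor and the
extractText function are hypothetical names; the function stands in for the
PDFBox extraction, and indexing the result with Lucene would happen where
the result is consumed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: runs text extraction on one background thread. */
final class BackgroundExtractor implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "pdf-extractor");
        t.setDaemon(true); // do not keep the JVM alive just for this worker
        return t;
    });
    // Stand-in for the real extraction code.
    private final Function<String, String> extractText;

    BackgroundExtractor(Function<String, String> extractText) {
        this.extractText = extractText;
    }

    /** Queues the document and returns at once; the future completes with the text. */
    CompletableFuture<String> documentAdded(String path) {
        return CompletableFuture.supplyAsync(() -> extractText.apply(path), worker);
    }

    /** Stops accepting new work; already queued documents are still processed. */
    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        try (BackgroundExtractor extractor = new BackgroundExtractor(
                path -> "text of " + path + " (thread " + Thread.currentThread().getName() + ")")) {
            CompletableFuture<String> result = extractor.documentAdded("test.pdf");
            System.out.println("caller continues on " + Thread.currentThread().getName());
            System.out.println(result.join()); // wait only for the demo output
        }
    }
}
```

The basic data can be written to the db before documentAdded is called, so
the document is visible right away and becomes searchable once the future
completes.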

  thx,
-- 
	Miroslaw Milewski


RE: pdfbox performance.

Posted by Paul Smith <ps...@aconex.com>.

The first thing that I would do is wrap the FileInputStream with a
BufferedInputStream.
Change: 
>   FileInputStream reader = new FileInputStream(file);
To:
InputStream reader = new BufferedInputStream(new FileInputStream(file));
You get a significant boost reading in from a buffer, particularly as the
size of the file grows.
Try that first, and then rebenchmark.
Cheers
Paul Smith