You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Miroslaw Milewski <mi...@redshrimp.com> on 2004/07/28 23:23:47 UTC
pdfbox performance.
Hi,
I have a serious performance problem while extracting text from pdf.
Here is the code (w/o try/catch blocks):
File file = new File("test.pdf");
FileInputStream reader = new FileInputStream(file);
PDFParser parser = new PDFParser(reader);
parser.parse();
PDDocument pdDoc = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
String pdftext = stripper.getText(pdDoc);
pdDoc.close();
Now, the whole process takes:
- 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
- 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
- 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
- 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
Now, I can't really get the point here. Is this performance standard
for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
or maybe the pdf docs (text only, the last one with some UML diags.)
I am writing a knowledge base system at the moment, and planned to do
real-time text extraction and indexing (using Lucene.) But this is not
realistic, considering the extraction thime.
Then maybe it is a better idea to run the extraction and indexing once
every 24 h, processing all the documents added during that period.
TIA for any comments/suggestions.
--
Miroslaw Milewski
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: pdfbox performance.
Posted by Tatu Saloranta <ta...@hypermall.net>.
On Wednesday 28 July 2004 15:44, Paul Smith wrote:
> The first thing that I would do is wrap the FileInputStream with a
> BufferedInputStream.
....
> You get a significant boost reading in from a buffer, particularly as the
> size of the file grows.
Benchmarking is good; whether there's any significant performance difference
depends on how the app reads data from the stream. Most high-performance apps
read straight to a local buffer, in which case BufferedInputStream offers
nothing more than the buffer overhead... :-)
You do get nice performance improvement only if most reads are done using
single byte read methods (read()), though, so there's always a chance I
guess.
-+ Tatu +-
> Try that first, and then rebenchmark.
> Cheers
> Paul Smith
>
> > -----Original Message-----
> > From: Miroslaw Milewski [mailto:miro@redshrimp.com]
> > Sent: Thursday, July 29, 2004 7:24 AM
> > To: lucene-user@jakarta.apache.org
> > Subject: pdfbox performance.
> >
> >
> > Hi,
> >
> > I have a serious performance problem while extracting text from pdf.
> >
> > Here is the code (w/o try/catch blocks):
> >
> > File file = new File("test.pdf");
> > FileInputStream reader = new FileInputStream(file);
> >
> > PDFParser parser = new PDFParser(reader);
> > parser.parse();
> > PDDocument pdDoc = parser.getPDDocument();
> >
> > PDFTextStripper stripper = new PDFTextStripper();
> > String pdftext = stripper.getText(pdDoc);
> >
> > pdDoc.close();
> >
> > Now, the whole process takes:
> > - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
> > - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
> > - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
> > - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
> >
> > Now, I can't really get the point here. Is this performance standard
> > for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
> > or maybe the pdf docs (text only, the last one with some UML diags.)
> >
> > I am writing a knowledge base system at the moment, and planned to do
> > real-time text extraction and indexing (using Lucene.) But this is not
> > realistic, considering the extraction thime.
> > Then maybe it is a better idea to run the extraction and indexing once
> > every 24 h, processing all the documents added during that period.
> >
> > TIA for any comments/suggestions.
> >
> > --
> > Miroslaw Milewski
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: pdfbox performance.
Posted by Miroslaw Milewski <mi...@redshrimp.com>.
Ben Litchfield wrote:
> Different PDFs will exhibit different extraction speeds because of
the way
> that PDF documents are structured.
Yes, I am aware of that - this is the reason I picked pdfs containting
only text, arranged in one column. Anwyay, there probably are lots of
different factors to consider, so the whole benchmark thing was greatly
simplified.
All wanted to actually find out is whether the speed of extraction I
encountered is 'standard' considering the system, the API version and my
code. But then, considering the PDF structure and other factors, there
may be no definitive answer.
> I assume you are using the latest version 0.6.6, could you give 0.6.5 a
> try and see if you notice faster speeds.
Oh, yes, I forgot to specify the version. It is 0.6.6. I'll give the
previous one a try.
thx,
--
Miroslaw Milewski
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: pdfbox performance.
Posted by Ben Litchfield <be...@csh.rit.edu>.
Different PDFs will exhibit different extraction speeds because of the way
that PDF documents are structured.
I assume you are using the latest version 0.6.6, could you give 0.6.5 a
try and see if you notice faster speeds.
Ben
On Thu, 29 Jul 2004, Miroslaw Milewski wrote:
> Paul Smith wrote:
>
> > The first thing that I would do is wrap the FileInputStream with a
> > BufferedInputStream.
> > Change:
> > > FileInputStream reader = new FileInputStream(file);
> > To:
> > InputStream reader = new BufferedInputStream(new
> > FileInputStream(file));
> > You get a significant boost reading in from a buffer, particularly as
> > the size of the file grows. Try that first, and then rebenchmark.
>
> I tested both, here is the code:
>
> File file = new File("test.pdf");
> InputStream reader = null;
>
> for(int i=1; i<=6; i++) {
>
> long step01 = Calendar.getInstance().getTimeInMillis();
> String stream = null;
>
> if(i%2 == 0) {
> reader = new BufferedInputStream(new FileInputStream(file));
> stream = "buffered";
> }
> else {
> reader = new FileInputStream(file);
> stream = "no buffer";
> }
>
> PDFParser parser = null;
> PDDocument pdDoc = null;
>
> parser = new PDFParser(reader);
> parser.parse();
> pdDoc = parser.getPDDocument();
>
> long step02 = Calendar.getInstance().getTimeInMillis();
>
> PDFTextStripper stripper = new PDFTextStripper();
> tring pdftext = stripper.getText(pdDoc);
>
> long step03 = Calendar.getInstance().getTimeInMillis();
>
> pdDoc.close();
>
> long end = Calendar.getInstance().getTimeInMillis();
>
> System.out.println("iteration: " + i + " - " + stream);
> System.out.println("start: " + start);
> System.out.println("step01: " + (step01-start));
> System.out.println("step02: " + (step02-start));
> System.out.println("step03: " + (step03-start));
> System.out.println("end: " + (end-start));
> }
>
> And below are the benchmarks for buffered and unbuffered readers. The
> difference is not stunning. It seems to get better with time, but this
> is prably due to some VM optimisation. And I'll extract the text only
> once :-).
>
> file: 9kB, text only;
>
> iteration: 1 - no buffer
> step01: 0; step02: 1492; step03: 13850; end: 13880
>
> iteration: 2 - buffered
> step01: 0; step02: 912; step03: 10245; end: 10265
>
> iteration: 3 - no buffer
> step01: 0; step02: 951 ;step03: 9924; end: 9944
>
> iteration: 4 - buffered
> step01: 0; step02: 842; step03: 10075; end: 10105
>
> iteration: 5 - no buffer
> step01: 0; step02: 831; step03: 9934; end: 9954
>
> iteration: 6 - buffered
> step01: 0; step02: 932; step03: 9944; end: 9965
>
>
> file: 74 kB; text only
>
> iteration: 1 - no buffer
> step01: 0; step02: 4918; step03: 33959; end: 33989
>
> iteration: 2 - buffered
> step01: 0; step02: 4367; step03: 32367; end: 32407
>
> iteration: 3 - no buffer
> step01: 0; step02: 4306; step03: 30995; end: 31025
>
> iteration: 4 - buffered
> step01: 0; step02: 4296; step03: 30734; end: 30764
>
> iteration: 5 - no buffer
> step01: 0; step02: 4266; step03: 30754; end: 30784
>
> iteration: 6 - buffered
> step01: 0; step02: 4256; step03: 30634; end: 30664
>
>
> file: 270 kB, text only
>
> iteration: 1 - no buffer
> step01: 0; step02: 30634; step03: 142225; end: 142265
>
> iteration: 2 - buffered
> step01: 0; step02: 29893; step03: 135354; end: 135394
>
> iteration: 3 - no buffer
> step01: 0; step02: 29553; step03: 134654; end: 134694
>
> iteration: 4 - buffered
> step01: 0; step02: 29613; step03: 134944; end: 134984
>
> iteration: 5 - no buffer
> step01: 0; step02: 29543; step03: 139070; end: 139110
>
> iteration: 6 - buffered
> step01: 0; step02: 32427; step03: 150457; end: 150487
>
> Anyway, I suppose I made a wrong assumption while designing my app. I
> don't think I can get a performance boost of 90% or so. Thus the
> documents (at least the .pdfs) won't be extracted and indexed at the
> time of adding them to the knowledge base.
> Since I also have a db involved, I can keep the basic data there, and
> extract and index in the meantime - most likely using a different thread.
>
> thx,
> --
> Miroslaw Milewski
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: pdfbox performance.
Posted by Miroslaw Milewski <mi...@redshrimp.com>.
Paul Smith wrote:
> The first thing that I would do is wrap the FileInputStream with a
> BufferedInputStream.
> Change:
> > FileInputStream reader = new FileInputStream(file);
> To:
> InputStream reader = new BufferedInputStream(new
> FileInputStream(file));
> You get a significant boost reading in from a buffer, particularly as
> the size of the file grows. Try that first, and then rebenchmark.
I tested both, here is the code:
File file = new File("test.pdf");
InputStream reader = null;
for(int i=1; i<=6; i++) {
long step01 = Calendar.getInstance().getTimeInMillis();
String stream = null;
if(i%2 == 0) {
reader = new BufferedInputStream(new FileInputStream(file));
stream = "buffered";
}
else {
reader = new FileInputStream(file);
stream = "no buffer";
}
PDFParser parser = null;
PDDocument pdDoc = null;
parser = new PDFParser(reader);
parser.parse();
pdDoc = parser.getPDDocument();
long step02 = Calendar.getInstance().getTimeInMillis();
PDFTextStripper stripper = new PDFTextStripper();
tring pdftext = stripper.getText(pdDoc);
long step03 = Calendar.getInstance().getTimeInMillis();
pdDoc.close();
long end = Calendar.getInstance().getTimeInMillis();
System.out.println("iteration: " + i + " - " + stream);
System.out.println("start: " + start);
System.out.println("step01: " + (step01-start));
System.out.println("step02: " + (step02-start));
System.out.println("step03: " + (step03-start));
System.out.println("end: " + (end-start));
}
And below are the benchmarks for buffered and unbuffered readers. The
difference is not stunning. It seems to get better with time, but this
is prably due to some VM optimisation. And I'll extract the text only
once :-).
file: 9kB, text only;
iteration: 1 - no buffer
step01: 0; step02: 1492; step03: 13850; end: 13880
iteration: 2 - buffered
step01: 0; step02: 912; step03: 10245; end: 10265
iteration: 3 - no buffer
step01: 0; step02: 951 ;step03: 9924; end: 9944
iteration: 4 - buffered
step01: 0; step02: 842; step03: 10075; end: 10105
iteration: 5 - no buffer
step01: 0; step02: 831; step03: 9934; end: 9954
iteration: 6 - buffered
step01: 0; step02: 932; step03: 9944; end: 9965
file: 74 kB; text only
iteration: 1 - no buffer
step01: 0; step02: 4918; step03: 33959; end: 33989
iteration: 2 - buffered
step01: 0; step02: 4367; step03: 32367; end: 32407
iteration: 3 - no buffer
step01: 0; step02: 4306; step03: 30995; end: 31025
iteration: 4 - buffered
step01: 0; step02: 4296; step03: 30734; end: 30764
iteration: 5 - no buffer
step01: 0; step02: 4266; step03: 30754; end: 30784
iteration: 6 - buffered
step01: 0; step02: 4256; step03: 30634; end: 30664
file: 270 kB, text only
iteration: 1 - no buffer
step01: 0; step02: 30634; step03: 142225; end: 142265
iteration: 2 - buffered
step01: 0; step02: 29893; step03: 135354; end: 135394
iteration: 3 - no buffer
step01: 0; step02: 29553; step03: 134654; end: 134694
iteration: 4 - buffered
step01: 0; step02: 29613; step03: 134944; end: 134984
iteration: 5 - no buffer
step01: 0; step02: 29543; step03: 139070; end: 139110
iteration: 6 - buffered
step01: 0; step02: 32427; step03: 150457; end: 150487
Anyway, I suppose I made a wrong assumption while designing my app. I
don't think I can get a performance boost of 90% or so. Thus the
documents (at least the .pdfs) won't be extracted and indexed at the
time of adding them to the knowledge base.
Since I also have a db involved, I can keep the basic data there, and
extract and index in the meantime - most likely using a different thread.
thx,
--
Miroslaw Milewski
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
RE: pdfbox performance.
Posted by Paul Smith <ps...@aconex.com>.
The first thing that I would do is wrap the FileInputStream with a
BufferedInputStream.
Change:
> FileInputStream reader = new FileInputStream(file);
To:
InputStream reader = new BufferedInputStream(new FileInputStream(file));
You get a significant boost reading in from a buffer, particularly as the
size of the file grows.
Try that first, and then rebenchmark.
Cheers
Paul Smith
> -----Original Message-----
> From: Miroslaw Milewski [mailto:miro@redshrimp.com]
> Sent: Thursday, July 29, 2004 7:24 AM
> To: lucene-user@jakarta.apache.org
> Subject: pdfbox performance.
>
>
> Hi,
>
> I have a serious performance problem while extracting text from pdf.
>
> Here is the code (w/o try/catch blocks):
>
> File file = new File("test.pdf");
> FileInputStream reader = new FileInputStream(file);
>
> PDFParser parser = new PDFParser(reader);
> parser.parse();
> PDDocument pdDoc = parser.getPDDocument();
>
> PDFTextStripper stripper = new PDFTextStripper();
> String pdftext = stripper.getText(pdDoc);
>
> pdDoc.close();
>
> Now, the whole process takes:
> - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
> - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
> - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
> - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
>
> Now, I can't really get the point here. Is this performance standard
> for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
> or maybe the pdf docs (text only, the last one with some UML diags.)
>
> I am writing a knowledge base system at the moment, and planned to do
> real-time text extraction and indexing (using Lucene.) But this is not
> realistic, considering the extraction thime.
> Then maybe it is a better idea to run the extraction and indexing once
> every 24 h, processing all the documents added during that period.
>
> TIA for any comments/suggestions.
>
> --
> Miroslaw Milewski
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org