Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2009/06/03 12:18:02 UTC

Major speed improvements in package parsing

Hi,

Inspired by TIKA-236, I ran the following ad-hoc test:

$ time java -jar tika-0.3-standalone.jar --text lucene-2.0.0-src.zip > output-0.3.txt
real	0m29.844s
user	0m39.686s
sys	0m0.840s
$ time java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip > output-0.4.txt
real	0m12.587s
user	0m15.911s
sys	0m0.495s

This is especially impressive as the 0.4 version is able to extract
almost twice as much text from the archive:

$ du -h output-*
6.8M	output-0.3.txt
13M	output-0.4.txt

This speed increase is mostly the result of the TIKA-204 and TIKA-238
improvements.
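
For a rough sense of scale, the arithmetic behind these figures can be
sketched as below. This is only an illustration in Python, not part of
the original test: the names are made up for the example, and the sizes
are the rounded values printed by du -h, so the ratios are approximate.

```python
# Figures copied from the `time` and `du -h` output quoted above.
# All names here are hypothetical, chosen only for this illustration.
REAL_SECONDS = {"0.3": 29.844, "0.4": 12.587}   # wall-clock time
USER_SECONDS = {"0.3": 39.686, "0.4": 15.911}   # user CPU time
OUTPUT_MB = {"0.3": 6.8, "0.4": 13.0}           # rounded by du -h


def time_speedup(seconds):
    """How many times faster 0.4 is than 0.3 (old time / new time)."""
    return seconds["0.3"] / seconds["0.4"]


def output_growth(sizes):
    """How many times more text 0.4 extracted than 0.3."""
    return sizes["0.4"] / sizes["0.3"]


def throughput_gain(seconds, sizes):
    """Ratio of extracted MB per second, 0.4 relative to 0.3."""
    old = sizes["0.3"] / seconds["0.3"]
    new = sizes["0.4"] / seconds["0.4"]
    return new / old


if __name__ == "__main__":
    print(f"wall-clock speedup: {time_speedup(REAL_SECONDS):.2f}x")
    print(f"user CPU speedup:   {time_speedup(USER_SECONDS):.2f}x")
    print(f"output growth:      {output_growth(OUTPUT_MB):.2f}x")
    print(f"throughput gain:    {throughput_gain(REAL_SECONDS, OUTPUT_MB):.2f}x")
```

On these numbers the wall-clock time drops by a factor of roughly 2.4
while the output grows by a factor of roughly 1.9, so the amount of text
extracted per second of wall-clock time is about 4.5 times higher.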

Looking deeper at the output reveals some minor issues that I'll be
filing bugs for. However, in general the result of the extraction
seems pretty good.

BR,

Jukka Zitting

Re: Major speed improvements in package parsing

Posted by og...@yahoo.com.
Nice, thanks for the clarification! :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Major speed improvements in package parsing

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jun 3, 2009 at 1:33 PM,  <og...@yahoo.com> wrote:
> Nice, thanks for sharing!  You observed the same speed increase pattern
> after running this several times to avoid any cold/hot cache side-effects?

Yes. This wasn't a carefully crafted benchmark, but I did run a number
of similar tests using both the 0.3 and 0.4 versions and the same input
zip before taking the final measurements, so caching should not affect
the relative performance figures.
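
The warm-up approach described above could be automated along these
lines. This is a minimal sketch and not the procedure I actually
followed: the function name, the warm-up and run counts, and the
placeholder command are all assumptions, and standard output is
discarded here rather than redirected to a file as in the original runs.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, warmups=2, runs=5):
    """Run cmd `warmups` times untimed, then return the wall-clock
    seconds of each of `runs` timed executions."""
    for _ in range(warmups):
        # Warm-up runs populate OS file caches; their timings are ignored.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere. To repeat the
    # test, substitute the java -jar ... --text invocation as a list.
    cmd = [sys.executable, "-c", "pass"]
    timings = time_runs(cmd)
    print(f"min    {min(timings):.3f}s")
    print(f"median {statistics.median(timings):.3f}s")
```

Comparing the minimum or median over several timed runs, instead of a
single measurement, makes it less likely that a one-off cache miss
skews the comparison between the two versions.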

For the record, I ran the tests using Sun Java 1.6.0_07 on a quad-core
Dell Optiplex 755 desktop (Intel Core 2 Quad Q6600 @ 2.4GHz, 4GB RAM)
with Fedora Core 9 (Linux kernel 2.6.27.19-78.2.30.fc9.i686).

BR,

Jukka Zitting

Re: Major speed improvements in package parsing

Posted by og...@yahoo.com.
Nice, thanks for sharing!  You observed the same speed increase pattern after running this several times to avoid any cold/hot cache side-effects?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


