Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/03/22 00:56:28 UTC

[Nutch Wiki] Update of "RegexURLFiltersBenchs" by JeromeCharron

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/RegexURLFiltersBenchs

The comment on the change is:
Creation

New page:
== Introduction ==

This page provides performance benchmarks for the regular-expression-based URLFilters in Nutch (currently urlfilter-regex and urlfilter-automaton).
The '''urlfilter-regex''' plugin is based on the standard JDK [http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/package-summary.html java.util.regex] implementation, whereas the '''urlfilter-automaton''' plugin is based on [http://www.brics.dk/automaton/ dk.brics.automaton], a finite-state automata library for Java.

== Performance ==

=== Data set ===

These ''performance'' benchmarks were produced by collecting the results of each plugin's unit tests, run with the same rule file (`Benchmarks.rules`) and the same set of URLs to filter (`Benchmarks.urls`).
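
As an illustration of this kind of measurement, here is a minimal sketch of a timing harness built only on `java.util.regex`. It is not the plugins' actual unit test code. The class name `RegexFilterBench`, the inline rules, and the inline URLs are hypothetical stand-ins for `Benchmarks.rules` and `Benchmarks.urls`, and the rule semantics (a `+` or `-` sign followed by a regex, the first matching rule decides, unmatched URLs are rejected) are an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexFilterBench {
    /** One signed rule: '+' accepts and '-' rejects the URLs its pattern matches. */
    public static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Parses lines of the form "+regex" or "-regex"; blank lines and '#' comments are skipped. */
    public static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', trimmed.substring(1)));
        }
        return rules;
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    public static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // assumption: a URL that matches no rule is rejected
    }

    /** Filters every URL 'loops' times and returns the elapsed wall-clock time in ms. */
    public static long bench(List<Rule> rules, List<String> urls, int loops) {
        long start = System.nanoTime();
        for (int i = 0; i < loops; i++) {
            for (String url : urls) {
                filter(rules, url);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Hypothetical data; the real benchmark reads Benchmarks.rules and Benchmarks.urls.
        List<Rule> rules = parseRules(Arrays.asList(
                "# skip images",
                "-\\.(gif|jpg|png)$",
                "+^http://"));
        List<String> urls = Arrays.asList(
                "http://example.org/index.html",
                "http://example.org/logo.gif",
                "ftp://example.org/file.txt");
        for (int loops : new int[] {50, 100, 200, 400, 800}) {
            System.out.println(loops + " loops: " + bench(rules, urls, loops) + " ms");
        }
    }
}
```

Timings from such a harness depend on the JVM, its warm-up state, and the machine, so they are only comparable between runs made under the same conditions.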

=== Raw results ===

The following matrix shows the processing time, in ''ms'', of the '''urlfilter-regex''' and '''urlfilter-automaton''' plugins. Each column gives the number of loops, that is, how many times the whole `Benchmarks.urls` file was filtered.

||'''plugin / loops'''||'''50'''||'''100'''||'''200'''||'''400'''||'''800'''||
||'''regex'''||459||899||1917||3703||7873||
||'''automaton'''||335||419||657||1119||1997||
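
To make the figures easier to compare, the following sketch computes the regex/automaton time ratio for each column, plus the average extra time per additional loop between the first and last columns. The numbers are copied from the table above; the class name `SpeedupTable` and its helper methods are hypothetical and exist only for this illustration.

```java
public class SpeedupTable {
    // Loop counts and timings (ms) copied from the table above.
    public static final int[] LOOPS = {50, 100, 200, 400, 800};
    public static final int[] REGEX_MS = {459, 899, 1917, 3703, 7873};
    public static final int[] AUTOMATON_MS = {335, 419, 657, 1119, 1997};

    /** Ratio of regex time to automaton time; values above 1 mean automaton was faster. */
    public static double speedup(int regexMs, int automatonMs) {
        return (double) regexMs / automatonMs;
    }

    /** Average extra time, in ms, per additional loop between two measurements. */
    public static double marginalMsPerLoop(int[] ms, int[] loops, int from, int to) {
        return (double) (ms[to] - ms[from]) / (loops[to] - loops[from]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < LOOPS.length; i++) {
            System.out.printf("%d loops: regex/automaton = %.2f%n",
                    LOOPS[i], speedup(REGEX_MS[i], AUTOMATON_MS[i]));
        }
        int last = LOOPS.length - 1;
        System.out.printf("regex: %.2f ms per extra loop%n",
                marginalMsPerLoop(REGEX_MS, LOOPS, 0, last));
        System.out.printf("automaton: %.2f ms per extra loop%n",
                marginalMsPerLoop(AUTOMATON_MS, LOOPS, 0, last));
    }
}
```

The ratio grows from about 1.4 at 50 loops to about 3.9 at 800 loops. One possible reading is that the automaton plugin pays a larger one-time cost that is amortized over more loops, but the table alone does not establish this.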

=== Graphical representation ===

[http://frutch.free.fr/images/nutch/regexfilters-benchs.png]

=== Conclusion ===

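'''urlfilter-automaton''' supports fewer operators than '''urlfilter-regex''' but is considerably faster in these benchmarks: about 1.4 times faster at 50 loops and about 3.9 times faster at 800 loops. It can probably be useful in contexts where the rules do not need the missing operators.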

A next step could be to combine the two plugins, taking the best of each, by using the '''`urlfilter.order`''' configuration property.
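
The idea of combining filters can be sketched in plain Java as an ordered chain. This is an illustration only, not Nutch code: the class name `FilterChain` and the two lambda filters are hypothetical stand-ins for the plugins, and the sketch assumes that filters run in the listed order and that a `null` result rejects the URL and stops the chain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class FilterChain {
    /** Applies each filter in order; a null result rejects the URL and stops the chain. */
    public static String filter(List<UnaryOperator<String>> filters, String url) {
        String current = url;
        for (UnaryOperator<String> f : filters) {
            if (current == null) {
                return null;
            }
            current = f.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: a fast filter with simple rules runs first,
        // so the slower, more expressive filter only sees the URLs that survive it.
        UnaryOperator<String> fast = u -> u.startsWith("http://") ? u : null;
        UnaryOperator<String> expressive = u -> u.endsWith(".gif") ? null : u;
        List<UnaryOperator<String>> chain = Arrays.asList(fast, expressive);

        for (String url : Arrays.asList(
                "http://example.org/index.html",
                "http://example.org/logo.gif",
                "ftp://example.org/file.txt")) {
            System.out.println(url + " -> " + filter(chain, url));
        }
    }
}
```

Placing the cheaper filter first means the more expensive one is evaluated only for URLs the first filter has accepted, which is the benefit a suitable ordering would aim for.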