Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/08/22 07:54:14 UTC
[jira] Updated: (NUTCH-357) crawling simulation
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]
Stefan Groschupf updated NUTCH-357:
-----------------------------------
Attachment: protocol-simulation-pluginV1.patch
A very first preview of a plugin that helps to simulate crawls. This protocol plugin can be used to replace the http protocol plugin and return defined content during a fetch. To simulate custom scenarios, an interface named Simulator can be implemented; it has just one method.
The plugin comes with a very simple basic Simulator implementation, but this already makes it possible to simulate the Nutch scoring problems known today, such as pages pointing to themselves or link chains.
For more details see the Javadoc; I plan to improve the Javadoc with the help of a native speaker.
Feedback is welcome.
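To illustrate the idea described above, here is a minimal, self-contained Java sketch of a one-method simulator covering the two scenarios mentioned (a page that points to itself, and a link chain). It is not taken from the attached patch: the method name outlinks, the classes SelfLinkSimulator and LinkChainSimulator, the crawl helper, and the sim.example URLs are all assumptions made for illustration, and the real Simulator interface in the plugin may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

public class SimulatorSketch {

    /** Hypothetical one-method interface; the Simulator in the patch may differ. */
    public interface Simulator {
        /** Returns the outlinks the simulated "web server" reports for a URL. */
        List<String> outlinks(String url);
    }

    /** Scenario 1 (assumed implementation): every page links only to itself. */
    public static class SelfLinkSimulator implements Simulator {
        @Override
        public List<String> outlinks(String url) {
            List<String> links = new ArrayList<>();
            links.add(url);
            return links;
        }
    }

    /** Scenario 2 (assumed implementation): prefix0 -> prefix1 -> ... -> prefix(length-1). */
    public static class LinkChainSimulator implements Simulator {
        private final String prefix;
        private final int length;

        public LinkChainSimulator(String prefix, int length) {
            this.prefix = prefix;
            this.length = length;
        }

        @Override
        public List<String> outlinks(String url) {
            List<String> links = new ArrayList<>();
            if (!url.startsWith(prefix)) {
                return links; // unknown page: no outlinks
            }
            int index;
            try {
                index = Integer.parseInt(url.substring(prefix.length()));
            } catch (NumberFormatException e) {
                return links; // not a chain page: no outlinks
            }
            if (index >= 0 && index + 1 < length) {
                links.add(prefix + (index + 1)); // link to the next page in the chain
            }
            return links; // the last page of the chain has no outlinks
        }
    }

    /** Toy "crawl": follows the first outlink for at most maxHops fetches. */
    public static List<String> crawl(Simulator sim, String start, int maxHops) {
        List<String> visited = new ArrayList<>();
        String current = start;
        for (int i = 0; i < maxHops && current != null; i++) {
            visited.add(current);
            List<String> out = sim.outlinks(current);
            current = out.isEmpty() ? null : out.get(0);
        }
        return visited;
    }

    public static void main(String[] args) {
        String prefix = "http://sim.example/page/"; // placeholder host
        System.out.println(crawl(new LinkChainSimulator(prefix, 3), prefix + "0", 10));
        System.out.println(crawl(new SelfLinkSimulator(), "http://sim.example/self", 4));
    }
}
```

The toy crawl loop only stands in for the fetcher; in a real setup the protocol plugin would hand the simulated content back to Nutch, and the scoring behaviour would be inspected afterwards.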
> crawling simulation
> -------------------
>
> Key: NUTCH-357
> URL: http://issues.apache.org/jira/browse/NUTCH-357
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.8.1, 0.9.0
> Reporter: Stefan Groschupf
> Fix For: 0.9.0
>
> Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is difficult: first, it is not polite to re-crawl a set of pages again and again; second, it is hard to catch the page that causes a problem.
> Therefore it would be very useful to have a testbed to simulate crawls where we can control the response of "web servers".
> To begin with, simulating very basic situations, such as a page that points to itself, link chains, or internal links, would already be very useful.
> Later on, however, simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the Nutch OPIC implementation against the PageRank scores of the webgraph, or to evaluate crawling strategies.
Re: Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation
Posted by Stefan Groschupf <sg...@101tec.com>.
> One must also remember that proper JUnit testing can be used to
> verify functionality.
> There's a lot of code currently that is not guarded by unit tests and
> I hereby invite everybody to participate in this endless effort and
> make Nutch unit tests better ;)
I completely agree!!!
Nutch has more bugs than ever before, since most of the 0.8 code was
developed without tests.
Stefan
Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation
Posted by Sami Siren <ss...@gmail.com>.
Stefan Groschupf (JIRA) wrote:
>
> A very first preview of a plugin that helps to simulate crawls. This
> protocol plugin can be used to replace the http protocol plugin and
> return defined content during a fetch. To simulate custom scenarios,
> an interface named Simulator can be implemented; it has just one
> method. The plugin comes with a very simple basic Simulator
> implementation, but this already makes it possible to simulate the
> Nutch scoring problems known today, such as pages pointing to
> themselves or link chains. For more details see the Javadoc; I plan
> to improve the Javadoc with the help of a native speaker.
>
> Feedback is welcome.
One must also remember that proper JUnit testing can be used to verify
functionality.
There's a lot of code currently that is not guarded by unit tests and I
hereby invite everybody to participate in this endless effort and make
Nutch unit tests better ;)
--
Sami Siren