Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/08/22 07:54:14 UTC

[jira] Updated: (NUTCH-357) crawling simulation

     [ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]

Stefan Groschupf updated NUTCH-357:
-----------------------------------

    Attachment: protocol-simulation-pluginV1.patch

A very first preview of a plugin that helps to simulate crawls. This protocol plugin can replace the http protocol plugin and return predefined content during a fetch. To simulate custom scenarios, an interface named Simulator can be implemented with just one method.
The plugin comes with a very simple basic Simulator implementation; however, this already allows simulating the currently known Nutch scoring problems, such as pages pointing to themselves or link chains.
For more details see the javadoc; I plan to improve the javadoc with the help of a native speaker.
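
To illustrate the idea of a one-method simulator, here is a minimal sketch. It is not taken from the attached patch: the method name simulate(String), the classes SelfLinkSimulator, LinkChainSimulator and SimulationDemo, and the host sim.test are all assumptions made for this example. It only shows how the two scenarios mentioned above (a page pointing to itself, and a link chain) could be expressed as predefined content per URL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-method interface; the real Simulator in the patch may differ.
interface Simulator {
    /** Returns the HTML body to serve for the given URL. */
    String simulate(String url);
}

/** Scenario 1 (assumed): every page links only to itself. */
class SelfLinkSimulator implements Simulator {
    @Override
    public String simulate(String url) {
        return "<html><body><a href=\"" + url + "\">self</a></body></html>";
    }
}

/** Scenario 2 (assumed): page n links to page n+1 until the chain ends. */
class LinkChainSimulator implements Simulator {
    private final String base;   // URL prefix, e.g. "http://sim.test/page/"
    private final int length;    // number of pages in the chain

    LinkChainSimulator(String base, int length) {
        this.base = base;
        this.length = length;
    }

    @Override
    public String simulate(String url) {
        // The page index is the numeric suffix after the base prefix.
        int n = Integer.parseInt(url.substring(base.length()));
        if (n + 1 >= length) {
            return "<html><body>end of chain</body></html>";
        }
        return "<html><body><a href=\"" + base + (n + 1) + "\">next</a></body></html>";
    }
}

class SimulationDemo {
    /** Follows the first link of each simulated page, up to maxSteps fetches. */
    static List<String> walk(Simulator sim, String start, int maxSteps) {
        List<String> visited = new ArrayList<>();
        String url = start;
        for (int i = 0; i < maxSteps && url != null; i++) {
            visited.add(url);
            url = firstLink(sim.simulate(url));
        }
        return visited;
    }

    /** Naive href extraction; returns null if the page has no link. */
    static String firstLink(String html) {
        String marker = "href=\"";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        return html.substring(start, html.indexOf('"', start));
    }

    public static void main(String[] args) {
        String base = "http://sim.test/page/";
        System.out.println(walk(new LinkChainSimulator(base, 3), base + "0", 10));
        System.out.println(walk(new SelfLinkSimulator(), base + "0", 4));
    }
}
```

In this sketch the link chain stops by itself once the last page is reached, while the self-linking page is fetched repeatedly until the step limit cuts it off. That is the kind of loop a scoring test would want to reproduce without hitting a real web server.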

Feedback is welcome. 

> crawling simulation
> -------------------
>
>                 Key: NUTCH-357
>                 URL: http://issues.apache.org/jira/browse/NUTCH-357
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>             Fix For: 0.9.0
>
>         Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is difficult: first, it is not polite to re-crawl a set of pages again and again; second, it is difficult to catch the page that causes a problem.
> Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of "web servers".
> To begin with, simulating very basic situations, such as a page that points to itself, link chains, or internal links, would already be very useful.
> Later on, however, simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to measure the quality of the Nutch OPIC implementation against PageRank scores of the webgraph, or to evaluate crawling strategies.


Re: Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation

Posted by Stefan Groschupf <sg...@101tec.com>.
> One must also remember that proper junit testing can be used to
> verify functionality.
> There's a lot of code currently that is not guarded by unit tests and
> I hereby invite everybody to participate in this endless effort and
> make Nutch unit tests better ;)
I completely agree!!!
Nutch has more bugs than ever before, since most of the 0.8 code was
developed without tests.

Stefan

Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation

Posted by Sami Siren <ss...@gmail.com>.
Stefan Groschupf (JIRA) wrote:
> 
> A very first preview of a plugin that helps to simulate crawls. This
> protocol plugin can replace the http protocol plugin and return
> predefined content during a fetch. To simulate custom scenarios, an
> interface named Simulator can be implemented with just one method.
> The plugin comes with a very simple basic Simulator implementation;
> however, this already allows simulating the currently known Nutch
> scoring problems, such as pages pointing to themselves or link chains.
> For more details see the javadoc; I plan to improve the javadoc with
> the help of a native speaker.
> 
> Feedback is welcome.

One must also remember that proper junit testing can be used to verify 
functionality.

There's a lot of code currently that is not guarded by unit tests, and I
hereby invite everybody to participate in this endless effort and make
Nutch unit tests better ;)

--
  Sami Siren