Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2009/02/06 14:45:59 UTC

[jira] Closed: (NUTCH-357) crawling simulation

     [ https://issues.apache.org/jira/browse/NUTCH-357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-357.
-----------------------------------

    Resolution: Won't Fix
      Assignee: Andrzej Bialecki 

> crawling simulation
> -------------------
>
>                 Key: NUTCH-357
>                 URL: https://issues.apache.org/jira/browse/NUTCH-357
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is somewhat difficult: first, it is not polite to re-crawl a set of pages again and again; second, it is difficult to catch the page that causes a problem.
> Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of the "web servers".
> To begin with, simulating very basic situations such as a page that points to itself, link chains, or internal links would already be very useful.
> Later on, however, simulating crawls against existing data collections such as TREC or a webgraph would be much more interesting, for instance to assess the quality of the Nutch OPIC implementation against the PageRank scores of the webgraph, or to evaluate crawling strategies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.