Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2016/06/20 02:13:05 UTC

[jira] [Updated] (NUTCH-2020) Establish Butch - the Continuous Benchmarking Evaluation for Nutch

     [ https://issues.apache.org/jira/browse/NUTCH-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2020:
----------------------------------------
    Summary: Establish Butch - the Continuous Benchmarking Evaluation for Nutch  (was: Estalbish Butch - the Continuous Benchmarking Evaluation for Nutch)

> Establish Butch - the Continuous Benchmarking Evaluation for Nutch
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2020
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2020
>             Project: Nutch
>          Issue Type: New Feature
>          Components: deployment
>    Affects Versions: 2.4, 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>
> I would like to initiate something I've provisionally called BUTCH with the aim of providing a continuous benchmarking evaluation for Nutch.
> I wrote a utility script called [nipt](https://github.com/lewismc/nipt/blob/master/bootstrap.sh) which essentially pulls the top 1M URLs from Alexa, does some simple reformatting using sed and provides us with a flat file containing those URLs.
> Loads of these are obviously porn related (and god knows what else), so I would not advise injecting this garbage into any crawldb that you own or administer.
> I want to augment the [Benchmark tool](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/Benchmark.java) to imitate injecting the script's output and fetching the URLs. Essentially this could run continuously, with us sending results to the dev@ list or making them available via some GUI.
> The first step is for me to code this up. The second step is for me to get Apache Infra to provide us with some nice machines (courtesy of Rackspace) which can host this for us.
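
For anyone wanting to picture the reformatting step described above, here is a rough sketch in Python. It is not the nipt script itself (which uses sed), and it rests on two assumptions: that the input is a CSV of "rank,domain" rows, and that prefixing each domain with "http://" is an acceptable way to form a seed URL. The function name ranked_csv_to_urls is made up for illustration.

```python
import csv
import io


def ranked_csv_to_urls(csv_text, scheme="http"):
    """Turn 'rank,domain' rows into a flat list of seed URLs.

    Assumption: each useful row has at least two columns, with the
    domain in the second one. Rows that do not fit are skipped.
    """
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # blank or malformed line
        domain = row[1].strip()
        if domain:
            urls.append(f"{scheme}://{domain}/")
    return urls


if __name__ == "__main__":
    # Tiny made-up sample in place of the real 1M-row list.
    sample = "1,example.com\n2,example.org\n"
    print("\n".join(ranked_csv_to_urls(sample)))
```

Writing the returned list out one URL per line gives the kind of flat file that can serve as a seed list for injection.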



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)