Posted to user@nutch.apache.org by aabbcc <we...@hotmail.it> on 2012/08/09 01:26:20 UTC

Nutch script to crawl a whole domain

Hi,

my problem is that I have a domain (e.g. http://*.apache.org) and I want to
crawl every document and page on this site and index them with Solr.
I was able to do it using the basic Nutch crawl command:

    bin/nutch crawl urls -solr http://localhost:8983/solr/

but the indexing part comes at the end of the process, so I have to wait
for the whole crawl to end before I can access my data.
I would like to create a script that cyclically crawls a certain number of
pages (for example 10000) and then indexes them.
In the Nutch tutorial wiki I found this:

    # generate a fetch list of (at most) the 1000 top-scoring URLs
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # pick up the segment directory that generate just created
    s2=`ls -d crawl/segments/2* | tail -1`
    echo $s2

    # fetch, parse, and merge the results back into the crawldb
    bin/nutch fetch $s2
    bin/nutch parse $s2
    bin/nutch updatedb crawl/crawldb $s2

but I don't know how to make it stop once it has crawled the entire
domain.
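One way to get per-round indexing plus a stopping condition is a loop like the sketch below. This is only a sketch under several assumptions: Nutch 1.x in local mode, seeds in ./urls, crawl data under ./crawl, a Solr instance at the URL shown, and the heuristic that once every known URL has been fetched, generate produces no new segment directory, which is what the loop uses to decide the domain is done. All paths, the Solr URL, and the -topN value are placeholders to adjust; check bin/nutch for the exact sub-commands and options in your version.

```shell
#!/bin/bash
# Incremental crawl-and-index loop (sketch; assumes Nutch 1.x local mode).
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr/}
CRAWL_DIR=${CRAWL_DIR:-crawl}
TOPN=${TOPN:-10000}

crawl_loop() {
  # Inject the seed list once; already-known URLs are skipped on re-runs.
  bin/nutch inject "$CRAWL_DIR/crawldb" urls

  while true; do
    # Remember the newest segment, then ask generate for a fetch list.
    last=$(ls -d "$CRAWL_DIR"/segments/2* 2>/dev/null | tail -1)
    bin/nutch generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN"
    segment=$(ls -d "$CRAWL_DIR"/segments/2* 2>/dev/null | tail -1)

    # If no new segment appeared, every due URL has been fetched:
    # the whole domain is crawled, so stop.
    if [ -z "$segment" ] || [ "$segment" = "$last" ]; then
      echo "No new URLs to fetch - crawl complete."
      break
    fi

    bin/nutch fetch "$segment"
    bin/nutch parse "$segment"
    bin/nutch updatedb "$CRAWL_DIR/crawldb" "$segment"

    # Index this round's documents into Solr right away, instead of
    # waiting for the whole crawl to finish.
    bin/nutch invertlinks "$CRAWL_DIR/linkdb" "$segment"
    bin/nutch solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" \
      -linkdb "$CRAWL_DIR/linkdb" "$segment"
  done
}

# crawl_loop   # uncomment to run
```

The key design point is that the stop test looks only at the segments directory, so it keeps working however many rounds the crawl needs.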

Thanks for your help.






--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-script-to-crawl-a-whole-domain-tp3999975.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch script to crawl a whole domain

Posted by Julien Nioche <li...@gmail.com>.
The version of Nutch in trunk has a useful crawl script in the bin dir
which performs all the typical steps of a crawl and sends the docs to Solr
for indexing at the end of each fetching round. The script is also more
robust and works in both local and deployed mode.
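For reference, that script is typically invoked with the seed dir, the crawl dir, the Solr URL, and a number of rounds, roughly as below; the exact argument order may differ between versions, so run the script without arguments to see its usage line.

```shell
# Sketch: run 10 generate/fetch/parse/update rounds, indexing into Solr
# after each round (argument order per the Nutch 1.x bin/crawl script;
# run "bin/crawl" with no arguments to confirm for your checkout).
bin/crawl urls crawl http://localhost:8983/solr/ 10
```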

HTH

Julien




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch script to crawl a whole domain

Posted by Niccolò Becchi <ni...@gmail.com>.
Hi, I think the best starting point could be:
http://wiki.apache.org/nutch/Nutch_0.9_Crawl_Script_Tutorial
You can modify the order of some of the steps.
