Posted to user@nutch.apache.org by Matteo Simoncini <si...@gmail.com> on 2012/08/30 00:13:43 UTC
Crawl a whole domain with indexing
Hi,
I'm using Nutch version 1.5. My goal is to crawl every URL in a domain.
I also want to index everything with Solr, but instead of doing that at
the end of the process, since it is a very big domain, I would like to run
the Solr indexing command every X URLs (for example, every 10000 URLs).
So far, this script is all I have managed to put together:
#!/bin/bash
# inject the initial seed into the crawlDB
bin/nutch inject test/crawldb urls

error=0

# loop until there are no more URLs to generate
while [ $error -ne 1 ]
do
  # generate a fetch list of up to 10000 URLs
  echo "[ Script ] Starting generating phase"
  bin/nutch generate test/crawldb test/segments -topN 10000
  if [ $? -ne 0 ]
  then
    echo "[ Script ] Stopping: No more URLs to fetch."
    error=1
    break
  fi
  segment=$(ls -d test/segments/2* | tail -1)

  # fetching phase
  echo "[ Script ] Starting fetching phase"
  bin/nutch fetch "$segment" -threads 20
  if [ $? -ne 0 ]
  then
    echo "[ Script ] Fetch $segment failed. Deleting it."
    rm -rf "$segment"
    continue
  fi

  # parsing phase
  echo "[ Script ] Starting parsing phase"
  bin/nutch parse "$segment"

  # updateDB phase
  echo "[ Script ] Starting updateDB phase"
  bin/nutch updatedb test/crawldb "$segment"

  # indexing with Solr
  bin/nutch invertlinks test/linkdb -dir test/segments
  bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb \
      -linkdb test/linkdb test/segments/*
done
but it doesn't seem to work. In fact, crawling with the one-shot command:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 20
and testing on the apache.org domain, I end up with more URLs than with the
script (command: 1676, script: 1658).
Can anyone tell me what's wrong with my script? Is there a better way to
solve my problem?
Thanks,
Matteo
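One detail worth noting about the loop above: each pass re-submits every segment under test/segments to Solr. A possible variant (a sketch only, not tested against this setup; paths and the Solr URL match the script) indexes just the segment fetched in that iteration:

```shell
# Variant of the indexing step from the script above: index only the
# newest segment instead of re-submitting all of test/segments/* each pass.
segment=$(ls -d test/segments/2* | tail -1)   # newest timestamped segment dir
bin/nutch invertlinks test/linkdb -dir test/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb \
    -linkdb test/linkdb "$segment"
```

This works because Nutch names segment directories with a timestamp, so the lexically last entry from ls is also the most recent one.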
RE: Crawl a whole domain with indexing
Posted by Markus Jelsma <ma...@openindex.io>.
There is nothing wrong with your script, but how many URLs are generated depends on your data. The difference between your script and the crawl command (the two are almost identical) could also be explained by the state of your CrawlDb.
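To check whether CrawlDb state explains the gap, one option is Nutch's readdb tool, which prints a total URL count and a per-status breakdown. A minimal sketch (the crawl/crawldb path is a guess at where the crawl command wrote its db; adjust to your setup):

```shell
# Inspect both CrawlDbs to see where the URL counts diverge.
# 'readdb -stats' prints a "TOTAL urls" line plus counts per fetch status.
bin/nutch readdb test/crawldb -stats         # CrawlDb built by the script
bin/nutch readdb crawl/crawldb -stats        # CrawlDb from 'bin/nutch crawl' (path is an assumption)

# Pull out just the total, e.g. to script a comparison of the two:
total=$(bin/nutch readdb test/crawldb -stats 2>/dev/null | awk '/TOTAL urls/ {print $NF}')
echo "script CrawlDb total: $total"
```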
-----Original message-----
> From:Matteo Simoncini <si...@gmail.com>
> Sent: Thu 30-Aug-2012 00:16
> To: user@nutch.apache.org
> Subject: Crawl a whole domain with indicization