Posted to user@nutch.apache.org by Matteo Simoncini <si...@gmail.com> on 2012/08/30 00:13:43 UTC

Crawl a whole domain with indexing

Hi,

I'm using Nutch version 1.5. My goal is to crawl every URL in a domain.
I also want to index everything with Solr but, since it is a very big
domain, instead of indexing at the end of the process I would like to run
the Solr indexing command every X URLs (for example, every 10000 URLs).
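The every-N-URLs trigger could be sketched as below. This is only an illustration of the counting logic: the running total would really come from the Nutch crawl cycle (e.g. from log output or `bin/nutch readdb test/crawldb -stats`), and the actual `bin/nutch solrindex` call is replaced by an echo, so the numbers here are mocked.

```shell
#!/bin/bash
# Sketch: run Solr indexing only after every N newly fetched URLs.
# The fetch counts are mocked; in a real crawl they would come from Nutch.

INDEX_EVERY=10000   # index every 10000 URLs, as in the question
total=0             # running count of fetched URLs
indexed_at=0        # count at which we last indexed

maybe_index () {
    # trigger indexing once another INDEX_EVERY URLs have accumulated
    if [ $(( total - indexed_at )) -ge $INDEX_EVERY ]; then
        echo "indexing at $total URLs"   # stand-in for: bin/nutch solrindex ...
        indexed_at=$total
    fi
}

# mock loop: pretend each generate/fetch round fetched 3500 URLs
for round in 1 2 3 4 5 6; do
    total=$(( total + 3500 ))
    maybe_index
done
```

With these mocked numbers, indexing fires on the rounds where the total passes another multiple of 10000 URLs since the last index run.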

So far, this script is all I have been able to come up with:

#!/bin/bash
# inject the initial seed into the crawlDB
bin/nutch inject test/crawldb urls

error=0

# loop until generation produces no more URLs
while [ $error -ne 1 ]
do
    # generate a fetch list of at most 10000 URLs
    echo "[ Script ] Starting generating phase"
    bin/nutch generate test/crawldb test/segments -topN 10000
    if [ $? -ne 0 ]
    then
        echo "[ Script ] Stopping: no more URLs to fetch."
        error=1
        break
    fi

    # the newest segment is the one just generated
    segment=`ls -d test/segments/2* | tail -1`

    # fetching phase
    echo "[ Script ] Starting fetching phase"
    bin/nutch fetch $segment -threads 20
    if [ $? -ne 0 ]
    then
        echo "[ Script ] Fetch $segment failed. Deleting it."
        rm -rf $segment
        continue
    fi

    # parsing phase
    echo "[ Script ] Starting parsing phase"
    bin/nutch parse $segment

    # updateDB phase
    echo "[ Script ] Starting updateDB phase"
    bin/nutch updatedb test/crawldb $segment

    # indexing with Solr
    bin/nutch invertlinks test/linkdb -dir test/segments
    bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
done
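A note on the `ls … | tail -1` line in the script: it works because Nutch names each segment directory with a yyyyMMddHHmmss timestamp, so lexicographic order matches chronological order. A standalone illustration, using mocked segment names instead of a real test/segments directory:

```shell
#!/bin/bash
# Nutch segment directories are named after a timestamp (yyyyMMddHHmmss),
# so sorting them lexicographically (as ls does) also sorts them by age,
# and `tail -1` selects the most recently generated segment.
# Mocked here with sample directory names.
mkdir -p demo_segments/20120829231544 \
         demo_segments/20120830001201 \
         demo_segments/20120830002317

newest=$(ls -d demo_segments/2* | tail -1)
echo "newest segment: $newest"

rm -rf demo_segments   # clean up the mock directories
```

This only holds as long as the segment names keep that fixed-width timestamp format, which is the case for stock Nutch 1.x.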


but it does not seem to work. Crawling instead with the single command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 20


and testing on the apache.org domain, I get more URLs with the crawl
command than with the script (command: 1676, script: 1658).
Can anyone tell me what's wrong with my script? Is there a better way to
solve my problem?

Thanks,

Matteo

RE: Crawl a whole domain with indexing

Posted by Markus Jelsma <ma...@openindex.io>.
There is nothing wrong with your script, but how many URLs are generated depends on your data. The difference between your script and the crawl command (the two are almost identical) could also be explained by the state of your CrawlDb.
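One way to inspect that CrawlDb state is `bin/nutch readdb test/crawldb -stats`, which prints summary lines including a total URL count. A sketch of extracting that count follows; the stats output is mocked here since it would normally come from the readdb command, and the exact label text ("TOTAL urls:") is an assumption based on Nutch 1.x output:

```shell
#!/bin/bash
# Mocked output of: bin/nutch readdb test/crawldb -stats
# (real runs would capture the command's stdout instead)
stats="CrawlDb statistics start: test/crawldb
TOTAL urls:     1658
status 2 (db_fetched):  1432
CrawlDb statistics: done"

# pull the number after the "TOTAL urls:" label, stripping whitespace
total=$(echo "$stats" | awk -F':' '/TOTAL urls/ { gsub(/[ \t]/, "", $2); print $2 }')
echo "CrawlDb contains $total URLs"
```

Comparing this total after the script run versus after the crawl command would show whether the 1676-vs-1658 gap lives in the CrawlDb itself or only in what got indexed.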


 
 