Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/12/03 08:32:26 UTC
how to give several sites to Nutch to crawl?
hi, I want to give Nutch several sites and have Nutch crawl them. For example,
I want Nutch to crawl:
http://www.site1.com
http://www.site2.com
http://www.site3.com
How can I do that? Help me.
--
View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3556697.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: how to give several sites to Nutch to crawl?
Posted by mina <ta...@gmail.com>.
I added this property in nutch-site.xml but my problem isn't resolved. Which
property should I use? Help me, it's important for me.
Re: how to give several sites to Nutch to crawl?
Posted by al...@aim.com.
I think you should add this to nutch-site.xml
<property>
  <name>generate.max.count</name>
  <value>1000</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
and set topN to -1
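A companion setting (a sketch; the default and allowed values may differ by Nutch version, so check your nutch-default.xml) is generate.count.mode, which controls whether generate.max.count is applied per host or per domain:

```
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Count URLs for generate.max.count per "host" or
  per "domain".</description>
</property>
```

With generate.max.count=1000, count mode "host", and topN unlimited, the generator should put at most 1000 URLs per host into each fetchlist.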
Alex.
-----Original Message-----
From: mina <ta...@gmail.com>
To: nutch-user <nu...@lucene.apache.org>
Sent: Sat, Dec 3, 2011 6:10 pm
Subject: Re: how to give several sites to Nutch to crawl?
Thanks for your answer. I use this script to crawl my sites:
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb \
  $NUTCH_HOME/bin/seedUrls

for ((i = 0; i < depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb \
    $NUTCH_HOME/bin/crawl1/segments -topN $topN
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment1
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
    echo "deepcrawler: Deleting segment $segment1."
    rm $RMARGS $segment1
    continue
  fi
  $NUTCH_HOME/bin/nutch parse $segment1
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done

echo "----- Merge Segments (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments \
  $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
  mv $MVARGS $NUTCH_HOME/bin/crawl1/segments \
    $NUTCH_HOME/bin/crawl1/BACKUPsegments
fi
mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments \
  $NUTCH_HOME/bin/crawl1/segments

echo "----- Invert Links (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb \
  $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
  mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes \
    $NUTCH_HOME/bin/crawl1/BACKUPindexes
  mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi
$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ \
  $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb \
  $NUTCH_HOME/bin/crawl1/segments/*
But Nutch doesn't crawl the same number of pages from every site. For example,
with topN=1000, Nutch crawls 700 pages from site1, 250 from site2, 40 from
site3, and 10 pages from site4. I want Nutch to crawl 1000 pages from each
site. Help me.
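(A sketch of one workaround, not from the thread: since topN caps the whole fetchlist, running the crawl cycle once per site gives each site its own topN budget. The seeds/<site> and crawl_<site> layouts below are hypothetical.)

```shell
plan=""
for site in site1 site2 site3; do
  # The real per-site cycle would look like (paths are assumptions):
  #   $NUTCH_HOME/bin/nutch inject crawl_$site/crawldb seeds/$site
  #   $NUTCH_HOME/bin/nutch generate crawl_$site/crawldb \
  #     crawl_$site/segments -topN 1000
  #   ...then fetch / parse / updatedb, as in the depth loop above.
  echo "would crawl $site into crawl_$site with its own topN=1000"
  plan="$plan $site"
done
```

Each site then gets up to 1000 pages regardless of how link-rich the other sites are, at the cost of maintaining one crawldb per site.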
Re: how to give several sites to Nutch to crawl?
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 2011-12-03 08:32, mina wrote:
> hi, I want to give Nutch several sites and have Nutch crawl them. For
> example, I want Nutch to crawl:
> http://www.site1.com
> http://www.site2.com
> http://www.site3.com
> How can I do that? Help me.
>
1.) Make a dir called e.g. "seedUrls" and add a plain text file listing all
the sites you want to crawl.
2.) Add:
+^http://www.site1.com
+^http://www.site2.com
...
+^http://www.siteN.com
to your regex-urlfilter.txt in order to allow these URLs to be crawled.
3.) Call the inject command (./nutch inject <crawldb> <url_dir>), where
<crawldb> is the name of your new crawldb and <url_dir> is the directory
containing the seed URLs, in my example "seedUrls".
Then you can call the generator, fetcher, parser and updater for a crawl
cycle.
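Steps 1 and 3 might look like this (filenames are examples; the inject call assumes a Nutch 1.x install and is shown commented out since it needs a running Nutch setup):

```shell
# Step 1: seed directory with one plain text file listing the sites.
mkdir -p seedUrls
cat > seedUrls/urls.txt <<'EOF'
http://www.site1.com
http://www.site2.com
http://www.site3.com
EOF
# Step 2 goes in conf/regex-urlfilter.txt. Note that an unescaped "."
# in a regex matches any character; +^http://www\.site1\.com is the
# stricter form of the pattern above.
# Step 3: inject the seeds into a fresh crawldb:
#   bin/nutch inject crawldb seedUrls
```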
Hope that helps for the start. :)