Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/12/03 08:32:26 UTC
how to give several sites to Nutch to crawl?
hi, I want to give Nutch several sites and have Nutch crawl them. For example,
I want Nutch to crawl:
http://www.site1.com
http://www.site2.com
http://www.site3.com
How can I do that? Help me.
--
View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3556697.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: how to give several sites to Nutch to crawl?
Posted by mina <ta...@gmail.com>.
I added this property in nutch-site.xml but my problem isn't resolved. Which
property should I use? Help me, it's important for me.
Re: how to give several sites to Nutch to crawl?
Posted by al...@aim.com.
I think you should add this to nutch-site.xml
<property>
  <name>generate.max.count</name>
  <value>1000</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
and set topN to -1
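A companion setting (a sketch; the default and allowed values may differ by Nutch version, so check your nutch-default.xml) is generate.count.mode, which controls whether generate.max.count is applied per host or per domain:

```
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Count URLs for generate.max.count per "host" or
  per "domain".</description>
</property>
```

With generate.max.count=1000, count mode "host", and topN unlimited, the generator should put at most 1000 URLs per host into each fetchlist.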
Alex.
-----Original Message-----
From: mina <ta...@gmail.com>
To: nutch-user <nu...@lucene.apache.org>
Sent: Sat, Dec 3, 2011 6:10 pm
Subject: Re: how to give several sites to Nutch to crawl?
Thanks for your answer. I use this script to crawl my sites:
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb \
  $NUTCH_HOME/bin/seedUrls

for ((i = 0; i < depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb \
    $NUTCH_HOME/bin/crawl1/segments -topN $topN
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment1
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
    echo "deepcrawler: Deleting segment $segment1."
    rm $RMARGS $segment1
    continue
  fi
  $NUTCH_HOME/bin/nutch parse $segment1
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done

echo "----- Merge Segments (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments \
  $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
  mv $MVARGS $NUTCH_HOME/bin/crawl1/segments \
    $NUTCH_HOME/bin/crawl1/BACKUPsegments
fi
mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments \
  $NUTCH_HOME/bin/crawl1/segments

echo "----- Invert Links (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb \
  $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
  mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes \
    $NUTCH_HOME/bin/crawl1/BACKUPindexes
  mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi
$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ \
  $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb \
  $NUTCH_HOME/bin/crawl1/segments/*
But Nutch doesn't crawl the same number of pages from every site. For example,
with topN=1000, Nutch crawls 700 pages from site1, 250 from site2, 40 from
site3, and 10 pages from site4. I want Nutch to crawl 1000 pages from each
site. Help me.
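(A sketch of one workaround, not from the thread: since topN caps the whole fetchlist, running the crawl cycle once per site gives each site its own topN budget. The seeds/<site> and crawl_<site> layouts below are hypothetical.)

```shell
plan=""
for site in site1 site2 site3; do
  # The real per-site cycle would look like (paths are assumptions):
  #   $NUTCH_HOME/bin/nutch inject crawl_$site/crawldb seeds/$site
  #   $NUTCH_HOME/bin/nutch generate crawl_$site/crawldb \
  #     crawl_$site/segments -topN 1000
  #   ...then fetch / parse / updatedb, as in the depth loop above.
  echo "would crawl $site into crawl_$site with its own topN=1000"
  plan="$plan $site"
done
```

Each site then gets up to 1000 pages regardless of how link-rich the other sites are, at the cost of maintaining one crawldb per site.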
Re: how to give several sites to Nutch to crawl?
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 2011-12-03 08:32, mina wrote:
> hi, I want to give Nutch several sites and have Nutch crawl them. For
> example, I want Nutch to crawl:
> http://www.site1.com
> http://www.site2.com
> http://www.site3.com
> How can I do that? Help me.
>
1.) Make a dir called e.g. "seedUrls" and add a plain text file listing all
the sites you want to crawl.
2.) Add:
+^http://www.site1.com
+^http://www.site2.com
...
+^http://www.siteN.com
to your regex-urlfilter.txt in order to allow these URLs to be crawled.
3.) Call the inject command (./nutch inject <crawldb> <url_dir>), where
<crawldb> is the name of your new crawldb and <url_dir> is the directory
containing the seed URLs, in my example "seedUrls".
Then you can call the generator, fetcher, parser and updater for a crawl
cycle.
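Steps 1 and 3 might look like this (filenames are examples; the inject call assumes a Nutch 1.x install and is shown commented out since it needs a running Nutch setup):

```shell
# Step 1: seed directory with one plain text file listing the sites.
mkdir -p seedUrls
cat > seedUrls/urls.txt <<'EOF'
http://www.site1.com
http://www.site2.com
http://www.site3.com
EOF
# Step 2 goes in conf/regex-urlfilter.txt. Note that an unescaped "."
# in a regex matches any character; +^http://www\.site1\.com is the
# stricter form of the pattern above.
# Step 3: inject the seeds into a fresh crawldb:
#   bin/nutch inject crawldb seedUrls
```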
Hope that helps for the start. :)