Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/11/09 09:51:57 UTC
crawl sites in nutch 1.3?
hi, I crawl my sites with a script. I have a text file named 'sites.txt', and my
script reads URLs from sites.txt. First I add a URL (for example, url1) to
sites.txt to crawl, and Nutch crawls that site well. Next I delete url1 from
sites.txt and add another URL (for example, url2). Nutch fetches the older URL
(url1) and crawls it again; it doesn't crawl url2.
nutch: 1.3
topN: 10
depth: 4
Please help.
--
View this message in context: http://lucene.472066.n3.nabble.com/crawl-sites-in-nutch-1-3-tp3492962p3492962.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Input path does not exist (parse_data)
Posted by Lewis John Mcgibbney <le...@gmail.com>.
By the looks of it, there was a problem parsing the data in this
particular segment. Please try re-parsing the segment.
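If it helps, here is a small sketch for finding which segments actually need re-parsing, i.e. those with no parse_data directory. The demo_segments layout below is fabricated for illustration; point SEGMENTS_DIR at your real crawl/segments directory instead.

```shell
#!/bin/sh
# Sketch: list crawl segments that lack a parse_data directory.
# These are the segments to hand to `bin/nutch parse <segment>`.
SEGMENTS_DIR=${SEGMENTS_DIR:-demo_segments}
# Fake layout for demonstration only:
mkdir -p "$SEGMENTS_DIR/20111112043249/parse_data"
mkdir -p "$SEGMENTS_DIR/20111112043120/crawl_generate"   # parse output missing

unparsed=""
for seg in "$SEGMENTS_DIR"/*/
do
  if [ ! -d "${seg}parse_data" ]
  then
    unparsed="$unparsed ${seg%/}"
    echo "needs re-parsing: ${seg%/}"
  fi
done
```

Each printed path can then be passed to `$NUTCH_HOME/bin/nutch parse <segment>`.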
On Sat, Nov 12, 2011 at 11:46 AM, Rum Raisin <ru...@yahoo.com> wrote:
> Sorry, continuing, since Yahoo keyboard shortcuts triggered a premature
> email...
>
> It already created the other directories listed below, with directories like
> crawl_generate under them. But why does it give this error? Did it fail to
> create the parse_data directory earlier that it's expecting now? Or does it
> think there should be data in that directory when there's nothing there?
>
> /nutch-trunk/crawl/segments/20111112043249
> /nutch-trunk/crawl/segments/20111112043120
> /nutch-trunk/crawl/segments/20111112043717
> /nutch-trunk/crawl/segments/20111112042823
> /nutch-trunk/crawl/segments/20111112043256
>
>
> ________________________________
> From: Rum Raisin <ru...@yahoo.com>
> To: "user@nutch.apache.org" <us...@nutch.apache.org>
> Sent: Saturday, November 12, 2011 11:38 AM
> Subject: Input path does not exist (parse_data)
>
> I get this error running nutch trunk under eclipse...
> I don't understand what the problem is. It already created other
> directories like...
>
>
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112043120/parse_data
> Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112042823/parse_data
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>
--
*Lewis*
Re: Input path does not exist (parse_data)
Posted by Rum Raisin <ru...@yahoo.com>.
Sorry, continuing, since Yahoo keyboard shortcuts triggered a premature email...
It already created the other directories listed below, with directories like crawl_generate under them. But why does it give this error? Did it fail to create the parse_data directory earlier that it's expecting now? Or does it think there should be data in that directory when there's nothing there?
/nutch-trunk/crawl/segments/20111112043249
/nutch-trunk/crawl/segments/20111112043120
/nutch-trunk/crawl/segments/20111112043717
/nutch-trunk/crawl/segments/20111112042823
/nutch-trunk/crawl/segments/20111112043256
________________________________
From: Rum Raisin <ru...@yahoo.com>
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Sent: Saturday, November 12, 2011 11:38 AM
Subject: Input path does not exist (parse_data)
I get this error running Nutch trunk under Eclipse...
I don't understand what the problem is. It already created other directories like...
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112043120/parse_data
Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112042823/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Input path does not exist (parse_data)
Posted by Rum Raisin <ru...@yahoo.com>.
I get this error running Nutch trunk under Eclipse...
I don't understand what the problem is. It already created other directories like...
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112043120/parse_data
Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112042823/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Re: crawl sites in nutch 1.3?
Posted by mina <ta...@gmail.com>.
thanks for your answer. I think topN caused this problem, because when Nutch
fetches a URL it will fetch any links that exist in the page, and the maximum
number of links that will be fetched from a page equals topN. I think if Nutch
fetches as many URLs as topN, it will not fetch another URL from sites.txt.
Please give me an example of topN; I don't know very much about it. Here is my
script:
# deepcrawler script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/deepcrawler [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
# set host
export HOST=127.0.0.1
# set depth
depth=4
# set threads
threads=5
adddays=0
# set topN (comment out this statement if you don't want to set a topN value)
topN=10
# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"
# Parse arguments
if [ "$1" == "safe" ]
then
safe=yes
fi
if [ -z "$NUTCH_HOME" ]
then
# set nutchHome
export NUTCH_HOME=/search-engine/nutch/runtime/local
# set javaHome
export JAVA_HOME=/opt/jdk1.6.0_25/
echo deepcrawler: $0 could not find environment variable NUTCH_HOME
echo "host is $HOST"
echo deepcrawler: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo deepcrawler: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if [ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/home/ganjyar/Development/apache-tomcat-6.0.33
echo deepcrawler: $0 could not find environment variable CATALINA_HOME
echo deepcrawler: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
echo deepcrawler: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi
if [ -n "$topN" ]
then
topN="-topN $topN"
else
topN=""
fi
steps=10
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb \
$NUTCH_HOME/bin/urls/sites.txt
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb \
$NUTCH_HOME/bin/crawl1/segments $topN \
-adddays $adddays
if [ $? -ne 0 ]
then
echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment1 -threads $threads
if [ $? -ne 0 ]
then
echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
echo "deepcrawler: Deleting segment $segment1."
rm $RMARGS $segment1
continue
fi
$NUTCH_HOME/bin/nutch parse $segment1
$NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done
#echo "----- Generate, Fetch, Parse, Update (Step 3 of $steps) -----"
#for((i=0; i < $depth; i++))
#do
# echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
# sh nutch generate crawl1/crawldb crawl1/segments topN 1000 \
# -adddays $adddays
# if [ $? -ne 0 ]
# then
# echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
# break
# fi
# segment2=`ls -d crawl1/segments/* | tail -1`
# sh nutch fetch $segment2
# if [ $? -ne 0 ]
# then
# echo "deepcrawler: fetch $segment2 at depth `expr $i + 1` failed."
# echo "deepcrawler: Deleting segment $segment2."
# rm $RMARGS $segment2
# continue
# fi
#sh nutch parse $segment2
# sh nutch updatedb crawl1/crawldb $segment2
#done
#echo "----- Generate, Fetch, Parse, Update (Step 4 of $steps) -----"
#for((i=0; i < $depth; i++))
#do
# echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
# sh nutch generate crawl1/crawldb crawl1/segments topN 1000 \
# -adddays $adddays
# if [ $? -ne 0 ]
# then
# echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
# break
# fi
# segment3=`ls -d crawl1/segments/* | tail -1`
# sh nutch fetch $segment3
# if [ $? -ne 0 ]
# then
# echo "deepcrawler: fetch $segment3 at depth `expr $i + 1` failed."
# echo "deepcrawler: Deleting segment $segment3."
# rm $RMARGS $segment3
# continue
# fi
#sh nutch parse $segment3
# sh nutch updatedb crawl1/crawldb $segment3
#done
echo "----- Merge Segments (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments \
$NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
mv $MVARGS $NUTCH_HOME/bin/crawl1/segments \
$NUTCH_HOME/bin/crawl1/BACKUPsegments
fi
mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments \
$NUTCH_HOME/bin/crawl1/segments
echo "----- Invert Links (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb \
$NUTCH_HOME/bin/crawl1/segments/*
echo "----- Index (Step 7 of $steps) -----"
#sh nutch index crawl1/NEWindexes crawl1/crawldb crawl1/linkdb \
# crawl1/segments/*
echo "----- Dedup (Step 8 of $steps) -----"
#sh nutch dedup crawl1/NEWindexes
echo "----- Merge Indexes (Step 9 of $steps) -----"
#sh nutch merge crawl1/NEWindex crawl1/NEWindexes
echo "----- Loading New Index (Step 10 of $steps) -----"
#${CATALINA_HOME}/bin/shutdown.sh
if [ "$safe" != "yes" ]
then
rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes \
$NUTCH_HOME/bin/crawl1/BACKUPindexes
mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi
#mv $MVARGS crawl1/NEWindex crawl1/index
#sh catalina startup.sh
$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ \
$NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb \
$NUTCH_HOME/bin/crawl1/segments/*
echo "deepcrawler: FINISHED: Crawl completed!"
echo ""
--
View this message in context: http://lucene.472066.n3.nabble.com/crawl-sites-in-nutch-1-3-tp3492962p3500896.html
Re: crawl sites in nutch 1.3?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Firstly, you have not detailed exactly what your script does!
Once you initially fetched url1, did you update your crawldb? Secondly, when
you manually deleted url1 and added url2, did you inject these into the
crawldb and then generate a fetchlist?
Regardless of your settings for the crawl command, if the above steps have
not been undertaken then there is no way Nutch can know which URLs are
most appropriate to fetch at any given time.
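To make that concrete, one round of the cycle looks roughly like the commands below. This is only a sketch: the paths are examples, the <segment> placeholder must be filled in from `ls -d crawl1/segments/* | tail -1`, and the `run` helper defaults to printing each command rather than executing it (set DRY_RUN=no to run against a real Nutch install).

```shell
#!/bin/sh
# Dry-run sketch of one recrawl round: re-inject the edited seed list,
# then generate / fetch / parse / updatedb. Paths are examples only.
NUTCH=${NUTCH_HOME:-/search-engine/nutch/runtime/local}/bin/nutch
CRAWLDB=crawl1/crawldb
SEGMENTS=crawl1/segments
DRY_RUN=${DRY_RUN:-yes}

run() {
  if [ "$DRY_RUN" = "yes" ]
  then
    echo "$@"        # print the command instead of executing it
  else
    "$@"
  fi
}

# 1. Re-inject sites.txt so the crawldb learns about url2.
run "$NUTCH" inject "$CRAWLDB" urls/sites.txt
# 2. Build a fresh fetchlist from the updated crawldb.
run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 10
# 3. Fetch/parse the newest segment and fold the results back in.
#    (Replace <segment> with the output of: ls -d $SEGMENTS/* | tail -1)
run "$NUTCH" fetch "$SEGMENTS/<segment>" -threads 5
run "$NUTCH" parse "$SEGMENTS/<segment>"
run "$NUTCH" updatedb "$CRAWLDB" "$SEGMENTS/<segment>"
```

If the inject and updatedb steps are skipped, the crawldb still contains only url1, which would explain the behaviour described in the first message.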
On Wed, Nov 9, 2011 at 12:51 AM, mina <ta...@gmail.com> wrote:
> hi, crawl my sites with a script. i have a text file with name
> 'sites.txt',my
> script read urls from sites.txt, first i add a url-for example url1 - in
> sites.txt to crawl, and nutch crawl this site well. next i delete url1 from
> sites.txt and add another url -for example url2-. nutch fetch older url
> -url1- and crawl it. it dosn't crawl url2.
> nutch :1.3
> topN:10
> depth:4
> help me.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/crawl-sites-in-nutch-1-3-tp3492962p3492962.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
--
*Lewis*