Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/11/09 09:51:57 UTC

crawl sites in nutch 1.3?

Hi, I crawl my sites with a script. I have a text file named 'sites.txt',
and my script reads URLs from sites.txt. First I add a URL (for example
url1) to sites.txt to crawl, and Nutch crawls this site well. Next I delete
url1 from sites.txt and add another URL (for example url2). Nutch fetches
the older URL (url1) and crawls it; it doesn't crawl url2.
nutch: 1.3
topN: 10
depth: 4
Please help me.

--
View this message in context: http://lucene.472066.n3.nabble.com/crawl-sites-in-nutch-1-3-tp3492962p3492962.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Input path does not exist (parse_data)

Posted by Lewis John Mcgibbney <le...@gmail.com>.
By the looks of it there was a problem parsing the data in this particular
segment: parse_data is only written once a segment has actually been parsed,
so if the parse step failed or was skipped the LinkDb job cannot find its
input. Please try reparsing the segment.
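
Something along these lines should do it (the segment paths come from your
error message below; adjust to your own layout):

  bin/nutch parse crawl/segments/20111112043120
  bin/nutch parse crawl/segments/20111112042823

If the parses succeed, parse_data and parse_text directories should appear
under those segments and the LinkDb step will find its input.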

On Sat, Nov 12, 2011 at 11:46 AM, Rum Raisin <ru...@yahoo.com> wrote:

> Sorry, continuing, since Yahoo keyboard shortcuts triggered a premature
> email...
>
> It already created the directories below, with subdirectories like
> crawl_generate under them. So why does it give this error? Did it fail to
> create the parse_data directory earlier that it's expecting now? Or does
> it think there should be data in that directory when there's nothing there?
>
> /nutch-trunk/crawl/segments/20111112043249
> /nutch-trunk/crawl/segments/20111112043120
> /nutch-trunk/crawl/segments/20111112043717
> /nutch-trunk/crawl/segments/20111112042823
> /nutch-trunk/crawl/segments/20111112043256
>
>
> ________________________________
> From: Rum Raisin <ru...@yahoo.com>
> To: "user@nutch.apache.org" <us...@nutch.apache.org>
> Sent: Saturday, November 12, 2011 11:38 AM
> Subject: Input path does not exist (parse_data)
>
> I get this error running Nutch trunk under Eclipse...
> I don't understand what the problem is. It already created other
> directories like...
>
>
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112043120/parse_data
> Input path does not exist:
> file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112042823/parse_data
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>



-- 
*Lewis*

Re: Input path does not exist (parse_data)

Posted by Rum Raisin <ru...@yahoo.com>.
Sorry, continuing, since Yahoo keyboard shortcuts triggered a premature email...

It already created the directories below, with subdirectories like crawl_generate under them. So why does it give this error? Did it fail to create the parse_data directory earlier that it's expecting now? Or does it think there should be data in that directory when there's nothing there?

/nutch-trunk/crawl/segments/20111112043249
/nutch-trunk/crawl/segments/20111112043120
/nutch-trunk/crawl/segments/20111112043717
/nutch-trunk/crawl/segments/20111112042823
/nutch-trunk/crawl/segments/20111112043256


________________________________
From: Rum Raisin <ru...@yahoo.com>
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Sent: Saturday, November 12, 2011 11:38 AM
Subject: Input path does not exist (parse_data)

I get this error running Nutch trunk under Eclipse...
I don't understand what the problem is. It already created other directories like...


Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112043120/parse_data
Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112042823/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Input path does not exist (parse_data)

Posted by Rum Raisin <ru...@yahoo.com>.
I get this error running Nutch trunk under Eclipse...
I don't understand what the problem is. It already created other directories like...


Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112043120/parse_data
Input path does not exist: file:/home/jeff/workspace/nutch-trunk/crawl/segments/20111112042823/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Re: crawl sites in nutch 1.3?

Posted by mina <ta...@gmail.com>.
Thanks for your answer. I think topN caused this problem, because when Nutch
fetches a URL it will fetch the links that exist in the page, and I think the
maximum number of links that will be fetched from a page equals topN. So if
Nutch fetches a number of URLs equal to topN, it will not fetch another URL
from sites.txt. Please give me an example of topN; I don't know very much
about it.
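
With my settings the generate call in the script below expands to something
like this, which, as far as I understand, means each round's fetchlist holds
at most the 10 top-scoring URLs from the crawldb:

  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb \
      $NUTCH_HOME/bin/crawl1/segments -topN 10 -adddays 0

Here is my script: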
# deepcrawler script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/deepcrawler [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

# set host
export HOST=127.0.0.1

# set depth
depth=4

# set threads
threads=5

adddays=0

# set topN
# Comment out the next statement if you don't want to set a topN value
topN=10

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then

# set nutchHome
export NUTCH_HOME=/search-engine/nutch/runtime/local

# set javaHome
export JAVA_HOME=/opt/jdk1.6.0_25/


echo deepcrawler: $0 could not find environment variable NUTCH_HOME
echo "host is $HOST"
  echo deepcrawler: NUTCH_HOME=$NUTCH_HOME has been set by the script 
else
  echo deepcrawler: $0 found environment variable NUTCH_HOME=$NUTCH_HOME 
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/home/ganjyar/Development/apache-tomcat-6.0.33
  echo deepcrawler: $0 could not find environment variable NUTCH_HOME
  echo deepcrawler: CATALINA_HOME=$CATALINA_HOME has been set by the script 
else
  echo deepcrawler: $0 found environment variable
CATALINA_HOME=$CATALINA_HOME 
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=10
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb \
    $NUTCH_HOME/bin/urls/sites.txt

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb \
      $NUTCH_HOME/bin/crawl1/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment1 -threads $threads
  if [ $? -ne 0 ]
  then
    echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
    echo "deepcrawler: Deleting segment $segment1."
    rm $RMARGS $segment1
    continue
  fi
  $NUTCH_HOME/bin/nutch parse $segment1
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done

#echo "----- Generate, Fetch, Parse, Update (Step 3 of $steps) -----"
#for((i=0; i < $depth; i++))
#do
#  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
#  sh nutch generate crawl1/crawldb crawl1/segments -topN 1000 \
#      -adddays $adddays
#  if [ $? -ne 0 ]
#  then
#    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
#    break
#  fi
#  segment2=`ls -d crawl1/segments/* | tail -1`

#  sh nutch fetch $segment2 
#  if [ $? -ne 0 ]
#  then
#    echo "deepcrawler: fetch $segment2 at depth `expr $i + 1` failed."
#    echo "deepcrawler: Deleting segment $segment2."
#    rm $RMARGS $segment2
#    continue
#  fi
#sh nutch parse $segment2
#  sh nutch updatedb crawl1/crawldb $segment2
#done

#echo "----- Generate, Fetch, Parse, Update (Step 4 of $steps) -----"
#for((i=0; i < $depth; i++))
#do
#  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
#  sh nutch generate crawl1/crawldb crawl1/segments -topN 1000 \
#      -adddays $adddays
#  if [ $? -ne 0 ]
#  then
#    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
#    break
#  fi
#  segment3=`ls -d crawl1/segments/* | tail -1`

#  sh nutch fetch $segment3 
#  if [ $? -ne 0 ]
#  then
#    echo "deepcrawler: fetch $segment3 at depth `expr $i + 1` failed."
#    echo "deepcrawler: Deleting segment $segment3."
#    rm $RMARGS $segment3
#    continue
#  fi
#sh nutch parse $segment3
#  sh nutch updatedb crawl1/crawldb $segment3
#done

echo "----- Merge Segments (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments \
    $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
  mv $MVARGS $NUTCH_HOME/bin/crawl1/segments \
      $NUTCH_HOME/bin/crawl1/BACKUPsegments
fi

mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments \
    $NUTCH_HOME/bin/crawl1/segments

echo "----- Invert Links (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb \
    $NUTCH_HOME/bin/crawl1/segments/*

echo "----- Index (Step 7 of $steps) -----"
#sh nutch index crawl1/NEWindexes crawl1/crawldb crawl1/linkdb \
#    crawl1/segments/*

echo "----- Dedup (Step 8 of $steps) -----"
#sh nutch dedup crawl1/NEWindexes

echo "----- Merge Indexes (Step 9 of $steps) -----"
#sh nutch merge crawl1/NEWindex crawl1/NEWindexes

echo "----- Loading New Index (Step 10 of $steps) -----"
#${CATALINA_HOME}/bin/shutdown.sh

if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
  mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes \
      $NUTCH_HOME/bin/crawl1/BACKUPindexes
  mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi

#mv $MVARGS crawl1/NEWindex crawl1/index

#sh catalina startup.sh
$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ \
    $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb \
    $NUTCH_HOME/bin/crawl1/segments/*
echo "deepcrawler: FINISHED: Crawl completed!"
echo ""


--
View this message in context: http://lucene.472066.n3.nabble.com/crawl-sites-in-nutch-1-3-tp3492962p3500896.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: crawl sites in nutch 1.3?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Firstly, you have not detailed exactly what your script does.
Once you initially fetched url1, did you update your crawldb? Secondly,
when you manually deleted url1 and added url2, did you inject url2 into
the crawldb and then generate a fetchlist?

Regardless of your settings for the crawl command, if the above steps have
not been undertaken then there is no way Nutch can know which URLs are most
appropriate to fetch at any given time.
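
In outline, picking up a newly added URL looks something like this (the
paths follow the ones in your script; adjust as needed):

  # make the new URL known to the crawldb
  $NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/urls/sites.txt

  # generate a fetchlist, fetch and parse it, then fold the results back in
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/segments -topN 10
  segment=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads 5
  $NUTCH_HOME/bin/nutch parse $segment
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment

Also bear in mind that inject only adds URLs: deleting url1 from sites.txt
does not remove it from the crawldb, so Nutch will keep considering it for
fetching alongside url2.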

On Wed, Nov 9, 2011 at 12:51 AM, mina <ta...@gmail.com> wrote:

> Hi, I crawl my sites with a script. I have a text file named 'sites.txt',
> and my script reads URLs from sites.txt. First I add a URL (for example
> url1) to sites.txt to crawl, and Nutch crawls this site well. Next I delete
> url1 from sites.txt and add another URL (for example url2). Nutch fetches
> the older URL (url1) and crawls it; it doesn't crawl url2.
> nutch: 1.3
> topN: 10
> depth: 4
> Please help me.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/crawl-sites-in-nutch-1-3-tp3492962p3492962.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*