Posted to user@nutch.apache.org by Tom Gardner <to...@tomg.com> on 2009/09/03 19:23:17 UTC

InvalidInputException: Input path does not exist

Hello,

I'm trying to get whole-web crawling working, but I'm getting this error in
the final indexing steps:

LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
Can anyone help? This is running Nutch 1.0 on a clean Nutch install on
Fedora Linux.

I've verified the same error using the nutch-2009-09-03_05-18-47 release as
well.

My script and full error output are below.

Thanks


-------------------------------------------------------------------- Nutch Script -------------------------------------------------------------------------

#!/bin/bash
export JAVA_HOME=/usr/local/jdk

# Clean up from last run
/bin/rm -rf crawl seed
mkdir seed
# Copy list of urls to the seed directory
cp urls seed/urls.txt
# Inject the URLs in the 'seed' directory into the crawldb
/usr/local/nutch/bin/nutch inject crawl/crawldb seed
# Generate fetch list, fetch and parse content
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 1
# The above command will generate a new segment directory
# under crawl/segments that at this point contains files that
# store the url(s) to be fetched. In the following commands
# we need the latest segment dir as a parameter, so we'll store
# it in an environment variable:
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 1: $SEGMENT
# Now launch the fetcher that actually goes to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 1
# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 1
# Then update the Nutch crawldb. The updatedb command will
# store all new urls discovered during the fetch and parse of
# the previous segment into the Nutch database so they can be
# fetched later. Nutch also stores information about the
# pages that were fetched so the same urls won't be fetched
# again and again.
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 1
# Now the database has entries for all of the pages referenced by the initial set

# Now we fetch a new segment with the top-scoring 1000 pages
# /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 2
# reset SEGMENT
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 2: $SEGMENT
# Now re-launch the fetcher to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 2
# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 2
# update the db
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATE 2

# Fetch another round
# /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 3
# reset SEGMENT
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 3: $SEGMENT
# Now re-launch the fetcher to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 3
# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 3
# update the db
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 3

#
# We now index what we've gotten
#
# Before indexing we first invert all of the links,
# so that we may index incoming anchor text with the pages.
/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*
# Then index
/usr/local/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

---------------------------------------------------------------- Nutch Errors ---------------------------------------------------------








-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
DONE FETCH 3
DONE PARSE 3
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20090903093336]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
DONE UPDATEDB 3
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_fetch/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_generate/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/content/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_parse/parse_data
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
 at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
Indexer: starting
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/linkdb/current
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
 at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
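A note on reading the LinkDb log above: the trailing /* in the invertlinks step is expanded by the shell before Nutch ever sees it. A hypothetical way to make that visible (not output from this thread; paths follow the script above):

# Hypothetical check: print the invertlinks command instead of running it
echo /usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*
# With the segments created by this script it prints something like:
#   ... invertlinks crawl/linkdb -dir crawl/segments/20090903093154 <the remaining segment dirs>
# so the value after -dir is a single segment directory (here 20090903093154), and
# LinkDb then lists that segment's children (content, crawl_fetch, crawl_generate,
# crawl_parse, parse_data, parse_text) as if each were a segment, which is why it
# looks for parse_data inside them and fails with ".../parse_data/parse_data".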

Re: InvalidInputException: Input path does not exist

Posted by Tom Gardner <to...@tomg.com>.
That worked perfectly! Thanks

On Thu, Sep 3, 2009 at 12:03 PM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:

> Haven't checked but I expect the correct command to be :
> */usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments*
> without the trailing /*
>
> J.
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com

Re: InvalidInputException: Input path does not exist

Posted by Julien Nioche <li...@gmail.com>.
Haven't checked but I expect the correct command to be :
*/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments*
without the trailing /*

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
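
For completeness, a minimal sketch of the last two steps of the original script with this fix applied (same paths as in the script above; the index line is only rejoined, not changed):

# Invert links: pass the parent segments directory to -dir, with no trailing /*,
# so LinkDb itself picks up every segment under crawl/segments
/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# Then index the crawldb, linkdb and all segments
/usr/local/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*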

