Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/07/24 16:53:31 UTC

Why did my crawl fail?

I installed Nutch 1.0 on my laptop last night and set it running to crawl my
blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
It was still running strong when I went to bed several hours later, and this
morning I woke up to this:

activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.blog/crawldb
CrawlDb update: segments: [crawl.blog/segments/20090724010303]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.blog/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)


-- 
http://www.linkedin.com/in/paultomblin

Re: Why did my crawl fail?

Posted by Paul Tomblin <pt...@xcski.com>.
Unfortunately I blew away those particular logs when I fetched the svn
trunk.  I just tried it again (well, I started it again at noon and it just
finished) and this time it worked fine, so it seems kind of heisenbug-like.
Maybe it has something to do with which page types it can't handle?

On Mon, Jul 27, 2009 at 11:27 AM, xiao yang <ya...@gmail.com> wrote:

> Hi, Paul
>
> Can you post the error messages in the log file
> (file:/Users/ptomblin/nutch-1.0/logs)?
>
> On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin<pt...@xcski.com> wrote:
> > Actually, I got that error the first time I used it, and then again
> > when I blew away the downloaded nutch and grabbed the latest trunk
> > from Subversion.
> >
> > On Mon, Jul 27, 2009 at 1:11 AM, xiao yang <ya...@gmail.com> wrote:
> >
> >> You must have crawled several times, and some of those crawls failed
> >> before the parse phase, so the parse data was not generated.
> >> You'd better delete the whole directory
> >> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the output
> >> will show you exactly why it failed in the parse phase.
> >>
> >> Xiao
> >>
> >> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin<pt...@xcski.com> wrote:
> >> > [original post with crawl output and stack trace snipped]
>



-- 
http://www.linkedin.com/in/paultomblin

Re: Why did my crawl fail?

Posted by xiao yang <ya...@gmail.com>.
Hi, Paul

Can you post the error messages in the log file
(file:/Users/ptomblin/nutch-1.0/logs)?
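
For example, something like this should pull out the relevant errors (just a
sketch; hadoop.log is the default log file name, so adjust it if your setup
logs elsewhere):

  grep -i -A 20 exception /Users/ptomblin/nutch-1.0/logs/hadoop.log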

On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin<pt...@xcski.com> wrote:
> Actually, I got that error the first time I used it, and then again when I
> blew away the downloaded nutch and grabbed the latest trunk from Subversion.
>
> On Mon, Jul 27, 2009 at 1:11 AM, xiao yang <ya...@gmail.com> wrote:
>
>> You must have crawled several times, and some of those crawls failed
>> before the parse phase, so the parse data was not generated.
>> You'd better delete the whole directory
>> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the output
>> will show you exactly why it failed in the parse phase.
>>
>> Xiao
>>
>> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin<pt...@xcski.com> wrote:
>> > [original post with crawl output and stack trace snipped]
>

Re: Why did my crawl fail?

Posted by Paul Tomblin <pt...@xcski.com>.
Actually, I got that error the first time I used it, and then again when I
blew away the downloaded nutch and grabbed the latest trunk from Subversion.

On Mon, Jul 27, 2009 at 1:11 AM, xiao yang <ya...@gmail.com> wrote:

> You must have crawled several times, and some of those crawls failed
> before the parse phase, so the parse data was not generated.
> You'd better delete the whole directory
> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the output
> will show you exactly why it failed in the parse phase.
>
> Xiao
>
> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin<pt...@xcski.com> wrote:
> > [original post with crawl output and stack trace snipped]
>



-- 
http://www.linkedin.com/in/paultomblin

Re: Why did my crawl fail?

Posted by xiao yang <ya...@gmail.com>.
You must have crawled several times, and some of those crawls failed
before the parse phase, so the parse data was not generated.
You'd better delete the whole directory
file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the output
will show you exactly why it failed in the parse phase.
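
For example, something along these lines (just a sketch, reusing the path and
command from your original mail):

  rm -rf /Users/ptomblin/nutch-1.0/crawl.blog
  bin/nutch crawl urls -dir crawl.blog -depth 10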

Xiao

On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin<pt...@xcski.com> wrote:
> [original post with crawl output and stack trace snipped]

RE: Why did my crawl fail?

Posted by Ar...@csiro.au.
Sorry, I think you misunderstood me. I meant that no content was fetched on that iteration, for the segment that does not have parse_data.
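
A quick way to check is to list that segment and see what is there besides the
missing parse_data (a rough check only, using the segment from your stack
trace):

  ls /Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530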

> -----Original Message-----
> From: ptomblin@gmail.com [mailto:ptomblin@gmail.com] On Behalf Of Paul
> Tomblin
> Sent: Monday, July 27, 2009 11:12 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Why did my crawl fail?
> 
> No, it fetched thousands of pages - my blog and picture gallery.  It just
> never finished indexing them because as well as looking at the 11 segments
> that exist, it's also trying to look at a segment that doesn't.
> 
> On Sun, Jul 26, 2009 at 9:06 PM, <Ar...@csiro.au> wrote:
> 
> > This is a very interesting issue. I guess that the absence of parse_data
> > means that no content has been fetched. Am I wrong?
> >
> > This happened in my crawls a few times. Theoretically (I am guessing
> > again) this may happen if all the URLs selected for fetching on an
> > iteration are either blocked by the filters or fail to be fetched, for
> > whatever reason.
> >
> > I got around this problem by checking for the presence of parse_data
> > and, if it is absent, deleting the segment. This seems to be working,
> > but I am not 100% sure that it is a good thing to do. Can I do this? Is
> > it safe to do? I would appreciate it if someone with expert knowledge
> > commented on this issue.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> > > -----Original Message-----
> > > From: ptomblin@gmail.com [mailto:ptomblin@gmail.com] On Behalf Of Paul
> > > Tomblin
> > > Sent: Saturday, July 25, 2009 12:54 AM
> > > To: nutch-user
> > > Subject: Why did my crawl fail?
> > >
> > > [original post with crawl output and stack trace snipped]
> >
> 
> 
> 
> --
> http://www.linkedin.com/in/paultomblin

Re: Why did my crawl fail?

Posted by Paul Tomblin <pt...@xcski.com>.
No, it fetched thousands of pages - my blog and picture gallery.  It just
never finished indexing them because, as well as looking at the 11 segments
that exist, it's also trying to look at a segment that doesn't.
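
For what it's worth, listing the parse_data directories under the crawl dir
from my command shows which segments actually got one (a quick check):

  ls -d /Users/ptomblin/nutch-1.0/crawl.blog/segments/*/parse_data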

On Sun, Jul 26, 2009 at 9:06 PM, <Ar...@csiro.au> wrote:

> This is a very interesting issue. I guess that the absence of parse_data
> means that no content has been fetched. Am I wrong?
>
> This happened in my crawls a few times. Theoretically (I am guessing again)
> this may happen if all the URLs selected for fetching on an iteration are
> either blocked by the filters or fail to be fetched, for whatever reason.
>
> I got around this problem by checking for the presence of parse_data and,
> if it is absent, deleting the segment. This seems to be working, but I am
> not 100% sure that it is a good thing to do. Can I do this? Is it safe to
> do? I would appreciate it if someone with expert knowledge commented on
> this issue.
>
> Regards,
>
> Arkadi
>
>
> > -----Original Message-----
> > From: ptomblin@gmail.com [mailto:ptomblin@gmail.com] On Behalf Of Paul
> > Tomblin
> > Sent: Saturday, July 25, 2009 12:54 AM
> > To: nutch-user
> > Subject: Why did my crawl fail?
> >
> > [original post with crawl output and stack trace snipped]
>



-- 
http://www.linkedin.com/in/paultomblin

RE: Why did my crawl fail?

Posted by Ar...@csiro.au.
This is a very interesting issue. I guess that the absence of parse_data means that no content has been fetched. Am I wrong?

This happened in my crawls a few times. Theoretically (I am guessing again) this may happen if all the URLs selected for fetching on an iteration are either blocked by the filters or fail to be fetched, for whatever reason.

I got around this problem by checking for the presence of parse_data and, if it is absent, deleting the segment. This seems to be working, but I am not 100% sure that it is a good thing to do. Can I do this? Is it safe to do? I would appreciate it if someone with expert knowledge commented on this issue.
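
Concretely, what I do is roughly this (just a sketch; substitute the real
crawl directory, here the one from Paul's mail):

  for seg in /Users/ptomblin/nutch-1.0/crawl.blog/segments/*; do
    if [ ! -d "$seg/parse_data" ]; then
      # this segment never produced parse_data, so drop it
      echo "deleting incomplete segment: $seg"
      rm -rf "$seg"
    fi
  done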

Regards,

Arkadi


> -----Original Message-----
> From: ptomblin@gmail.com [mailto:ptomblin@gmail.com] On Behalf Of Paul
> Tomblin
> Sent: Saturday, July 25, 2009 12:54 AM
> To: nutch-user
> Subject: Why did my crawl fail?
>
> [original post with crawl output and stack trace snipped]