Posted to dev@nutch.apache.org by Shuo Li <sl...@usc.edu> on 2015/02/21 00:26:22 UTC

linkdb/current/part-00000/data does not exist

Hi,

I'm trying to crawl NSF ACADIS with nutch-selenium, and I'm running into a
"linkdb/current/part-00000/data does not exist" error. I checked my
directories and files during the crawl, and the file seems to appear and
disappear intermittently, which is quite strange.

Another problem: when we crawl NSIDC ADE, we get a 403 Forbidden error.
Does this mean NSIDC ADE is blocking us?

The log for the first error is at the bottom of this email. Any help would
be appreciated.

Regards,
Shuo Li





LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)

Re: linkdb/current/part-00000/data does not exist

Posted by Shuo Li <sl...@usc.edu>.
I was using ./bin/crawl and was not doing incremental crawling at the time.
The file appears after I start crawling *.gif, *.jpg, *.mov, etc. I will
provide more information if I can reproduce the error.
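
(Crawling those suffixes requires relaxing the default skip rule in
conf/regex-urlfilter.txt, which in stock Nutch looks roughly like the
abbreviated line below.)

    # skip image and other suffixes we can't yet parse (abbreviated here)
    -\.(gif|GIF|jpg|JPG|png|PNG|mov|MOV|jpeg|JPEG|bmp|BMP|js|JS)$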

Thanks =)

On Sun, Feb 22, 2015 at 4:47 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> What command are you using to crawl? Are you using bin/crawl, and/or
> doing incremental crawling?
>
> Cheers,
> Chris

Re: linkdb/current/part-00000/data does not exist

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
What command are you using to crawl? Are you using bin/crawl, and/or
doing incremental crawling?
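
For reference, a typical 1.x bin/crawl invocation is along these lines
(exact arguments, e.g. a Solr URL, vary by version; the seed directory and
round count here are placeholders, and the crawl directory is the one from
your log):

    # bin/crawl <seed dir> <crawl dir> <number of rounds>
    bin/crawl urls/ nsfacadis3Crawl/ 2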

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: linkdb/current/part-00000/data does not exist

Posted by veeresh beeram <ve...@gmail.com>.
Hi,

I was unable to reproduce the linkdb error.

The NSIDC ADE 403 Forbidden error occurs because NSIDC seems to be blocking
User-Agents containing "nutch".
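
One possible workaround (just a sketch; the agent string below is a
placeholder) is to set a different User-Agent via http.agent.name in
conf/nutch-site.xml, keeping the site's robots.txt rules and crawl-delay
in mind:

    <!-- conf/nutch-site.xml: override the default agent name -->
    <property>
      <name>http.agent.name</name>
      <!-- placeholder value; use an honest, descriptive agent string -->
      <value>MyResearchCrawler</value>
    </property>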

--
Thanks,
Veeresh
