Posted to user@nutch.apache.org by Svyatoslav Lavryk <la...@gmail.com> on 2015/03/10 16:09:11 UTC
"Not a File" Error on Re-Crawling
Hello,
We are using Nutch 1.9 and Hadoop 1.2.1.
When we submit URLs for the initial crawl, everything works fine.
When we next submit additional URLs to be added to the already existing
CrawlDb, we receive this error:
15/03/10 13:09:20 ERROR security.UserGroupInformation:
PriviledgedActionException as:ubuntu cause:java.io.IOException: Not a file:
hdfs://master:9000/user/ubuntu/urls/crawldb
15/03/10 13:09:20 ERROR crawl.Injector: Injector: java.io.IOException: Not
a file: hdfs://master:9000/user/ubuntu/urls/crawldb
Stack trace:
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
We think we have enabled permissions on the folder:
/usr/local/hadoop/bin/hadoop dfs -chmod -R 777 /user/ubuntu/urls/
Any ideas will be appreciated.
Thanks,
Slavik
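[Editor's note: the error above can be reproduced in miniature on a local filesystem, since FileInputFormat rejects any input-path entry that is a directory. A hedged sketch, with local directories standing in for HDFS and all paths illustrative:]

```shell
# Recreate the reported layout locally: a seed directory that also
# holds a crawldb subdirectory (illustrative paths, not real HDFS).
mkdir -p urls/crawldb
echo "http://example.com/" > urls/seed.txt

# FileInputFormat.getSplits lists every entry under the input path and
# throws "Not a file" for directory entries such as urls/crawldb.
# This check surfaces exactly those offending entries:
find urls -mindepth 1 -maxdepth 1 -type d
# -> urls/crawldb   (this entry is what makes the Injector fail)
```

Note that the chmod above cannot help: the problem is the type of the entry (a directory of binary data), not its permissions.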
Re: "Not a File" Error on Re-Crawling
Posted by Svyatoslav Lavryk <la...@gmail.com>.
Thanks Sebastian,
Hopefully your suggestion will help.
Slavik.
On Tue, Mar 10, 2015 at 9:39 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
> Hi Slavik,
>
> Assuming that /user/ubuntu/urls/ contains the seed URLs, it should not
> also contain the CrawlDb. The path in the error message,
> /user/ubuntu/urls/crawldb, suggests that the Injector tries to read URLs
> from crawldb, which is (a) a directory and (b) contains binary data.
>
> Sebastian
Re: "Not a File" Error on Re-Crawling
Posted by Svyatoslav Lavryk <la...@gmail.com>.
Hi Sebastian,
Your suggestion actually helped.
Thank you very much.
Slavik.
Re: "Not a File" Error on Re-Crawling
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Slavik,
Assuming that /user/ubuntu/urls/ contains the seed URLs, it should not
also contain the CrawlDb. The path in the error message,
/user/ubuntu/urls/crawldb, suggests that the Injector tries to read URLs
from crawldb, which is (a) a directory and (b) contains binary data.
Sebastian
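[Editor's note: the fix Sebastian describes is to keep the seed directory and the CrawlDb in separate trees. A minimal sketch, with local directories standing in for HDFS; the directory names are illustrative, not taken from the thread:]

```shell
# Separate trees: seed files under seeds/, CrawlDb under crawl/crawldb.
mkdir -p seeds crawl/crawldb
echo "http://example.com/" > seeds/seed.txt

# With this layout, the 1.x inject step would be invoked along the lines of
#   bin/nutch inject crawl/crawldb seeds
# so the Injector only ever reads plain-text seed files from seeds/,
# and the binary CrawlDb directory is never listed as job input.
ls seeds
# -> seed.txt
```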