Posted to user@nutch.apache.org by "Meraj A. Khan" <me...@gmail.com> on 2014/09/15 07:05:53 UTC

Fetch Job Started Failing on Hadoop Cluster

Hello Folks,

My Nutch crawl, which was running fine, started failing in the first Fetch
Job/Application, and I am unable to figure out what's going on. I have
attached the last snippet of the log below; can someone please let me know
what is going wrong here?

What I noticed is that even though the generate phase created a
segment, 20140915004940, the fetch phase is being pointed at the
segments directory itself rather than at that segment.

Thanks.

14/09/15 00:50:07 INFO crawl.Generator: Generator: finished at 2014-09-15
00:50:07, elapsed: 00:00:59
ls: cannot access crawldirectory/segments/: No such file or directory
Operating on segment :
Fetching :
14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: starting at 2014-09-15
00:50:09
14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: segment:
crawldirectory/segments
14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher Timelimit set for :
1410767409664
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
/opt/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled
stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'.
14/09/15 00:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
server1.mydomain.com/170.75.152.162:8040
14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
server1.mydomain.com/170.75.152.162:8040
14/09/15 00:50:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area
/tmp/hadoop-yarn/staging/df/.staging/job_1410742329411_0010
14/09/15 00:50:12 WARN security.UserGroupInformation:
PriviledgedActionException as:df (auth:SIMPLE)
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://
server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
14/09/15 00:50:12 WARN security.UserGroupInformation:
PriviledgedActionException as:df (auth:SIMPLE)
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://
server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
14/09/15 00:50:12 ERROR fetcher.Fetcher: Fetcher:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
hdfs://
server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
    at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at
org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:108)
    at
org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
    at
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
    at
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1349)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1385)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1358)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Re: Fetch Job Started Failing on Hadoop Cluster

Posted by "Meraj A. Khan" <me...@gmail.com>.
Markus,

Thanks - the issue was that I was setting the PATH variable inside the
bin/crawl script. Once I removed it and set PATH outside of the bin/crawl
script, it started working fine.
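For anyone hitting the same thing, a minimal sketch of that fix (the Hadoop install path here is an assumption, taken from the log above): export PATH once in the calling environment rather than inside bin/crawl, so the script simply inherits it.

```shell
# Set PATH in the invoking shell (e.g. ~/.bashrc or a wrapper script),
# NOT inside bin/crawl itself; bin/crawl then inherits it unchanged.
# /opt/hadoop-2.3.0/bin is an assumed location, matching the log above.
export PATH="$PATH:/opt/hadoop-2.3.0/bin"

# bin/crawl is then invoked as usual, e.g.:
#   bin/crawl urls crawldirectory http://localhost:8983/solr 2
```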



On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma <ma...@openindex.io>
wrote:

> Hi - you made Nutch believe that
> hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a
> segment, but it is not. So either no segment was created, or it was
> written to the wrong location.
>
> I don't know what kind of script you are using, but you should check the
> return code of the generator; it returns -1 when no segment was created.
>
> Markus

RE: Fetch Job Started Failing on Hadoop Cluster

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - you made Nutch believe that
hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a
segment, but it is not. So either no segment was created, or it was written
to the wrong location.

I don't know what kind of script you are using, but you should check the
return code of the generator; it returns -1 when no segment was created.
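A sketch of that check (directory layout and helper name are assumptions, modeled on what a crawl script typically does): stop when the generator produced no segment, and select the newest segment explicitly instead of handing Fetcher the bare segments/ directory.

```shell
# Print the newest timestamp-named segment directory under the given
# segments/ path, or fail (non-zero exit) when there is none.
newest_segment() {
  ls "$1" 2>/dev/null | sort -n | tail -n 1 | grep .
}

# In a crawl script this would follow the generate step, e.g.:
#   bin/nutch generate "$CRAWL_PATH/crawldb" "$CRAWL_PATH/segments" || exit 1
#   SEGMENT=$(newest_segment "$CRAWL_PATH/segments") || exit 1
#   bin/nutch fetch "$CRAWL_PATH/segments/$SEGMENT" -threads 10
```

Guarding both steps this way prevents the failure mode in the log above, where an empty segment name made Fetcher treat crawldirectory/segments itself as the segment.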

Markus




 
 