Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/08/01 14:51:40 UTC

Re: cannot find nutch logs in distributed mode

Hi Srini,

> I am referring to the INFO messages that are printed to the console when nutch
> 1.14 is running in distributed mode. For example:

Afaics, the only way to get the logs of the job client is to redirect the console output to a file,
e.g.,

/mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt &>inject.log
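
A hedged variation on the same idea (the file name and the use of nohup are
only illustrative): run the client under nohup so it survives a closed
terminal session, and keep both stdout and stderr in one file:

  # job client output (stdout + stderr) goes to inject.log; nohup keeps the
  # client alive if the submitting terminal goes away
  nohup /mnt/nutch/runtime/deploy/bin/nutch inject \
      /user/hadoop/crawlDIR/crawldb seed.txt > inject.log 2>&1 &
  tail -f inject.log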

> I am running nutch from an EMR cluster.

If you're interested in the logs of task attempts, see:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
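
For a quick look from the master node, something along these lines usually
works, assuming YARN log aggregation is enabled (it normally is on EMR); the
application id below is just the one matching the job id in the output above:

  # list recent applications and pick the id of the Nutch job
  yarn application -list -appStates ALL
  # dump all container logs of that application into one file
  yarn logs -applicationId application_1500749038440_0003 > inject-containers.log

If a log URI was configured for the cluster, EMR also copies the container
logs to S3 a few minutes after they are written, so they survive cluster
termination.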


Sebastian

On 07/29/2017 09:38 AM, Srinivasan Ramaswamy wrote:
> Hi Sebastian,
> 
> I am referring to the INFO messages that are printed to the console when nutch
> 1.14 is running in distributed mode. For example:
> 
> Injecting seed URLs
> /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb
> seed.txt
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: starting at 2017-07-29
> 06:51:18
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: crawlDb:
> /user/hadoop/crawlDIR/crawldb
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: urlDir: seed.txt
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: Converting injected urls
> to crawl db entries.
> 17/07/29 06:51:19 INFO client.RMProxy: Connecting to ResourceManager at
> ip-*-*-*-*.ec2.internal/*.*.*.*:8032
> 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process
> : 0
> 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process
> : 1
> .
> .
> 17/07/29 06:51:20 INFO mapreduce.Job: Running job: job_1500749038440_0003
> 17/07/29 06:51:28 INFO mapreduce.Job: Job job_1500749038440_0003 running in
> uber mode : false
> 17/07/29 06:51:28 INFO mapreduce.Job:  map 0% reduce 0%
> 17/07/29 06:51:33 INFO mapreduce.Job:  map 100% reduce 0%
> 17/07/29 06:51:38 INFO mapreduce.Job:  map 100% reduce 4%
> 17/07/29 06:51:40 INFO mapreduce.Job:  map 100% reduce 6%
> 17/07/29 06:51:41 INFO mapreduce.Job:  map 100% reduce 49%
> 17/07/29 06:51:42 INFO mapreduce.Job:  map 100% reduce 66%
> 17/07/29 06:51:43 INFO mapreduce.Job:  map 100% reduce 87%
> 17/07/29 06:51:44 INFO mapreduce.Job:  map 100% reduce 100%
> 
> I am running nutch from an EMR cluster. I did check around the log
> directories and I don't see the messages I see in the console anywhere else.
> 
> One more thing I noticed is that when I issue the command
> 
> *ps -ef | grep nutch*
> 
> hadoop    21616  18344  2 06:59 pts/1    00:00:09
> /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx1000m -server
> -XX:OnOutOfMemoryError=kill -9 %p *-Dhadoop.log.dir=/usr/lib/hadoop/logs*
> *-Dhadoop.log.file=hadoop.log* -Dhadoop.home.dir=/usr/lib/hadoop
> -Dhadoop.id.str= *-Dhadoop.root.logger=INFO,console*
> -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
> -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
> -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
> org.apache.hadoop.util.RunJar
> /mnt/nutch/runtime/deploy/apache-nutch-1.14-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher -D mapreduce.map.java.opts=-Xmx2304m -D
> mapreduce.map.memory.mb=2880 -D mapreduce.reduce.java.opts=-Xmx4608m -D
> mapreduce.reduce.memory.mb=5760 -D mapreduce.job.reduces=12 -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> /user/hadoop/crawlDIR/segments/20170729065841 -noParsing -threads 100
> 
> The logger mentioned in the running process is console. How do I change it
> to the log file rotated by log4j?
> 
> I tried modifying the conf/log4j.properties file to use the DRFA appender
> instead of cmdstdout, but that did not help either.
> 
> Any help would be appreciated.
> 
> Thanks
> Srini
> 
> On Mon, Jul 24, 2017 at 12:52 AM, Sebastian Nagel <
> wastl.nagel@googlemail.com> wrote:
> 
>> Hi Srini,
>>
>> in distributed mode the bulk of Nutch's log output is kept in the Hadoop
>> task logs. Whether, how long, and where these logs are kept depends on the
>> configuration of your Hadoop cluster.  You can easily find tutorials and
>> examples of how to configure this if you search for "hadoop task logs".
>>
>> Be careful: the Nutch logs are usually huge.  The easiest way to get them
>> for a job is to run the following command on the master node:
>>
>>   yarn logs -applicationId <app_id>
>>
>> Best,
>> Sebastian
>>
>> On 07/21/2017 10:09 PM, Srinivasan Ramaswamy wrote:
>>> Hi
>>>
>>> I am running nutch in distributed mode. I would like to see all nutch logs
>>> written to files. I only see the console output. Can I see the same
>>> information logged to some log files?
>>>
>>> When I run nutch in local mode I do see the logs in the runtime/local/logs
>>> directory. But when I run nutch in distributed mode, I don't see it anywhere
>>> except the console.
>>>
>>> Can anyone help me with the settings that I need to change?
>>>
>>> Thanks
>>> Srini
>>>
>>
>>
> 


Re: cannot find nutch logs in distributed mode

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Srini,

In local mode all log output from
- job client
- application master / job tracker
- YARN containers (map-reduce task attempts)
ends up in the same file, simply because everything runs inside one single JVM.
In distributed mode these run as separate processes on different machines,
each writing to its own log files.  That makes logging more complex.
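
A rough map of where each of these ends up (paths and property names are the
common defaults and may differ on your cluster):

  # 1) job client: whatever bin/nutch prints on the submitting host;
  #    capture it yourself, e.g.
  bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt > inject.log 2>&1

  # 2) + 3) application master and task containers: per-container logs under
  #    yarn.nodemanager.log-dirs on each worker while the job runs, and
  #    retrievable in one go afterwards if log aggregation is enabled:
  yarn logs -applicationId <app_id>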

Have a look at

https://discuss.pivotal.io/hc/en-us/articles/201925118-How-to-Find-and-Review-Logs-for-Yarn-MapReduce-Jobs
That's a condensed introduction to the topic.

Please note that for more detailed questions, the Hadoop user list or a forum
dedicated to EMR is a better place to ask.


Best,
Sebastian


On 08/02/2017 09:55 AM, Srinivasan Ramaswamy wrote:
> Thanks for your reply Sebastian. I asked this question for the following
> reasons:
> 
> * We were running the crawl script using nohup and we redirected the output to
> a local log file. In some weird/rare scenario (maybe our master node went
> down at that time, I am not sure), the log file stopped growing but the nutch
> process was still running. We could not really see what it (nutch) was doing.
> 
> * I see that the nutch code uses log4j to log, so I am wondering whether it
> should all go to a log4j-rotated log file instead of just the console. The
> same works well in local mode. Can you please explain why it doesn't write to
> a file and only to the console?
> 
> * It also puzzles me why the running process shows
> "-Dhadoop.root.logger=INFO,console" even though I changed conf/log4j.properties
> to "log4j.rootLogger=INFO,DRFA".
> 
> Thanks
> Srini


Re: cannot find nutch logs in distributed mode

Posted by Srinivasan Ramaswamy <ur...@gmail.com>.
Thanks for your reply Sebastian. I asked this question for the following
reasons:

* We were running the crawl script using nohup and we redirected the output to
a local log file. In some weird/rare scenario (maybe our master node went
down at that time, I am not sure), the log file stopped growing but the nutch
process was still running. We could not really see what it (nutch) was doing.

* I see that the nutch code uses log4j to log, so I am wondering whether it
should all go to a log4j-rotated log file instead of just the console. The
same works well in local mode. Can you please explain why it doesn't write to
a file and only to the console?

* It also puzzles me why the running process shows
"-Dhadoop.root.logger=INFO,console" even though I changed conf/log4j.properties
to "log4j.rootLogger=INFO,DRFA".

Thanks
Srini
