Posted to user@nutch.apache.org by ja...@thomson.com on 2006/09/13 22:00:05 UTC

0.8 Intranet Crawl Output/Logging?

I am using the nutch 0.8 'crawl' command to crawl some content.  When I
run the crawl command, I don't see any output, but the crawl is
running...  Is there a way to see information about what the crawler is
doing?

I have tried setting 'fetcher.verbose' to 'true' in my nutch-site.xml,
but it made no difference to the behaviour.

I am trying to enable some plugins (file protocol and parse-xml plugin)
but I can't tell if they are being loaded correctly without some output
from nutch.

Thanks!
Jared-

Re: 0.8 Intranet Crawl Output/Logging?

Posted by Tomi NA <he...@gmail.com>.
On 9/14/06, jared.dunne@thomson.com <ja...@thomson.com> wrote:
> Everyone, thanks for the help with this.  I hope to return the
> assistance, once I am more familiar with 0.8.  I am using tail -f now to
> monitor my test crawls.  It also looks like you can use
> conf/hadoop-env.sh to redirect log file output to a different location
> for each of your configurations.
>
> One follow-up question:
> Now that I can actually see the log, I am finding some of the output
> rather annoying/noisy.  Specifically, I am referring to the Registered
> Plugins and Registered Extension-Points output.  It's nice to see that
> once at crawl start, but not with every step of the crawl.
>
> So does anyone know if I can disable that output?  Here's the output to
> which I refer:
>
> 2006-09-14 14:03:42,852 INFO  plugin.PluginRepository - Plugins: looking
> in: /var/nutch/nutch-0.8/plugins
> 2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Registered
> Plugins:

watch -n 1 "grep -v PluginRepository /home/wmelo/nutch-0.8/logs/hadoop.log | tail -n 20"

t.n.a.
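
The one-liner above can also be wrapped in a small shell function so the log path and line count are easy to change. This is only a sketch: the function name filter_plugin_noise is made up for illustration and is not part of Nutch, and it assumes each log entry sits on a single line in hadoop.log (the wrapped lines quoted in this thread come from the mail client).

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not part of Nutch):
# print the last N log lines after dropping PluginRepository entries.
filter_plugin_noise() {
  log_file="$1"      # path to hadoop.log
  count="${2:-20}"   # number of lines to show, default 20
  grep -v 'PluginRepository' "$log_file" | tail -n "$count"
}
```

Something like `while :; do clear; filter_plugin_noise logs/hadoop.log; sleep 1; done` should give roughly the same view as the watch command. For a continuously scrolling view, `tail -f logs/hadoop.log | grep --line-buffered -v PluginRepository` may be simpler, assuming GNU grep.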

Re: 0.8 Intranet Crawl Output/Logging?

Posted by Renaud Richardet <re...@wyona.com>.
Hello Jared,

jared.dunne@thomson.com wrote:
> Everyone, thanks for the help with this.  I hope to return the
> assistance, once I am more familiar with 0.8.  I am using tail -f now to
> monitor my test crawls.  It also looks like you can use
> conf/hadoop-env.sh to redirect log file output to a different location
> for each of your configurations.
>
> One follow-up question:
> Now that I can actually see the log, I am finding some of the output
> rather annoying/noisy.  Specifically, I am referring to the Registered
> Plugins and Registered Extension-Points output.  It's nice to see that
> once at crawl start, but not with every step of the crawl.
>
> So does anyone know if I can disable that output?

Please see http://issues.apache.org/jira/browse/NUTCH-346

HTH,
Renaud

> Here's the output to
> which I refer:
>
> 2006-09-14 14:03:42,852 INFO  plugin.PluginRepository - Plugins: looking
> in: /var/nutch/nutch-0.8/plugins
> 2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -
> CyberNeko HTML Parser (lib-nekohtml)
> 2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Site
> Query Filter (query-site)
> 2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Html
> Parse Plug-in (parse-html)
> [snip]
> 2006-09-14 14:03:43,031 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Nutch
> Summarizer (org.apache.nutch.searcher.Summarizer)
> 2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Nutch 
> [snip]
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2006-09-14 14:03:43,032 INFO  plugin.PluginRepository -         Nutch
> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2006-09-14 14:03:43,032 INFO  plugin.PluginRepository -         Nutch
> Content Parser (org.apache.nutch.parse.Parser)
> [snip]

-- 
Renaud Richardet
COO America
Wyona    -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                  mobile +1 617 230 9112
renaud.richardet <at> wyona.com           http://www.wyona.com
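
Independently of the JIRA issue, one possible way to quiet these lines is to raise the log level for the PluginRepository logger. The sketch below rests on two assumptions that are not confirmed in this thread: that your installation reads a log4j 1.x properties file (for example conf/log4j.properties), and that the logger is named after the class org.apache.nutch.plugin.PluginRepository. The helper name quiet_plugin_repository is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a log4j 1.x logger override to a properties
# file, unless an override for that logger is already present.
quiet_plugin_repository() {
  props_file="$1"    # e.g. conf/log4j.properties (assumed location)
  key='log4j.logger.org.apache.nutch.plugin.PluginRepository'
  if ! grep -q "^${key}=" "$props_file" 2>/dev/null; then
    # WARN hides the INFO-level "Registered Plugins" listing
    printf '%s=WARN\n' "$key" >> "$props_file"
  fi
}
```

Calling `quiet_plugin_repository conf/log4j.properties` once should be enough; repeated calls leave the file unchanged. Note that this hides the listing at crawl start as well, not only the repeats.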


RE: 0.8 Intranet Crawl Output/Logging?

Posted by ja...@thomson.com.
Everyone, thanks for the help with this.  I hope to return the
assistance, once I am more familiar with 0.8.  I am using tail -f now to
monitor my test crawls.  It also looks like you can use
conf/hadoop-env.sh to redirect log file output to a different location
for each of your configurations.
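
A rough sketch of the per-configuration log location idea follows. The variable name NUTCH_LOG_DIR is an assumption here: check your bin/nutch script (or the equivalent setting in conf/hadoop-env.sh) to see which variable it actually reads. The helper log_dir_for, the base path and the configuration name are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive one log directory per crawl configuration.
log_dir_for() {
  base_dir="$1"      # e.g. /var/nutch/logs (illustrative path)
  config_name="$2"   # e.g. intranet (illustrative name)
  # strip one trailing slash from the base so the result has no "//"
  printf '%s/%s\n' "${base_dir%/}" "$config_name"
}

# Assumption: the launcher script honours NUTCH_LOG_DIR; verify this in
# your own bin/nutch before relying on it.
NUTCH_LOG_DIR=$(log_dir_for /var/nutch/logs intranet)
export NUTCH_LOG_DIR
```

With a separate directory per configuration, each test crawl can be followed with its own tail -f without the logs interleaving.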

One follow-up question:
Now that I can actually see the log, I am finding some of the output
rather annoying/noisy.  Specifically, I am referring to the Registered
Plugins and Registered Extension-Points output.  It's nice to see that
once at crawl start, but not with every step of the crawl.

So does anyone know if I can disable that output?  Here's the output to
which I refer:

2006-09-14 14:03:42,852 INFO  plugin.PluginRepository - Plugins: looking
in: /var/nutch/nutch-0.8/plugins
2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Registered
Plugins:
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Site
Query Filter (query-site)
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Html
Parse Plug-in (parse-html)
[snip]
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository - Registered
Extension-Points:
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Nutch 
[snip]
Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2006-09-14 14:03:43,032 INFO  plugin.PluginRepository -         Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2006-09-14 14:03:43,032 INFO  plugin.PluginRepository -         Nutch
Content Parser (org.apache.nutch.parse.Parser)
[snip]

Jared-

-----Original Message-----
From: Jacob Brunson [mailto:jacob.brunson@gmail.com] 
Sent: Thursday, September 14, 2006 1:24 AM
To: nutch-user@lucene.apache.org
Subject: Re: 0.8 Intranet Crawl Output/Logging?

On my system, I run the crawl command in one shell while running this
command in another shell to monitor the crawl:
tail -f logs/hadoop.log
Of course this does about the same thing as listed below, but "tail
-f" is a little easier to remember.

On 9/13/06, Tomi NA <he...@gmail.com> wrote:
> On 9/13/06, wmelo <wm...@olimpo.com.br> wrote:
> > I have the same original doubt.  I know that the log shows informations,
> > but, how to see the things happening, real time, like in nutch 0.7.2, when
> > you use the crawl command in the terminal?
>
> try something like this (assuming you know what's good for you so you
> use a *n*x):
> watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log"
>
> Please replace the path to your "logs" directory to match your
> environment and report back if there's a problem.
> Hope it helps.
>
> t.n.a.
>


-- 
http://JacobBrunson.com

Re: 0.8 Intranet Crawl Output/Logging?

Posted by Jacob Brunson <ja...@gmail.com>.
On my system, I run the crawl command in one shell while running this
command in another shell to monitor the crawl:
tail -f logs/hadoop.log
Of course this does about the same thing as listed below, but "tail
-f" is a little easier to remember.

On 9/13/06, Tomi NA <he...@gmail.com> wrote:
> On 9/13/06, wmelo <wm...@olimpo.com.br> wrote:
> > I have the same original doubt.  I know that the log shows  informations,
> > but, how to see the things happening, real time, like in nutch 0.7.2, when
> > you use the crawl command in the terminal?
>
> try something like this (assuming you know what's good for you so you
> use a *n*x):
> watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log"
>
> Please replace the path to your "logs" directory to match your
> environment and report back if there's a problem.
> Hope it helps.
>
> t.n.a.
>


-- 
http://JacobBrunson.com

Re: 0.8 Intranet Crawl Output/Logging?

Posted by Jim Wilson <wi...@gmail.com>.
If you don't know what's good for you, baretail can provide a suitable
Windows alternative.

http://www.baremetalsoft.com/baretail/

-- Jim

On 9/13/06, Tomi NA <he...@gmail.com> wrote:
>
> On 9/13/06, wmelo <wm...@olimpo.com.br> wrote:
> > I have the same original doubt.  I know that the log shows informations,
> > but, how to see the things happening, real time, like in nutch 0.7.2, when
> > you use the crawl command in the terminal?
>
> try something like this (assuming you know what's good for you so you
> use a *n*x):
> watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log"
>
> Please replace the path to your "logs" directory to match your
> environment and report back if there's a problem.
> Hope it helps.
>
> t.n.a.
>

Re: 0.8 Intranet Crawl Output/Logging?

Posted by Tomi NA <he...@gmail.com>.
On 9/13/06, wmelo <wm...@olimpo.com.br> wrote:
> I have the same original doubt.  I know that the log shows  informations,
> but, how to see the things happening, real time, like in nutch 0.7.2, when
> you use the crawl command in the terminal?

try something like this (assuming you know what's good for you so you
use a *n*x):
watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log"

Please replace the path to your "logs" directory to match your
environment and report back if there's a problem.
Hope it helps.

t.n.a.

Re: 0.8 Intranet Crawl Output/Logging?

Posted by wmelo <wm...@olimpo.com.br>.
I have the same question as the original poster.  I know that the log shows
the information, but how can I watch things happening in real time, as in
nutch 0.7.2, when running the crawl command in the terminal?

----- Original Message ----- 
From: "Ben Ogle" <og...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, September 13, 2006 5:59 PM
Subject: Re: 0.8 Intranet Crawl Output/Logging?


>
> Look in the hadoop.log file under the nutch-0.8/logs dir. It should have 
> that
> info.
>
> Ben
>
>
> jared.dunne wrote:
>>
>> I am using the nutch 0.8 'crawl' command to crawl some content.  When I
>> run the crawl command, I don't see any output, but the crawl is
>> running...  Is there a way to see information about what the crawler is
>> doing?
>>
>> I have tried setting 'fetcher.verbose' to 'true' in my nutch-site.xml
>> causing no change to the behaviour.
>>
>> I am trying to enable some plugins (file protocol and parse-xml plugin)
>> but I can't tell if they are being loaded correctly without some output
>> from nutch.
>>
>> Thanks!
>> Jared-
>>
>>
>
> -- 
> View this message in context: 
> http://www.nabble.com/0.8-Intranet-Crawl-Output-Logging--tf2267654.html#a6294542
> Sent from the Nutch - User forum at Nabble.com.
>


Re: 0.8 Intranet Crawl Output/Logging?

Posted by Ben Ogle <og...@gmail.com>.
Look in the hadoop.log file under the nutch-0.8/logs dir. It should have that
info.

Ben


jared.dunne wrote:
> 
> I am using the nutch 0.8 'crawl' command to crawl some content.  When I
> run the crawl command, I don't see any output, but the crawl is
> running...  Is there a way to see information about what the crawler is
> doing?
> 
> I have tried setting 'fetcher.verbose' to 'true' in my nutch-site.xml
> causing no change to the behaviour.
> 
> I am trying to enable some plugins (file protocol and parse-xml plugin)
> but I can't tell if they are being loaded correctly without some output
> from nutch.
> 
> Thanks!
> Jared-
> 
> 

-- 
View this message in context: http://www.nabble.com/0.8-Intranet-Crawl-Output-Logging--tf2267654.html#a6294542
Sent from the Nutch - User forum at Nabble.com.
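
To answer the original question of whether a plugin was loaded, the log can be searched for the plugin id, which appears in parentheses in the "Registered Plugins" listing quoted in this thread. The sketch below is only an illustration: plugin_registered is a hypothetical helper name, and it assumes each log entry is on a single line in hadoop.log.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the log lists a plugin id such as
# "parse-html" in a PluginRepository line, for example
# "... plugin.PluginRepository -   Html Parse Plug-in (parse-html)".
plugin_registered() {
  log_file="$1"      # path to hadoop.log
  plugin_id="$2"     # plugin id as shown in parentheses in the log
  grep 'PluginRepository' "$log_file" | grep -q -F "(${plugin_id})"
}
```

For example, `plugin_registered logs/hadoop.log parse-xml && echo "parse-xml is registered"`. A registered plugin is not necessarily the one chosen for a given document, so this only confirms that the plugin was found and listed.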