Posted to user@nutch.apache.org by Jason Stubblefield <mr...@gmail.com> on 2011/06/13 12:07:52 UTC

Nutch 1.3 fetch: "No agents listed in 'http.agent.name' property"

Hello,

I'm trying to fetch a segment using hadoop on a single node with nutch 1.3.
I seem to be struggling with the new runtime configuration.  I have hadoop
up and running and have successfully run the readdb -stats command and
generated a segment, but when I run:

runtime/deploy/bin/nutch fetch crawl/segments/20110613103305 -threads 8

I get an error message: No agents listed in 'http.agent.name' property

I noticed there are now 2 conf directories, one at trunk/conf and the other
at trunk/runtime/local/conf, and I have updated both of them with my
nutch-site.xml file. Both have a properly configured http.agent.name.

Do I need to explicitly declare the conf directory somewhere?  Do I need to
move the conf file to trunk/runtime/deploy/conf, or put it somewhere else?
What am I missing?
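
For anyone hitting the same error, here is a rough sketch of a check for
whether a given nutch-site.xml really sets a non-empty http.agent.name. It
is a plain text match, not an XML parse, and it assumes the usual layout
where the <value> line follows the <name> line within two lines. The
function name has_agent_name and the script name in the comment are made up
for illustration, and grep -A is a common extension rather than strict
POSIX.

```shell
#!/bin/sh
# has_agent_name FILE
# Succeeds if FILE has an http.agent.name property whose <value> is not
# empty. Hypothetical helper; assumes <value> follows <name> closely.
has_agent_name() {
    conf_file="$1"
    [ -f "$conf_file" ] || return 1
    grep -A2 '<name>http.agent.name</name>' "$conf_file" \
        | grep -q '<value>[^<[:space:]]'
}

# Check each conf directory given on the command line, for example:
#   sh check_agent.sh conf runtime/local/conf
for dir in "$@"; do
    if has_agent_name "$dir/nutch-site.xml"; then
        echo "$dir: http.agent.name is set"
    else
        echo "$dir: http.agent.name missing or empty"
    fi
done
```

Running it against every directory that might be on the classpath shows
which copies of nutch-site.xml define the property and which do not.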

Thanks in advance!

~Jason

Re: Nutch 1.3 fetch: "No agents listed in 'http.agent.name' property"

Posted by Julien Nioche <li...@gmail.com>.
On 13 June 2011 21:15, Jason Stubblefield <mr...@gmail.com> wrote:

> Thanks for the help Julien, I'll just copy the files to the hadoop conf
> directory for now while it is a single node.
>
> If I use the job file do I have to have the nutch package on each node in
> the cluster, or just on the master node?
>

Just on the master. It is sent to all the nodes for you, just like any
normal mapreduce job.


> I'm also curious if it would be possible or practical to declare the
> NUTCH_CONF_DIR in a nutch-env.sh file like hadoop uses, or somewhere in the
> nutch script.  Thanks again.
>

Hmm. Relying on the conf files on the master only is OK, but that won't help
with the URLFilter files etc. It is much simpler to generate a job file, use
it from the master and let hadoop distribute it to the slaves.

Julien






-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Nutch 1.3 fetch: "No agents listed in 'http.agent.name' property"

Posted by Jason Stubblefield <mr...@gmail.com>.
Thanks for the help Julien, I'll just copy the files to the hadoop conf
directory for now while it is a single node.

If I use the job file do I have to have the nutch package on each node in
the cluster, or just on the master node?

I'm also curious if it would be possible or practical to declare the
NUTCH_CONF_DIR in a nutch-env.sh file like hadoop uses, or somewhere in the
nutch script.  Thanks again.

~Jason


Re: Nutch 1.3 fetch: "No agents listed in 'http.agent.name' property"

Posted by Julien Nioche <li...@gmail.com>.
Hi Jason,

If you have hadoop running independently from Nutch you should use
runtime/deploy/bin. The conf files can go directly in the hadoop/conf dir, or
in the Nutch job, which you will need to regenerate with 'ant job' so that it
reflects the changes you made in NUTCH/conf.
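
As a rough sketch of those two options (not an official procedure): the
function names copy_nutch_conf and rebuild_job are made up for
illustration, and the paths in the usage note assume the
/webcrawler/hadoop and /webcrawler/nutch layout mentioned in this thread,
with each having a conf subdirectory.

```shell
#!/bin/sh
# Option 1: copy the Nutch conf files into the Hadoop conf directory.
# Only regular files at the top level of SRC are copied. Same-named files
# in DST are overwritten, so back up the Hadoop conf directory first.
copy_nutch_conf() {
    src="$1"
    dst="$2"
    [ -d "$src" ] && [ -d "$dst" ] || return 1
    for f in "$src"/*; do
        if [ -f "$f" ]; then
            cp "$f" "$dst"/ || return 1
        fi
    done
    return 0
}

# Option 2: regenerate the Nutch job file so that it picks up the changes
# made in the Nutch conf directory. Requires ant and a Nutch source tree.
rebuild_job() {
    nutch_home="$1"
    (cd "$nutch_home" && ant job)
}
```

Usage would be one or the other, for example:

    copy_nutch_conf /webcrawler/nutch/conf /webcrawler/hadoop/conf
    rebuild_job /webcrawler/nutch

With the second option the filter files travel inside the job file, so
nothing has to be copied by hand.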

Julien





Re: Nutch 1.3 fetch: "No agents listed in 'http.agent.name' property"

Posted by Jason Stubblefield <mr...@gmail.com>.
Update:  The nutch configuration files need to go in the hadoop conf
directory.

Maybe someone could recommend some best practices regarding the file
structure?  Should all the nutch config files simply be copied to the hadoop
conf directory?  Currently I have:

/webcrawler/hadoop
/webcrawler/nutch

I guess I'm a bit confused because 1.3 didn't come bundled with hadoop.

Thanks!

~Jason
