Posted to user@nutch.apache.org by Justin Hartman <jj...@gmail.com> on 2007/01/28 10:17:49 UTC

Fetcher threads & automation

Hi all

Just have a couple more questions which remain unclear to me at this stage.

1. I'm fetching URLs on a P4 2.8 GHz machine with 1 GB RAM and a 100 Mbps
connection. Based on this config, what would you recommend as the maximum
number of fetcher threads?

2. Does anyone know of a script or plugin that can automate the
segment/fetch/indexing process? Basically, I'm fetching about 20
million pages and I have to run the segment, fetch and index steps
myself in a shell (which takes some time). I would really like some
sort of shell script that runs the whole process as a daemon in the
background, so that I can worry about other issues.

Thank you in advance!!!!
-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Fetcher threads & automation

Posted by Dennis Kubes <nu...@dragonflymc.com>.

Stupid question, but are we sure that it is named logging.conf (the same
name as the logconf variable) and that it is readable?

Dennis
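
For readers who want to run the two checks asked about above (file name
and readability), here is a minimal sketch. It is written for Python 3,
not the Python 2.4 used in this thread, and it is not part of
JobStream.py. The function name check_logging_conf and the list of
required sections are assumptions made for illustration; the three
section names are the ones logging.config.fileConfig reads first.

```python
import configparser
import os

# Sections that logging.config.fileConfig expects to find (assumed list).
REQUIRED_SECTIONS = ("loggers", "handlers", "formatters")


def check_logging_conf(path):
    """Return a list of problems; an empty list means the file looks usable."""
    full_path = os.path.abspath(path)
    if not os.path.isfile(path):
        return ["not found: %s" % full_path]
    if not os.access(path, os.R_OK):
        return ["not readable: %s" % full_path]
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read(path)
    except configparser.Error as error:
        return ["not parseable: %s" % error]
    # Report every required section that is absent, in a fixed order.
    return ["missing section [%s]" % name
            for name in REQUIRED_SECTIONS
            if not parser.has_section(name)]
```

Calling check_logging_conf("logging.conf") from the directory in which
the job is started reports the absolute path that was actually tried.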

Justin Hartman wrote:
> Hi Dennis
> 
> The logging.conf file is in the /hdd2/jobstream/ folder along with the
> python script. I haven't modified the logging.conf file at all -
> should I?
> 
> Regards
> Justin
> 
> On 1/29/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>> Justin,
>>
>> Thanks for the update.  I will update the script and the wiki so that it
>> can run from a clean state with no previous fetches.  Currently it
>> assumes that there are at least some previous fetches, a crawldb, and
>> segments to go with it.
>>
>> As to your error, I think it is looking for the logging.conf file.  Is
>> that file in the same directory as the JobStream.py script?  At the top
>> of the logging file there is a section called formatters like this:
>>
>> [formatters]
>> keys=simple
>>
>>
>> Dennis Kubes
>>
>> Justin Hartman wrote:
>> > Hi Dennis
>> >
>> > This is a great contribution and I personally thank you for making it
>> > available to the community.
>> >
>> > I am having a little difficulty getting it to work and possibly you
>> > can provide some assistance in what I'm doing wrong?
>> >
>> > A little background first:-
>> > I'm running the python script in the following location:
>> > /hdd2/jobstream/JobStream.py
>> > My master directory is: /hdd2/nutch/master
>> > My backup directory is: /hdd2/nutch/backup
>> >
>> > My config in JobStream.py is as follows:-
>> >
>> > Line 55 to 60 configured as:
>> > class JobStream:
>> >  nutchdir = "/home/nutch/nutch"
>> >  masterdir = "/hdd2/nutch/master"
>> >  backupdir = "/hdd2/nutch/backup"
>> >  log = logging.getLogger("jobstream")
>> >
>> > Line 377 onwards configured as:
>> > def main(argv):
>> >  # set the default values
>> >  resume = 0
>> >  execute = 0
>> >  checkfile = "jobstream.stop"
>> >  logconf = "logging.conf"
>> >  jobdir = "/hdd2/jobstream"
>> >  nutchdir = "/home/nutch/nutch"
>> >  masterdir = "/hdd2/nutch/master"
>> >  backupdir = "/hdd2/nutch/backup"
>> >  dfsdumpdir = "/hdd2/nutch/dump"
>> >  tempdir = "/hdd2/nutch/temp"
>> >  splitsize = 500000
>> >  fetchmerge = 3
>> >
>> > All the above paths are correct and have been created and the master
>> > and backup directories contain zero data and have been created for
>> > usage of the python script.
>> >
>> > When executing JobStream.py -e for the first time I got an error
>> > telling me it could not find various directories within the master
>> > directory so I injected the URLs into the /hdd2/nutch/master
>> > directory.
>> >
>> > This solved my initial error; however, I now have the error below and
>> > am not sure what to do about it:
>> >
>> > /usr/bin/python2.4 /hdd2/jobstream/JobStream.py -e
>> > Traceback (most recent call last):
>> >  File "/hdd2/jobstream/JobStream.py", line 465, in ?
>> >    main(sys.argv[1:])
>> >  File "/hdd2/jobstream/JobStream.py", line 444, in main
>> >    logging.config.fileConfig(logconf)
>> >  File "logging/config.py", line 76, in fileConfig
>> >  File "/usr/lib/python2.4/ConfigParser.py", line 511, in get
>> >    raise NoSectionError(section)
>> > ConfigParser.NoSectionError: No section: 'formatters'
>> >
>> > Do you have any ideas?
>> >
>> > Regards
>> > Justin
>> >
>> > On 1/29/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>> >> It is up on the wiki at the following location.
>> >>
>> >> http://wiki.apache.org/nutch/Automating_Fetches_with_Python
>> >>
>> >> It has also been added to the front page.
>> >>
>> >> Dennis Kubes
>> >>
>> >> Andrzej Bialecki wrote:
>> >> > Dennis Kubes wrote:
>> >> >> We have a python script with logging which fully automates the
>> >> >> fetching and updating process, not the invert links or the indexing
>> >> >> process.  If anybody wants a copy, send me an email and I will send
>> >> >> you a copy.
>> >> >>
>> >> >> We are currently working on a more in-depth framework for automating
>> >> >> these types of job streams in python but that is not complete yet.
>> >> >>
>> >> >> Andrzej, do you think this is something we should post to the wiki?
>> >> >
>> >> > Sure, if it's ok for you to release it I'm sure many people would find
>> >> > it useful.
>> >> >
>> >>
>> >
>> >
>>
> 
>

Re: Fetcher threads & automation

Posted by Justin Hartman <jj...@gmail.com>.

Hi Dennis

The logging.conf file is in the /hdd2/jobstream/ folder along with the
python script. I haven't modified the logging.conf file at all -
should I?

Regards
Justin

On 1/29/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> Justin,
>
> Thanks for the update.  I will update the script and the wiki so that it
> can run from a clean state with no previous fetches.  Currently it
> assumes that there are at least some previous fetches, a crawldb, and
> segments to go with it.
>
> As to your error, I think it is looking for the logging.conf file.  Is
> that file in the same directory as the JobStream.py script?  At the top
> of the logging file there is a section called formatters like this:
>
> [formatters]
> keys=simple
>
>
> Dennis Kubes
>
> Justin Hartman wrote:
> > Hi Dennis
> >
> > This is a great contribution and I personally thank you for making it
> > available to the community.
> >
> > I am having a little difficulty getting it to work and possibly you
> > can provide some assistance in what I'm doing wrong?
> >
> > A little background first:-
> > I'm running the python script in the following location:
> > /hdd2/jobstream/JobStream.py
> > My master directory is: /hdd2/nutch/master
> > My backup directory is: /hdd2/nutch/backup
> >
> > My config in JobStream.py is as follows:-
> >
> > Line 55 to 60 configured as:
> > class JobStream:
> >  nutchdir = "/home/nutch/nutch"
> >  masterdir = "/hdd2/nutch/master"
> >  backupdir = "/hdd2/nutch/backup"
> >  log = logging.getLogger("jobstream")
> >
> > Line 377 onwards configured as:
> > def main(argv):
> >  # set the default values
> >  resume = 0
> >  execute = 0
> >  checkfile = "jobstream.stop"
> >  logconf = "logging.conf"
> >  jobdir = "/hdd2/jobstream"
> >  nutchdir = "/home/nutch/nutch"
> >  masterdir = "/hdd2/nutch/master"
> >  backupdir = "/hdd2/nutch/backup"
> >  dfsdumpdir = "/hdd2/nutch/dump"
> >  tempdir = "/hdd2/nutch/temp"
> >  splitsize = 500000
> >  fetchmerge = 3
> >
> > All the above paths are correct and have been created and the master
> > and backup directories contain zero data and have been created for
> > usage of the python script.
> >
> > When executing JobStream.py -e for the first time I got an error
> > telling me it could not find various directories within the master
> > directory so I injected the URLs into the /hdd2/nutch/master
> > directory.
> >
> > This solved my initial error; however, I now have the error below and
> > am not sure what to do about it:
> >
> > /usr/bin/python2.4 /hdd2/jobstream/JobStream.py -e
> > Traceback (most recent call last):
> >  File "/hdd2/jobstream/JobStream.py", line 465, in ?
> >    main(sys.argv[1:])
> >  File "/hdd2/jobstream/JobStream.py", line 444, in main
> >    logging.config.fileConfig(logconf)
> >  File "logging/config.py", line 76, in fileConfig
> >  File "/usr/lib/python2.4/ConfigParser.py", line 511, in get
> >    raise NoSectionError(section)
> > ConfigParser.NoSectionError: No section: 'formatters'
> >
> > Do you have any ideas?
> >
> > Regards
> > Justin
> >
> > On 1/29/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> >> It is up on the wiki at the following location.
> >>
> >> http://wiki.apache.org/nutch/Automating_Fetches_with_Python
> >>
> >> It has also been added to the front page.
> >>
> >> Dennis Kubes
> >>
> >> Andrzej Bialecki wrote:
> >> > Dennis Kubes wrote:
> >> >> We have a python script with logging which fully automates the
> >> >> fetching and updating process, not the invert links or the indexing
> >> >> process.  If anybody wants a copy, send me an email and I will send
> >> >> you a copy.
> >> >>
> >> >> We are currently working on a more in-depth framework for automating
> >> >> these types of job streams in python but that is not complete yet.
> >> >>
> >> >> Andrzej, do you think this is something we should post to the wiki?
> >> >
> >> > Sure, if it's ok for you to release it I'm sure many people would find
> >> > it useful.
> >> >
> >>
> >
> >
>


-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Fetcher threads & automation

Posted by Dennis Kubes <nu...@dragonflymc.com>.

Justin,

Thanks for the update.  I will update the script and the wiki so that it
can run from a clean state with no previous fetches.  Currently it
assumes that there are at least some previous fetches, a crawldb, and
segments to go with it.

As to your error, I think it is looking for the logging.conf file.  Is
that file in the same directory as the JobStream.py script?  At the top
of the logging file there is a section called formatters like this:

[formatters]
keys=simple


Dennis Kubes
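
To show where a [formatters] section sits in a complete file, here is a
minimal sketch of a configuration that logging.config.fileConfig
accepts, written for Python 3. It is not the logging.conf shipped with
JobStream.py: the handler name "console", the format string, and the
helper load_minimal_conf are assumptions chosen for illustration. Only
the [formatters] / keys=simple lines come from the message above.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical minimal config; only [formatters]/keys=simple is from the thread.
MINIMAL_CONF = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=simple
args=(sys.stderr,)

[formatter_simple]
format=%(asctime)s %(levelname)s %(name)s: %(message)s
"""


def load_minimal_conf():
    """Write the sample config to a temporary file, load it, and clean up."""
    handle, path = tempfile.mkstemp(suffix=".conf")
    with os.fdopen(handle, "w") as conf_file:
        conf_file.write(MINIMAL_CONF)
    try:
        # Keep loggers created earlier enabled; the default would disable them.
        logging.config.fileConfig(path, disable_existing_loggers=False)
    finally:
        os.remove(path)
    return logging.getLogger("jobstream")
```

Each name listed under keys= needs a matching section (logger_root,
handler_console, formatter_simple here), otherwise loading fails.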

Justin Hartman wrote:
> Hi Dennis
> 
> This is a great contribution and I personally thank you for making it
> available to the community.
> 
> I am having a little difficulty getting it to work and possibly you
> can provide some assistance in what I'm doing wrong?
> 
> A little background first:-
> I'm running the python script in the following location:
> /hdd2/jobstream/JobStream.py
> My master directory is: /hdd2/nutch/master
> My backup directory is: /hdd2/nutch/backup
> 
> My config in JobStream.py is as follows:-
> 
> Line 55 to 60 configured as:
> class JobStream:
>  nutchdir = "/home/nutch/nutch"
>  masterdir = "/hdd2/nutch/master"
>  backupdir = "/hdd2/nutch/backup"
>  log = logging.getLogger("jobstream")
> 
> Line 377 onwards configured as:
> def main(argv):
>  # set the default values
>  resume = 0
>  execute = 0
>  checkfile = "jobstream.stop"
>  logconf = "logging.conf"
>  jobdir = "/hdd2/jobstream"
>  nutchdir = "/home/nutch/nutch"
>  masterdir = "/hdd2/nutch/master"
>  backupdir = "/hdd2/nutch/backup"
>  dfsdumpdir = "/hdd2/nutch/dump"
>  tempdir = "/hdd2/nutch/temp"
>  splitsize = 500000
>  fetchmerge = 3
> 
> All the above paths are correct and have been created and the master
> and backup directories contain zero data and have been created for
> usage of the python script.
> 
> When executing JobStream.py -e for the first time I got an error
> telling me it could not find various directories within the master
> directory so I injected the URLs into the /hdd2/nutch/master
> directory.
> 
> This solved my initial error; however, I now have the error below and
> am not sure what to do about it:
> 
> /usr/bin/python2.4 /hdd2/jobstream/JobStream.py -e
> Traceback (most recent call last):
>  File "/hdd2/jobstream/JobStream.py", line 465, in ?
>    main(sys.argv[1:])
>  File "/hdd2/jobstream/JobStream.py", line 444, in main
>    logging.config.fileConfig(logconf)
>  File "logging/config.py", line 76, in fileConfig
>  File "/usr/lib/python2.4/ConfigParser.py", line 511, in get
>    raise NoSectionError(section)
> ConfigParser.NoSectionError: No section: 'formatters'
> 
> Do you have any ideas?
> 
> Regards
> Justin
> 
> On 1/29/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>> It is up on the wiki at the following location.
>>
>> http://wiki.apache.org/nutch/Automating_Fetches_with_Python
>>
>> It has also been added to the front page.
>>
>> Dennis Kubes
>>
>> Andrzej Bialecki wrote:
>> > Dennis Kubes wrote:
>> >> We have a python script with logging which fully automates the
>> >> fetching and updating process, not the invert links or the indexing
>> >> process.  If anybody wants a copy, send me an email and I will send
>> >> you a copy.
>> >>
>> >> We are currently working on a more in-depth framework for automating
>> >> these types of job streams in python but that is not complete yet.
>> >>
>> >> Andrzej, do you think this is something we should post to the wiki?
>> >
>> > Sure, if it's ok for you to release it I'm sure many people would find
>> > it useful.
>> >
>>
> 
>

Re: Fetcher threads & automation

Posted by Justin Hartman <jj...@gmail.com>.

Hi Dennis

This is a great contribution and I personally thank you for making it
available to the community.

I am having a little difficulty getting it to work and possibly you
can provide some assistance in what I'm doing wrong?

A little background first:-
I'm running the python script in the following location:
/hdd2/jobstream/JobStream.py
My master directory is: /hdd2/nutch/master
My backup directory is: /hdd2/nutch/backup

My config in JobStream.py is as follows:-

Line 55 to 60 configured as:
class JobStream:
  nutchdir = "/home/nutch/nutch"
  masterdir = "/hdd2/nutch/master"
  backupdir = "/hdd2/nutch/backup"
  log = logging.getLogger("jobstream")

Line 377 onwards configured as:
def main(argv):
  # set the default values
  resume = 0
  execute = 0
  checkfile = "jobstream.stop"
  logconf = "logging.conf"
  jobdir = "/hdd2/jobstream"
  nutchdir = "/home/nutch/nutch"
  masterdir = "/hdd2/nutch/master"
  backupdir = "/hdd2/nutch/backup"
  dfsdumpdir = "/hdd2/nutch/dump"
  tempdir = "/hdd2/nutch/temp"
  splitsize = 500000
  fetchmerge = 3

All the above paths are correct and have been created and the master
and backup directories contain zero data and have been created for
usage of the python script.

When executing JobStream.py -e for the first time I got an error
telling me it could not find various directories within the master
directory so I injected the URLs into the /hdd2/nutch/master
directory.

This solved my initial error; however, I now have the error below and
am not sure what to do about it:

/usr/bin/python2.4 /hdd2/jobstream/JobStream.py -e
Traceback (most recent call last):
  File "/hdd2/jobstream/JobStream.py", line 465, in ?
    main(sys.argv[1:])
  File "/hdd2/jobstream/JobStream.py", line 444, in main
    logging.config.fileConfig(logconf)
  File "logging/config.py", line 76, in fileConfig
  File "/usr/lib/python2.4/ConfigParser.py", line 511, in get
    raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'formatters'

Do you have any ideas?

Regards
Justin
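
A note on the traceback above: fileConfig is called with the relative
name "logging.conf", which is resolved against the current working
directory rather than the directory holding JobStream.py. The
ConfigParser read step silently skips files it cannot open, so a file
that is not found shows up as a missing 'formatters' section. Whether
that was the cause here is not confirmed in the thread. A minimal
sketch of one way to rule it out, written for Python 3, with the
hypothetical helper name resolve_beside_script:

```python
import os


def resolve_beside_script(script_path, filename):
    """Return an absolute path for filename next to script_path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, filename)
```

Inside a script this could be used as
logconf = resolve_beside_script(__file__, "logging.conf"), so the file
is found regardless of the directory the job is started from.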

On 1/29/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> It is up on the wiki at the following location.
>
> http://wiki.apache.org/nutch/Automating_Fetches_with_Python
>
> It has also been added to the front page.
>
> Dennis Kubes
>
> Andrzej Bialecki wrote:
> > Dennis Kubes wrote:
> >> We have a python script with logging which fully automates the
> >> fetching and updating process, not the invert links or the indexing
> >> process.  If anybody wants a copy, send me an email and I will send
> >> you a copy.
> >>
> >> We are currently working on a more in-depth framework for automating
> >> these types of job streams in python but that is not complete yet.
> >>
> >> Andrzej, do you think this is something we should post to the wiki?
> >
> > Sure, if it's ok for you to release it I'm sure many people would find
> > it useful.
> >
>


-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Fetcher threads & automation

Posted by Dennis Kubes <nu...@dragonflymc.com>.

It is up on the wiki at the following location.

http://wiki.apache.org/nutch/Automating_Fetches_with_Python

It has also been added to the front page.

Dennis Kubes

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> We have a python script with logging which fully automates the 
>> fetching and updating process, not the invert links or the indexing 
>> process.  If anybody wants a copy, send me an email and I will send 
>> you a copy.
>>
>> We are currently working on a more in-depth framework for automating 
>> these types of job streams in python but that is not complete yet.
>>
>> Andrzej, do you think this is something we should post to the wiki?
> 
> Sure, if it's ok for you to release it I'm sure many people would find 
> it useful.
>

Re: Fetcher threads & automation

Posted by Andrzej Bialecki <ab...@getopt.org>.

Dennis Kubes wrote:
> We have a python script with logging which fully automates the 
> fetching and updating process, not the invert links or the indexing 
> process.  If anybody wants a copy, send me an email and I will send 
> you a copy.
>
> We are currently working on a more in-depth framework for automating 
> these types of job streams in python but that is not complete yet.
>
> Andrzej, do you think this is something we should post to the wiki?

Sure, if it's ok for you to release it I'm sure many people would find 
it useful.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Fetcher threads & automation

Posted by Dennis Kubes <nu...@dragonflymc.com>.

We have a python script with logging which fully automates the fetching 
and updating process, not the invert links or the indexing process.  If 
anybody wants a copy, send me an email and I will send you a copy.

We are currently working on a more in-depth framework for automating 
these types of job streams in python but that is not complete yet.

Andrzej, do you think this is something we should post to the wiki?

Dennis Kubes
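
For orientation, one round of the fetching and updating process
described above can be sketched as follows. This is not the script
offered in this message. It is a minimal Python 3 sketch that assumes
the Nutch command line of that era (generate, fetch with a -threads
option, updatedb), segments stored on the local filesystem under
timestamp-named directories, and hypothetical function names. Check the
usage output of bin/nutch before relying on any argument shown here.

```python
import os
import subprocess


def newest_segment(segments_dir):
    """Return the most recent segment path (segment names are timestamps)."""
    names = sorted(os.listdir(segments_dir))
    if not names:
        raise RuntimeError("no segments found in %s" % segments_dir)
    return os.path.join(segments_dir, names[-1])


def cycle_commands(nutch_bin, crawldb, segment, threads):
    """Build the fetch and updatedb commands for one generated segment."""
    return [
        [nutch_bin, "fetch", segment, "-threads", str(threads)],
        [nutch_bin, "updatedb", crawldb, segment],
    ]


def run_cycle(nutch_bin, crawldb, segments_dir, threads,
              runner=subprocess.check_call):
    """Run one generate/fetch/updatedb round; raises if a step fails."""
    runner([nutch_bin, "generate", crawldb, segments_dir])
    segment = newest_segment(segments_dir)
    for command in cycle_commands(nutch_bin, crawldb, segment, threads):
        runner(command)
    return segment
```

The threads argument is a placeholder, not a recommendation for any
particular machine. The runner parameter exists so the command sequence
can be inspected without a Nutch installation.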

Justin Hartman wrote:
> Hi all
> 
> Just have a couple more questions which remain unclear to me at this stage.
> 
> 1. I'm fetching URLs on a P4 2.8 GHz machine with 1 GB RAM and a 100 Mbps
> connection. Based on this config, what would you recommend as the maximum
> number of fetcher threads?
> 
> 2. Does anyone know of a script or plugin that can automate the
> segment/fetch/indexing process? Basically, I'm fetching about 20
> million pages and I have to run the segment, fetch and index steps
> myself in a shell (which takes some time). I would really like some
> sort of shell script that runs the whole process as a daemon in the
> background, so that I can worry about other issues.
> 
> Thank you in advance!!!!