Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/08/04 16:17:14 UTC

[jira] Created: (NUTCH-339) Refactor nutch to allow fetcher improvements

Refactor nutch to allow fetcher improvements 
---------------------------------------------

                 Key: NUTCH-339
                 URL: http://issues.apache.org/jira/browse/NUTCH-339
             Project: Nutch
          Issue Type: Task
          Components: fetcher
    Affects Versions: 0.9
         Environment: n/a
            Reporter: Sami Siren
         Assigned To: Sami Siren


As I (and Stefan?) see it, there are two major areas where the current fetcher
could be improved (as in speed):

1. The politeness code, and how it is implemented, is the biggest
problem of the current fetcher (together with robots.txt handling).
Simple code changes, like replacing it with a PriorityQueue-based
solution, showed very promising results in increased IO.

2. Changing the fetcher to use non-blocking IO (this requires a great amount
of work, as we need to implement the protocols from scratch again).

I would like to start working towards #1 by first refactoring
the current code (the plugins, actually) in the following way:

1. Move robots.txt handling out of the lib-http plugin.
Even though it is related only to HTTP, leaving it in lib-http
does not allow other kinds of scheduling strategies to be implemented
(it is hardcoded to fetch robots.txt from the same thread when requesting
a page from a site from which it hasn't yet tried to load robots.txt).

2. Move the politeness code out of the lib-http plugin.
It is really usable outside HTTP, and the current design also limits
changing the implementation (to a queue-based one).

Where to move these? My suggestion is the Nutch core. Does anybody
see problems with this?

These refactoring activities are to be done in such a way that none
of the current functionality is (at least deliberately) changed. That leaves
room to build the next-generation fetcher(s) without destroying the old
one at the same time.
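
The PriorityQueue idea from item 1 can be illustrated with a minimal sketch.
It is not taken from any of the NUTCH-339 patches: the names
PolitenessQueueSketch, HostSlot and pollReadyHost are hypothetical, and a
fixed per-host delay is assumed. Each host carries the earliest time at which
it may be contacted again, and the queue keeps the host that is due soonest at
its head, so a fetcher thread can ask for whichever host is due instead of
waiting on one blocked host.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: hosts ordered by the next time they may be fetched. */
final class PolitenessQueueSketch {

    /** One entry per host; nextFetchMillis is the earliest allowed contact time. */
    static final class HostSlot implements Comparable<HostSlot> {
        final String host;
        long nextFetchMillis;

        HostSlot(String host, long nextFetchMillis) {
            this.host = host;
            this.nextFetchMillis = nextFetchMillis;
        }

        @Override
        public int compareTo(HostSlot other) {
            return Long.compare(nextFetchMillis, other.nextFetchMillis);
        }
    }

    private final PriorityQueue<HostSlot> queue = new PriorityQueue<>();
    private final long delayMillis; // assumed fixed politeness delay per host

    PolitenessQueueSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Registers a host that may be fetched from nowMillis onwards. */
    synchronized void addHost(String host, long nowMillis) {
        queue.add(new HostSlot(host, nowMillis));
    }

    /**
     * Returns a host that is due at nowMillis and reschedules it delayMillis
     * later, or returns null if no host is due yet.
     */
    synchronized String pollReadyHost(long nowMillis) {
        HostSlot head = queue.peek();
        if (head == null || head.nextFetchMillis > nowMillis) {
            return null;
        }
        queue.poll();                                   // remove before mutating the key
        head.nextFetchMillis = nowMillis + delayMillis;
        queue.add(head);                                // re-insert with the new due time
        return head.host;
    }
}
```

A caller that gets null back knows that every host is still inside its delay
window and can sleep until the head of the queue becomes due, rather than
spinning.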



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Andrzej Bialecki  updated NUTCH-339:
------------------------------------

    Attachment: patch2.txt

This patch compiles and runs. Tested very lightly with a short fetchlist - please review & test.

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt
>
>


[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425763 ] 
            
Andrzej Bialecki  commented on NUTCH-339:
-----------------------------------------

Great minds think alike ... ;) I started doing exactly this, and so far my patches seem to follow all requirements.

Here's my work-in-progress patch. Warning: not tested!

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.9
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>


[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433193 ] 
            
Andrzej Bialecki  commented on NUTCH-339:
-----------------------------------------

By all means, if you have spare CPU cycles please go forward ... You can probably reuse parts of my patch related to Protocol API changes and robots handling, which if I'm not mistaken implement #1 from your list.

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt
>
>


[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433185 ] 
            
Sami Siren commented on NUTCH-339:
----------------------------------

Andrzej,

are you still working on this, or should I proceed as I originally planned? ;)

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt
>
>


[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Doğacan Güney updated NUTCH-339:
--------------------------------

    Attachment: patch3.txt

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt
>
>


Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Uroš Gruber <ur...@sir-mag.com>.

e w wrote:
> What do you now set fetcher.threads.per.host to? Can you tell me what your
> generate.max.per.host value is as well?
>
<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>

<property>
  <name>generate.max.per.host</name>
  <value>400</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>

<property>
  <name>http.max.delays</name>
  <value>30</value>
</property>

> I got big improvements after setting:
>
> <property>
>  <name>fetcher.server.delay</name>
>  <value>0.5</value>
>  <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> even though I'm only generating 5 urls per host (generate.max.per.host=5). I
> don't know whether fetcher.server.delay also affects requests made through a
> proxy (anyone?) since I'm using a proxy.
>
> Also, I still can't see any logging output from the fetchers, i.e. which url
> is being requested, in any log file anywhere. I'm not so hot with Java, but
> can anyone here tell whether:
>
> log4j.threshhold=ALL
>
I set this:

log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG

so that I can see what is going on.

--
Uros
> in conf/log4j.properties should be "threshold" with 1 "h", or are 2 "h"s the
> Java way?
>
> And is there any reason why the lines in the function below are commented
> out:
>
>  public void configure(JobConf job) {
>    setConf(job);
>
>    this.segmentName = job.get(SEGMENT_NAME_KEY);
>    this.storingContent = isStoringContent(job);
>    this.parsing = isParsing(job);
>
> //    if (job.getBoolean("fetcher.verbose", false)) {
> //      LOG.setLevel(Level.FINE);
> //    }
>  }
>
> Is this parameter now read somewhere else?
>
> Any enlightenment always appreciated.
>
> -Ed
>

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by e w <ep...@gmail.com>.

What do you now set fetcher.threads.per.host to? Can you tell me what your
generate.max.per.host value is as well?

I got big improvements after setting:

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

even though I'm only generating 5 urls per host (generate.max.per.host=5). I
don't know whether fetcher.server.delay also affects requests made through a
proxy (anyone?) since I'm using a proxy.

Also, I still can't see any logging output from the fetchers, i.e. which url
is being requested, in any log file anywhere. I'm not so hot with Java, but
can anyone here tell whether:

log4j.threshhold=ALL

in conf/log4j.properties should be "threshold" with 1 "h", or are 2 "h"s the
Java way?

And is there any reason why the lines in the function below are commented
out:

  public void configure(JobConf job) {
    setConf(job);

    this.segmentName = job.get(SEGMENT_NAME_KEY);
    this.storingContent = isStoringContent(job);
    this.parsing = isParsing(job);

//    if (job.getBoolean("fetcher.verbose", false)) {
//      LOG.setLevel(Level.FINE);
//    }
  }

Is this parameter now read somewhere else?

Any enlightenment always appreciated.

-Ed

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Uroš Gruber <ur...@sir-mag.com>.

Sami Siren wrote:
>
>> I set DEBUG level logging and I've checked the time during operations, and
>> when doing the MapReduce job, which is run after every page, it takes 3-4
>> seconds till the next url is fetched.
>> I have a local site, and fetching 100 pages takes about 6 minutes.
>
> You are fetching a single site, yes? Then you can get more performance
> by tweaking the configuration of the fetcher.
>
> <property>
>  <name>fetcher.server.delay</name>
>  <value></value>
>  <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> <property>
>  <name>fetcher.threads.per.host</name>
>  <value></value>
>  <description>This number is the maximum number of threads that
>    should be allowed to access a host at one time.</description>
> </property>
>
Hi,

I've managed to test Nutch speed on several machines, with different OSes as
well.
It looks like fetcher.threads.per.host makes the fetcher run faster.

What I still don't understand is this:

When fetcher threads was set to the default value, the fetcher was doing a
MapReduce job after every url.
But now the job is run on about 400 urls, or maybe more.

--
Uros
> -- 
> Sami Siren

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Sami Siren <ss...@gmail.com>.

> I set DEBUG level logging and I've checked the time during operations, and
> when doing the MapReduce job, which is run after every page, it takes 3-4
> seconds till the next url is fetched.
> I have a local site, and fetching 100 pages takes about 6 minutes.

You are fetching a single site, yes? Then you can get more performance by
tweaking the configuration of the fetcher.

<property>
  <name>fetcher.server.delay</name>
  <value></value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value></value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>
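
To illustrate what a per-host thread limit such as fetcher.threads.per.host
means, here is a minimal sketch that uses one counting semaphore per host. It
is an illustration under assumptions, not Nutch's implementation:
PerHostLimiterSketch, tryAcquire and release are hypothetical names. With a
limit of 1, a crawl of a single site is effectively serialized no matter how
many fetcher threads exist, which is why raising the limit can help in that
case.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: at most maxThreadsPerHost concurrent fetches per host. */
final class PerHostLimiterSketch {
    private final int maxThreadsPerHost;
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    PerHostLimiterSketch(int maxThreadsPerHost) {
        this.maxThreadsPerHost = maxThreadsPerHost;
    }

    /** Returns true if the calling thread may fetch from host right now. */
    boolean tryAcquire(String host) {
        return permits
            .computeIfAbsent(host, h -> new Semaphore(maxThreadsPerHost))
            .tryAcquire();
    }

    /** Must be called once after each fetch that was admitted by tryAcquire. */
    void release(String host) {
        Semaphore semaphore = permits.get(host);
        if (semaphore != null) {
            semaphore.release();
        }
    }
}
```

A thread that is refused for one host can move on to a url on another host
instead of blocking.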

--
 Sami Siren

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Uroš Gruber <ur...@sir-mag.com>.

Sami Siren wrote:
> Uroš Gruber wrote:
>
>> Andrzej Bialecki wrote:
>>
>>> Sami Siren (JIRA) wrote:
>>>
>>>> I am not sure what you refer to by this 3-4 sec, but yes, I agree
>>>> there are more aspects to optimize in the fetcher. What I was first
>>>> concerned about was the fetching IO speed, which was getting
>>>> ridiculously low (not quite sure when this happened).
>>>>   
>>>
>>>
>> I set DEBUG level logging and I've checked the time during operations, and
>> when doing the MapReduce job, which is run after every page, it takes 3-4
>> seconds till the next url is fetched.
>> I have a local site, and fetching 100 pages takes about 6 minutes.
>
> Even I haven't seen it go that slow :)
>
Lucky me ;)
>>> Depending on the number of map/reduce tasks, there is a framework 
>>> overhead to transfer the job JAR
>>>
>> I would like to help find what causes such slowness. Version 0.7 did
>> not use MapReduce, and fetching ran at about 20 pages per second on
>> the same server. With the same site, fetching is now reduced to 0.3
>> pages per second.
>
> With the queue-based solution I just did a crawl of about 600k pages, and
> it averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you could
> try Andrzej's new Fetcher and see how it performs for you (I haven't
> yet read the code or tested it myself).
>
I'll try it, but first I need to test it on Java 1.4.2. Maybe the
problem is with the OS itself. I'll report back as soon as I have more tests.

regards

Uros

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Sami Siren <ss...@gmail.com>.

Uroš Gruber wrote:

> Andrzej Bialecki wrote:
>
>> Sami Siren (JIRA) wrote:
>>
>>> I am not sure what you refer to by this 3-4 sec, but yes, I agree
>>> there are more aspects to optimize in the fetcher. What I was first
>>> concerned about was the fetching IO speed, which was getting
>>> ridiculously low (not quite sure when this happened).
>>>   
>>
>>
> I set DEBUG level logging and I've checked the time during operations, and
> when doing the MapReduce job, which is run after every page, it takes 3-4
> seconds till the next url is fetched.
> I have a local site, and fetching 100 pages takes about 6 minutes.

Even I haven't seen it go that slow :)

>> Depending on the number of map/reduce tasks, there is a framework 
>> overhead to transfer the job JAR file, and start the subprocess on 
>> each tasktracker. However, once these are started the framework's 
>> overhead should be negligible, because a single task is responsible for
>> fetching many urls.
>>
>> Naturally, for small jobs, with very few urls, the overhead is
>> relatively large.
>>
>> The symptom I'm seeing is that eventually most threads end up in
>> blockAddr spin-waiting. Another problem I see is that when the number 
>> of fetching threads is high relative to the available bandwidth, the 
>> data is trickling in so slowly that the Fetcher.run() decides that 
>> it's hung, and aborts the task. What happens then is that the task 
>> gets a SUCCEEDED status in tasktracker, although in reality it may 
>> have fetched only a small portion of the allotted fetchlist.
>>
> I would like to help find what causes such slowness. Version 0.7 did
> not use MapReduce, and fetching ran at about 20 pages per second on
> the same server. With the same site, fetching is now reduced to 0.3
> pages per second.

With the queue-based solution I just did a crawl of about 600k pages, and it
averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you could try
Andrzej's new Fetcher and see how it performs for you (I haven't yet read
the code or tested it myself).

--
 Sami Siren

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Uroš Gruber <ur...@sir-mag.com>.

Andrzej Bialecki wrote:
> Sami Siren (JIRA) wrote:
>> I am not sure what you refer to by this 3-4 sec, but yes, I agree
>> there are more aspects to optimize in the fetcher. What I was first
>> concerned about was the fetching IO speed, which was getting ridiculously
>> low (not quite sure when this happened).
>>   
>
I set DEBUG level logging and I've checked the time during operations, and
when doing the MapReduce job, which is run after every page, it takes 3-4
seconds till the next url is fetched.

I have a local site, and fetching 100 pages takes about 6 minutes.
> Depending on the number of map/reduce tasks, there is a framework 
> overhead to transfer the job JAR file, and start the subprocess on 
> each tasktracker. However, once these are started the framework's 
> overhead should be negligible, because a single task is responsible for
> fetching many urls.
>
> Naturally, for small jobs, with very few urls, the overhead is
> relatively large.
>
> The symptom I'm seeing is that eventually most threads end up in
> blockAddr spin-waiting. Another problem I see is that when the number 
> of fetching threads is high relative to the available bandwidth, the 
> data is trickling in so slowly that the Fetcher.run() decides that 
> it's hung, and aborts the task. What happens then is that the task 
> gets a SUCCEEDED status in tasktracker, although in reality it may 
> have fetched only a small portion of the allotted fetchlist.
>
I would like to help find what causes such slowness. Version 0.7 did not
use MapReduce, and fetching ran at about 20 pages per second on the
same server. With the same site, fetching is now reduced to 0.3 pages per second.

Here is the log output:

2006-08-02 10:12:29,162 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:30,164 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:31,166 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:32,168 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:33,170 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 50 kb/s,
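
The figures in this message are consistent with one another, as a quick
calculation shows: 100 pages in about 6 minutes is roughly 0.28 pages/s, or
about 3.6 s per page. That matches both the "3-4 seconds" between urls and the
0.3 pages/s in the log, and it is about 72 times slower than the 20 pages/s
reported for 0.7. The helper names below are hypothetical and exist only for
this check.

```java
/** Back-of-the-envelope check of the throughput figures quoted in this message. */
final class FetchRateCheck {

    /** Pages per second for a crawl of pages pages that took seconds seconds. */
    static double pagesPerSecond(int pages, double seconds) {
        return pages / seconds;
    }

    /** Average seconds spent per page, the inverse of the rate. */
    static double secondsPerPage(int pages, double seconds) {
        return seconds / pages;
    }

    public static void main(String[] args) {
        double seconds = 6 * 60;                          // "about 6 minutes"
        double rate = pagesPerSecond(100, seconds);       // 100 pages from the local site
        double perPage = secondsPerPage(100, seconds);
        System.out.println("pages/s:  " + rate);
        System.out.println("s/page:   " + perPage);
        System.out.println("slowdown: " + (20.0 / rate)); // versus 20 pages/s on 0.7
    }
}
```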


>> We should open more than one ticket to track these separate aspects.
>> And for general discussion, the mailing lists are perhaps the best place.
>>   
> (I'm moving this to the list then).
>
>
regards

Uros

Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by Andrzej Bialecki <ab...@getopt.org>.

Sami Siren (JIRA) wrote:
> I am not sure what you refer to by this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher. What I was first concerned about was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened).
>   

Depending on the number of map/reduce tasks, there is a framework 
overhead to transfer the job JAR file, and start the subprocess on each 
tasktracker. However, once these are started the framework's overhead 
should be negligible, because a single task is responsible for fetching
many urls.

Naturally, for small jobs, with very few urls, the overhead is
relatively large.

The symptom I'm seeing is that eventually most threads end up in
blockAddr spin-waiting. Another problem I see is that when the number of 
fetching threads is high relative to the available bandwidth, the data 
is trickling in so slowly that the Fetcher.run() decides that it's hung, 
and aborts the task. What happens then is that the task gets a SUCCEEDED 
status in tasktracker, although in reality it may have fetched only a 
small portion of the allotted fetchlist.
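
The "decides that it's hung" behaviour can be pictured as a watchdog over a
last-progress timestamp. The sketch below is hypothetical and is not the
actual Fetcher.run() logic; ProgressWatchdogSketch and its methods are
invented names. It also shows one way to make an abort visible: throwing,
rather than returning normally, so that a partially fetched list is not
reported as a success.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a "no progress" watchdog for a multi-threaded fetch. */
final class ProgressWatchdogSketch {
    private final long timeoutMillis;
    private final AtomicLong lastProgressMillis;

    ProgressWatchdogSketch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = new AtomicLong(startMillis);
    }

    /** Fetcher threads call this whenever they finish a unit of work, e.g. one page. */
    void recordProgress(long nowMillis) {
        lastProgressMillis.set(nowMillis);
    }

    /** True if no progress has been recorded for longer than the timeout. */
    boolean looksHung(long nowMillis) {
        return nowMillis - lastProgressMillis.get() > timeoutMillis;
    }

    /**
     * Fails loudly when the fetch looks hung and is incomplete, so the caller
     * cannot mistake a partial fetch for a finished one.
     */
    void checkOrFail(long nowMillis, int fetched, int total) {
        if (looksHung(nowMillis) && fetched < total) {
            throw new IllegalStateException("No progress for over " + timeoutMillis
                + " ms; fetched " + fetched + " of " + total + " urls");
        }
    }
}
```

With many threads sharing little bandwidth, whole pages complete rarely, so a
per-page progress signal like this one can trip even though bytes are still
arriving.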

> We should open more than one ticket to track these separate aspects. And for general discussion, the mailing lists are perhaps the best place.
>   
(I'm moving this to the list then).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425782 ] 
            
Sami Siren commented on NUTCH-339:
----------------------------------

I am not sure what you are referring to by this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher. What I was first concerned about was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened).

We should open more than one ticket to track these separate aspects. And for general discussion the mailing lists are perhaps the best place.




> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.9
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>         Attachments: patch.txt
>
>
> As I (and Stefan?) see it, there are two major areas where the current fetcher could be
> improved (in terms of speed):
> 1. The politeness code and how it is implemented is the biggest
> problem of the current fetcher (together with robots.txt handling).
> Simple code changes, like replacing it with a PriorityQueue-based
> solution, showed very promising results in increased IO.
> 2. Changing the fetcher to use non-blocking IO (this requires a great amount
> of work, as we need to implement the protocols from scratch again).
> I would like to start working towards #1 by first refactoring
> the current code (plugins, actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
> Even if this is related only to http, leaving it in lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from which it hasn't tried to load robots.txt).
> 2. Move the code for politeness away from the (lib-http) plugin.
> It is really usable outside http, and the current design also limits
> changing of the implementation (to queue based).
> Where to move these? My suggestion is the nutch core; does anybody
> see problems with this?
> These code refactoring activities are to be done in a way that none
> of the current functionality is (at least deliberately) changed, leaving
> current functionality as is and thus leaving room and possibility to build
> the next generation fetcher(s) without destroying the old one at the same time.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Uros Gruber (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425777 ] 
            
Uros Gruber commented on NUTCH-339:
-----------------------------------

I checked my logs and see that the main speed issue with 0.8 is actually the MapReduce work. It takes about 3-4 seconds for one page. Fetching is done in 20, maybe 30, milliseconds.

I don't know if this is the right place to talk about this.



[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433354 ] 
            
Doğacan Güney commented on NUTCH-339:
-------------------------------------

I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority queue and cleanExpiredServerBlocks should block threads a lot less. I am attaching this as patch3.txt.
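
As a rough illustration of the idea (a hypothetical sketch, not the code in patch3.txt), blocked addresses can be kept in a priority queue ordered by the time their block expires. Cleaning up expired blocks then only needs to look at the head of the queue instead of scanning every blocked address. The class name BlockedAddrQueueSketch and the methods block, cleanExpired and millisUntilNextExpiry are assumptions made for this example; only BLOCKED_ADDR_QUEUE and cleanExpiredServerBlocks are names from the patch discussion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not the code in patch3.txt: blocked servers are kept in
// a priority queue ordered by the time at which their block expires.
class BlockedAddrQueueSketch {

  // One blocked address together with its expiry time in milliseconds.
  private static final class Block {
    final String addr;
    final long expiresAt;

    Block(String addr, long expiresAt) {
      this.addr = addr;
      this.expiresAt = expiresAt;
    }
  }

  // The head of the queue is always the block that expires first.
  private final PriorityQueue<Block> queue =
      new PriorityQueue<>((a, b) -> Long.compare(a.expiresAt, b.expiresAt));

  // Record that addr must not be contacted again before expiresAt.
  synchronized void block(String addr, long expiresAt) {
    queue.add(new Block(addr, expiresAt));
  }

  // Remove and return every address whose block has expired at time now.
  // Only the head is inspected, so the loop stops at the first block that
  // is still active instead of scanning all blocked addresses.
  synchronized List<String> cleanExpired(long now) {
    List<String> released = new ArrayList<>();
    while (!queue.isEmpty() && queue.peek().expiresAt <= now) {
      released.add(queue.poll().addr);
    }
    return released;
  }

  // Milliseconds until the next block expires, or -1 if nothing is blocked.
  // A waiting thread could sleep this long rather than spin.
  synchronized long millisUntilNextExpiry(long now) {
    Block head = queue.peek();
    return head == null ? -1L : Math.max(0L, head.expiresAt - now);
  }
}
```

With this layout the lock is held only while the already expired entries are popped, which is one plausible reason the cleanup would block threads less than a full scan would. Whether the actual patch works this way is not stated in this thread.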

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt

[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Sami Siren updated NUTCH-339:
-----------------------------

        Fix Version/s: 0.9.0
    Affects Version/s: 0.8
                           (was: 0.9.0)

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt

[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Andrzej Bialecki  updated NUTCH-339:
------------------------------------

    Attachment: patch.txt

Work-in-progress patch containing new Fetcher2, and supporting changes in Protocol API.
