Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/08/04 16:17:14 UTC
[jira] Created: (NUTCH-339) Refactor nutch to allow fetcher improvements
Refactor nutch to allow fetcher improvements
---------------------------------------------
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.9
Environment: n/a
Reporter: Sami Siren
Assigned To: Sami Siren
As I (and Stefan?) see it, there are two major areas where the current fetcher
could be improved (as in speed):
1. Politeness code and how it is implemented is the biggest
problem of the current fetcher (together with robots.txt handling).
Simple code changes, like replacing it with a PriorityQueue-based
solution, showed very promising results in increased IO.
2. Changing the fetcher to use non-blocking IO (this requires a great
amount of work, as we need to implement the protocols from scratch again).
I would like to start working towards #1 by first refactoring
the current code (the plugins, actually) in the following way:
1. Move robots.txt handling away from the (lib-http) plugin.
Even if this is related only to http, leaving it in lib-http
does not allow other kinds of scheduling strategies to be implemented
(it is hardcoded to fetch robots.txt from the same thread when requesting
a page from a site from which it hasn't tried to load robots.txt).
2. Move the code for politeness away from the (lib-http) plugin.
It is usable outside http as well, and the current design also limits
changing the implementation (to a queue-based one).
Where to move these? Well, my suggestion is the nutch core; does anybody
see problems with this?
These code refactoring activities are to be done in a way that none
of the current functionality is (at least deliberately) changed,
thus leaving room and the possibility to build the next generation
fetcher(s) without destroying the old one at the same time.
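The PriorityQueue-based politeness idea in #1 could be sketched roughly as follows. This is a hypothetical illustration only, not the attached patch; the class name, field names, and behavior are all invented for the sketch. Each host carries a "next allowed fetch time", and the queue always yields the host that becomes fetchable soonest, instead of blocking a fetcher thread per host:

```java
import java.util.PriorityQueue;

// Sketch of a queue-based politeness scheduler (names invented):
// the queue is ordered by the earliest time each host may be hit again.
public class PoliteScheduler {

    static class HostEntry implements Comparable<HostEntry> {
        final String host;
        long nextFetchTime;   // earliest time (ms) we may hit this host again

        HostEntry(String host, long nextFetchTime) {
            this.host = host;
            this.nextFetchTime = nextFetchTime;
        }

        public int compareTo(HostEntry other) {
            return Long.compare(nextFetchTime, other.nextFetchTime);
        }
    }

    private final PriorityQueue<HostEntry> queue = new PriorityQueue<HostEntry>();
    private final long serverDelayMs;   // cf. fetcher.server.delay

    public PoliteScheduler(long serverDelayMs) {
        this.serverDelayMs = serverDelayMs;
    }

    public synchronized void addHost(String host) {
        queue.add(new HostEntry(host, 0L));
    }

    // Returns a host that may be fetched now, or null if every host
    // is still inside its politeness delay (no thread has to block).
    public synchronized String nextFetchableHost(long now) {
        HostEntry head = queue.peek();
        if (head == null || head.nextFetchTime > now) {
            return null;                               // nothing is due yet
        }
        queue.poll();
        head.nextFetchTime = now + serverDelayMs;      // re-schedule after delay
        queue.add(head);
        return head.host;
    }
}
```

The point of the structure is that politeness becomes a scheduling decision made once, at the queue head, rather than a per-thread sleep inside the http plugin.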
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]
Andrzej Bialecki updated NUTCH-339:
------------------------------------
Attachment: patch2.txt
This patch compiles and runs. Tested very lightly with a short fetchlist - please review & test.
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.8
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: patch.txt, patch2.txt
>
>
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425763 ]
Andrzej Bialecki commented on NUTCH-339:
-----------------------------------------
Great minds think alike ... ;) I started doing exactly this, and so far my patches seem to follow all requirements.
Here's my work-in-progress patch. Warning: not tested!
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.9
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
>
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433193 ]
Andrzej Bialecki commented on NUTCH-339:
-----------------------------------------
By all means, if you have spare CPU cycles, please go ahead ... You can probably reuse the parts of my patch related to the Protocol API changes and robots handling, which, if I'm not mistaken, implement #1 from your list.
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.8
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: patch.txt, patch2.txt
>
>
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433185 ]
Sami Siren commented on NUTCH-339:
----------------------------------
Andrzej,
are you still working on this, or should I proceed as I originally planned? ;)
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.8
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: patch.txt, patch2.txt
>
>
[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]
Doğacan Güney updated NUTCH-339:
--------------------------------
Attachment: patch3.txt
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.8
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: patch.txt, patch2.txt, patch3.txt
>
>
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Uroš Gruber <ur...@sir-mag.com>.
e w wrote:
> What do you now set fetcher.threads.per.host to? Can you tell me what
> your
> generate.max.per.host value is as well?
>
<property>
<name>fetcher.server.delay</name>
<value>0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>400</value>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>10</value>
</property>
<property>
<name>http.max.delays</name>
<value>30</value>
</property>
> I got big improvements after setting:
>
> <property>
> <name>fetcher.server.delay</name>
> <value>0.5</value>
> <description>The number of seconds the fetcher will delay between
> successive requests to the same server.</description>
> </property>
>
> even though I'm only generating 5 urls per host
> (generate.max.per.host=5). I
> don't know whether fetcher.server.delay also affects requests made
> through a
> proxy (anyone?) since I'm using a proxy.
>
> Also, I still can't see any logging output from the fetchers i.e. what
> url
> is being requested in any log file anywhere. I'm not so hot with java but
> can anyone here tell whether:
>
> log4j.threshhold=ALL
>
I set these:
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
so that I can see what is going on.
--
Uros
> in conf/log4j.properties should be spelled threshold, with one "h", or if
> two "h"s are the java way?
>
> And is there any reason why the lines in the function below are commented
> out:
>
> public void configure(JobConf job) {
> setConf(job);
>
> this.segmentName = job.get(SEGMENT_NAME_KEY);
> this.storingContent = isStoringContent(job);
> this.parsing = isParsing(job);
>
> // if (job.getBoolean("fetcher.verbose", false)) {
> // LOG.setLevel(Level.FINE);
> // }
> }
>
> Is this parameter now read somewhere else?
>
> Any enlightenment always appreciated.
>
> -Ed
>
> On 8/9/06, Uroš Gruber <ur...@sir-mag.com> wrote:
>>
>> Sami Siren wrote:
>> >
>> >> I set DEBUG level logging and I've checked the time during operations,
>> >> and when doing the MapReduce job which is run after every page it takes
>> >> 3-4 seconds till the next url is fetched.
>> >> I have some local site, and fetching 100 pages takes about 6 minutes.
>> >
>> > You are fetching a single site, yes? Then you can get more performance
>> > by tweaking the configuration of the fetcher.
>> >
>> > <property>
>> > <name>fetcher.server.delay</name>
>> > <value></value>
>> > <description>The number of seconds the fetcher will delay between
>> > successive requests to the same server.</description>
>> > </property>
>> >
>> > <property>
>> > <name>fetcher.threads.per.host</name>
>> > <value></value>
>> > <description>This number is the maximum number of threads that
>> > should be allowed to access a host at one time.</description>
>> > </property>
>> >
>> Hi,
>>
>> I've managed to test nutch speed on several machines, with different OSes
>> as well.
>> It looks like fetcher.threads.per.host makes the fetcher run faster.
>>
>> What I still don't understand is this.
>>
>> When fetcher threads was set to the default value, the fetcher was doing
>> a mapreduce after every url.
>> But now the job is run on about 400 urls or maybe more.
>>
>> --
>> Uros
>> > --
>> > Sami Siren
>>
>>
>
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by e w <ep...@gmail.com>.
What do you now set fetcher.threads.per.host to? Can you tell me what your
generate.max.per.host value is as well?
I got big improvements after setting:
<property>
<name>fetcher.server.delay</name>
<value>0.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
even though I'm only generating 5 urls per host (generate.max.per.host=5). I
don't know whether fetcher.server.delay also affects requests made through a
proxy (anyone?) since I'm using a proxy.
Also, I still can't see any logging output from the fetchers i.e. what url
is being requested in any log file anywhere. I'm not so hot with java but
can anyone here tell whether:
log4j.threshhold=ALL
in conf/log4j.properties should be spelled threshold, with one "h", or if
two "h"s are the java way?
And is there any reason why the lines in the function below are commented
out:
public void configure(JobConf job) {
setConf(job);
this.segmentName = job.get(SEGMENT_NAME_KEY);
this.storingContent = isStoringContent(job);
this.parsing = isParsing(job);
// if (job.getBoolean("fetcher.verbose", false)) {
// LOG.setLevel(Level.FINE);
// }
}
Is this parameter now read somewhere else?
Any enlightenment always appreciated.
-Ed
On 8/9/06, Uroš Gruber <ur...@sir-mag.com> wrote:
>
> Sami Siren wrote:
> >
> >> I set DEBUG level logging and I've checked the time during operations,
> >> and when doing the MapReduce job which is run after every page it takes
> >> 3-4 seconds till the next url is fetched.
> >> I have some local site, and fetching 100 pages takes about 6 minutes.
> >
> > You are fetching a single site, yes? Then you can get more performance
> > by tweaking the configuration of the fetcher.
> >
> > <property>
> > <name>fetcher.server.delay</name>
> > <value></value>
> > <description>The number of seconds the fetcher will delay between
> > successive requests to the same server.</description>
> > </property>
> >
> > <property>
> > <name>fetcher.threads.per.host</name>
> > <value></value>
> > <description>This number is the maximum number of threads that
> > should be allowed to access a host at one time.</description>
> > </property>
> >
> Hi,
>
> I've managed to test nutch speed on several machines, with different OSes
> as well.
> It looks like fetcher.threads.per.host makes the fetcher run faster.
>
> What I still don't understand is this.
>
> When fetcher threads was set to the default value, the fetcher was doing
> a mapreduce after every url.
> But now the job is run on about 400 urls or maybe more.
>
> --
> Uros
> > --
> > Sami Siren
>
>
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Uroš Gruber <ur...@sir-mag.com>.
Sami Siren wrote:
>
>> I set DEBUG level logging and I've checked the time during operations,
>> and when doing the MapReduce job which is run after every page it takes
>> 3-4 seconds till the next url is fetched.
>> I have some local site, and fetching 100 pages takes about 6 minutes.
>
> You are fetching a single site, yes? Then you can get more performance
> by tweaking the configuration of the fetcher.
>
> <property>
> <name>fetcher.server.delay</name>
> <value></value>
> <description>The number of seconds the fetcher will delay between
> successive requests to the same server.</description>
> </property>
>
> <property>
> <name>fetcher.threads.per.host</name>
> <value></value>
> <description>This number is the maximum number of threads that
> should be allowed to access a host at one time.</description>
> </property>
>
Hi,
I've managed to test nutch speed on several machines, with different OSes
as well.
It looks like fetcher.threads.per.host makes the fetcher run faster.
What I still don't understand is this.
When fetcher threads was set to the default value, the fetcher was doing
a mapreduce after every url.
But now the job is run on about 400 urls or maybe more.
--
Uros
> --
> Sami Siren
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Sami Siren <ss...@gmail.com>.
> I set DEBUG level logging and I've checked the time during operations,
> and when doing the MapReduce job which is run after every page it takes
> 3-4 seconds till the next url is fetched.
> I have some local site, and fetching 100 pages takes about 6 minutes.
You are fetching a single site, yes? Then you can get more performance by
tweaking the configuration of the fetcher.
<property>
<name>fetcher.server.delay</name>
<value></value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value></value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
--
Sami Siren
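Putting the two suggested properties together: a single-site crawl might override them in conf/nutch-site.xml along these lines. The values 1.0 and 2 here are illustrative assumptions for the sketch, not recommendations from this thread:

```xml
<!-- Illustrative overrides in conf/nutch-site.xml; values are assumptions -->
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a host at one time.</description>
</property>
```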
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Uroš Gruber <ur...@sir-mag.com>.
Sami Siren wrote:
> Uroš Gruber wrote:
>
>> Andrzej Bialecki wrote:
>>
>>> Sami Siren (JIRA) wrote:
>>>
>>>> I am not sure what you refer to by this 3-4 sec, but yes, I agree
>>>> there are more aspects to optimize in the fetcher; what I was firstly
>>>> concerned with was the fetching IO speed, which was getting ridiculously
>>>> low (not quite sure when this happened).
>>>>
>>>
>>>
>> I set DEBUG level logging and I've checked the time during operations,
>> and when doing the MapReduce job which is run after every page it takes
>> 3-4 seconds till the next url is fetched.
>> I have some local site, and fetching 100 pages takes about 6 minutes.
>
> Even I haven't seen it go that slow :)
>
Lucky me ;)
>>> Depending on the number of map/reduce tasks, there is a framework
>>> overhead to transfer the job JAR
>>>
>> I would like to help find what causes such slowness. Version 0.7 did
>> not use MapReduce, and fetching was done at about 20 pages per second on
>> the same server. With the same site, fetching is reduced to 0.3 pages per
>> second.
>
> With the queue-based solution I just did a crawl of about 600k pages and
> it averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you could
> try Andrzej's new Fetcher and see how it performs for you (I haven't
> yet read the code or tested it myself).
>
I'll try it, but first I need to test it on java 1.4.2. Maybe the problem
is with the OS itself. I'll report back as soon as I have more tests.
regards
Uros
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Sami Siren <ss...@gmail.com>.
Uroš Gruber wrote:
> Andrzej Bialecki wrote:
>
>> Sami Siren (JIRA) wrote:
>>
>>> I am not sure what you refer to by this 3-4 sec, but yes, I agree
>>> there are more aspects to optimize in the fetcher; what I was firstly
>>> concerned with was the fetching IO speed, which was getting ridiculously
>>> low (not quite sure when this happened).
>>>
>>
>>
> I set DEBUG level logging and I've checked the time during operations,
> and when doing the MapReduce job which is run after every page it takes
> 3-4 seconds till the next url is fetched.
> I have some local site, and fetching 100 pages takes about 6 minutes.
Even I haven't seen it go that slow :)
>> Depending on the number of map/reduce tasks, there is a framework
>> overhead to transfer the job JAR file, and start the subprocess on
>> each tasktracker. However, once these are started the framework's
>> overhead should be negligible, because a single task is responsible for
>> fetching many urls.
>>
>> Naturally, for small jobs, with very few urls, the overhead is
>> relatively large.
>>
>> The symptoms I'm seeing are that eventually most threads end up in
>> blockAddr spin-waiting. Another problem I see is that when the number
>> of fetching threads is high relative to the available bandwidth, the
>> data is trickling in so slowly that the Fetcher.run() decides that
>> it's hung, and aborts the task. What happens then is that the task
>> gets a SUCCEEDED status in tasktracker, although in reality it may
>> have fetched only a small portion of the allotted fetchlist.
>>
> I would like to help find what causes such slowness. Version 0.7 did
> not use MapReduce, and fetching was done at about 20 pages per second on
> the same server. With the same site, fetching is reduced to 0.3 pages
> per second.
With the queue-based solution I just did a crawl of about 600k pages and it
averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you could try
Andrzej's new Fetcher and see how it performs for you (I haven't yet read
the code or tested it myself).
--
Sami Siren
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Uroš Gruber <ur...@sir-mag.com>.
Andrzej Bialecki wrote:
> Sami Siren (JIRA) wrote:
>> I am not sure what you refer to by this 3-4 sec, but yes, I agree
>> there are more aspects to optimize in the fetcher; what I was firstly
>> concerned with was the fetching IO speed, which was getting ridiculously
>> low (not quite sure when this happened).
>>
>
I set DEBUG level logging and I've checked the time during operations,
and when doing the MapReduce job which is run after every page it takes
3-4 seconds till the next url is fetched.
I have some local site, and fetching 100 pages takes about 6 minutes.
> Depending on the number of map/reduce tasks, there is a framework
> overhead to transfer the job JAR file, and start the subprocess on
> each tasktracker. However, once these are started the framework's
> overhead should be negligible, because a single task is responsible for
> fetching many urls.
>
> Naturally, for small jobs, with very few urls, the overhead is
> relatively large.
>
> The symptoms I'm seeing are that eventually most threads end up in
> blockAddr spin-waiting. Another problem I see is that when the number
> of fetching threads is high relative to the available bandwidth, the
> data is trickling in so slowly that the Fetcher.run() decides that
> it's hung, and aborts the task. What happens then is that the task
> gets a SUCCEEDED status in tasktracker, although in reality it may
> have fetched only a small portion of the allotted fetchlist.
>
I would like to help find what causes such slowness. Version 0.7 did not
use MapReduce, and fetching was done at about 20 pages per second on the
same server. With the same site, fetching is reduced to 0.3 pages per second.
here is log msg
2006-08-02 10:12:29,162 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:30,164 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:31,166 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:32,168 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:33,170 INFO mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 pages/s, 50 kb/s,
>> We should open more than one ticket to track these separate aspects.
>> And for general discussion the mailing lists are perhaps the best place.
>>
> (I'm moving this to the list then).
>
>
regards
Uros
Re: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren (JIRA) wrote:
> I am not sure what you refer to by this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher; what I was firstly concerned with was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened).
>
Depending on the number of map/reduce tasks, there is a framework
overhead to transfer the job JAR file, and start the subprocess on each
tasktracker. However, once these are started the framework's overhead
should be negligible, because a single task is responsible for fetching
many urls.
Naturally, for small jobs, with very few urls, the overhead is
relatively large.
The symptoms I'm seeing are that eventually most threads end up in
blockAddr spin-waiting. Another problem I see is that when the number of
fetching threads is high relative to the available bandwidth, the data
is trickling in so slowly that the Fetcher.run() decides that it's hung,
and aborts the task. What happens then is that the task gets a SUCCEEDED
status in tasktracker, although in reality it may have fetched only a
small portion of the allotted fetchlist.
> We should open more than one ticket to track these separate aspects. And for general discussion the mailing lists are perhaps the best place.
>
(I'm moving this to the list then).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
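The hang behaviour Andrzej describes can be illustrated with a very simplified progress watchdog. This is a sketch with invented names, not the actual Fetcher.run() code: if the page counter stops moving for longer than a threshold, the loop assumes the fetcher threads are hung and aborts, which is why the task can report SUCCEEDED after fetching only part of its fetchlist:

```java
// Simplified sketch of a fetcher hang check (invented names, not the real
// org.apache.nutch.fetcher.Fetcher code): if the page counter has not moved
// for longer than the threshold, assume the fetch threads are hung.
public class HangDetector {
    private final long thresholdMs;   // how long we tolerate zero progress
    private long lastProgressTime;
    private long lastPageCount;

    public HangDetector(long thresholdMs, long startTime) {
        this.thresholdMs = thresholdMs;
        this.lastProgressTime = startTime;
        this.lastPageCount = 0;
    }

    // Called periodically from the main loop with the current totals.
    public boolean isHung(long pagesFetched, long now) {
        if (pagesFetched > lastPageCount) {
            lastPageCount = pagesFetched;   // progress was made; reset timer
            lastProgressTime = now;
            return false;
        }
        return (now - lastProgressTime) > thresholdMs;
    }
}
```

When bandwidth is low relative to the thread count, pages complete so rarely that such a check fires even though the threads are merely slow, which matches the aborted-but-SUCCEEDED symptom described above.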
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425782 ]
Sami Siren commented on NUTCH-339:
----------------------------------
I am not sure what you refer to by this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher; what I was firstly concerned with was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened).
We should open more than one ticket to track these separate aspects. And for general discussion the mailing lists are perhaps the best place.
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.9
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Attachments: patch.txt
>
>
> As I (and Stefan?) see it, there are two major areas where the current fetcher could be
> improved (as in speed):
> 1. The politeness code and how it is implemented is the biggest
> problem of the current fetcher (together with robots.txt handling).
> Simple code changes, like replacing it with a PriorityQueue-based
> solution, showed very promising results in increased IO.
> 2. Changing the fetcher to use non-blocking IO (this requires a great amount
> of work, as we need to implement the protocols from scratch again).
> I would like to start working towards #1 by first refactoring
> the current code (the plugins, actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
> Even though it is related only to http, leaving it in lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from which it hasn't yet tried to load robots.txt).
> 2. Move the politeness code away from the (lib-http) plugin.
> It is really usable outside http, and the current design also limits
> changing the implementation (to queue-based).
> Where to move these? Well, my suggestion is the nutch core; does anybody
> see problems with this?
> These refactoring activities are to be done in a way that none
> of the current functionality is (at least deliberately) changed, leaving
> current functionality as is, thus leaving room and possibility to build
> the next generation fetcher(s) without destroying the old one at the same time.
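The queue-based politeness scheduling proposed in point 1 could be sketched roughly as follows: each host carries the earliest time it may be fetched again, and the fetcher polls the head of a priority queue instead of each thread sleeping on its own. All names here are illustrative assumptions, not taken from the actual patch.

```java
import java.util.PriorityQueue;

// Rough sketch of PriorityQueue-based politeness scheduling.
// Names are illustrative, not from the Nutch patch.
public class PoliteScheduler {
    static class HostSlot implements Comparable<HostSlot> {
        final long nextFetchTime; // earliest time this host may be fetched
        final String host;
        HostSlot(long t, String h) { nextFetchTime = t; host = h; }
        public int compareTo(HostSlot o) {
            return Long.compare(nextFetchTime, o.nextFetchTime);
        }
    }

    private final PriorityQueue<HostSlot> ready = new PriorityQueue<>();
    private final long crawlDelayMs;

    public PoliteScheduler(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    public synchronized void add(String host) {
        ready.add(new HostSlot(0L, host)); // fetchable immediately
    }

    // Hand out the next host whose politeness delay has expired,
    // or null if no host is ready yet.
    public synchronized String next() {
        HostSlot head = ready.peek();
        if (head == null || head.nextFetchTime > System.currentTimeMillis()) {
            return null;
        }
        ready.poll();
        // re-queue the host with its politeness delay applied
        ready.add(new HostSlot(System.currentTimeMillis() + crawlDelayMs, head.host));
        return head.host;
    }
}
```

The key property is that fetcher threads never sleep per-host; they just ask the queue for the next ready host, so the delay logic lives in one place and can be swapped out.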
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Uros Gruber (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425777 ]
Uros Gruber commented on NUTCH-339:
-----------------------------------
I checked my logs and saw that the main speed issue with 0.8 is actually the MapReduce work. It takes about 3-4 seconds per page, while the fetching itself is done in maybe 20-30 milliseconds.
I don't know if this is the right place to talk about this.
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433354 ]
Doğacan Güney commented on NUTCH-339:
-------------------------------------
I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority queue and cleanExpiredServerBlocks should block threads a lot less. I am attaching this as patch3.txt.
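The idea described above — keeping blocked server addresses in a queue ordered by unblock time, so that expiring them only inspects the head instead of scanning every entry — can be sketched like this. The class and method names are hypothetical (modeled loosely on the BLOCKED_ADDR_QUEUE and cleanExpiredServerBlocks mentioned above), not the actual patch3.txt code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: blocked addresses ordered by unblock time,
// smallest first, so expiry only touches the head of the queue.
public class BlockedAddrQueue {
    static class Entry implements Comparable<Entry> {
        final long unblockTime;
        final String address;
        Entry(long unblockTime, String address) {
            this.unblockTime = unblockTime;
            this.address = address;
        }
        public int compareTo(Entry other) {
            return Long.compare(unblockTime, other.unblockTime);
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>();

    public synchronized void block(String address, long delayMs) {
        queue.add(new Entry(System.currentTimeMillis() + delayMs, address));
    }

    // Pop every address whose delay has expired; stops at the first
    // still-blocked entry, so the cost is proportional to the number
    // of expirations, not the queue size.
    public synchronized List<String> cleanExpiredServerBlocks() {
        List<String> expired = new ArrayList<>();
        long now = System.currentTimeMillis();
        while (!queue.isEmpty() && queue.peek().unblockTime <= now) {
            expired.add(queue.poll().address);
        }
        return expired;
    }
}
```

With an unordered set, every cleanup pass has to examine all blocked addresses; with the priority queue the pass stops as soon as it hits an entry that is still blocked.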
[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]
Sami Siren updated NUTCH-339:
-----------------------------
Fix Version/s: 0.9.0
Affects Version/s: 0.8
(was: 0.9.0)
[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]
Andrzej Bialecki updated NUTCH-339:
------------------------------------
Attachment: patch.txt
Work-in-progress patch containing the new Fetcher2 and supporting changes in the Protocol API.