Posted to user@nutch.apache.org by Ferdy Galema <fe...@kalooga.com> on 2011/09/01 14:53:29 UTC

Re: Parse reduce slow as a snail

Just like to add that we too have seen extremely slow tasks as a result of 
ridiculously long URLs. Adding a URL filter that filters out URLs longer 
than 2000 characters (or something like that) is pretty much mandatory for 
any serious internet crawl.
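
For illustration, a minimal URLFilter plugin along those lines might look
roughly like the sketch below. The class name and the 2000-character
threshold are just examples, not something prescribed by Nutch.

// Hypothetical filter: reject URLs longer than a fixed limit.
// Returning null from filter() tells Nutch to drop the URL.
package org.example.urlfilter;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class MaxLengthURLFilter implements URLFilter {

  private static final int MAX_URL_LENGTH = 2000; // assumed threshold
  private Configuration conf;

  public String filter(String urlString) {
    if (urlString == null || urlString.length() > MAX_URL_LENGTH) {
      return null; // reject over-long (or missing) URLs
    }
    return urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

A similar effect can often be had without a custom plugin by adding a reject
rule for very long URLs to regex-urlfilter.txt (assuming the regex URL filter
is enabled), e.g. something like -.{2001,}.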

On 08/31/2011 11:49 AM, Markus Jelsma wrote:
> Thanks for shedding some light. I was already looking for filters/normalizers
> in the step but couldn't find it. I forgot to think about the job's output
> format. Makes sense indeed.
>
> Cheers
>
> On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
>> Hi Markus,
>>
>> You are right in thinking that the reduce step does not do much in itself.
>> It is not so much the reduce step that is likely to be the source of your
>> problem but the URL filtering / normalising within ParseOutputFormat.
>> Basically we get outlinks as a result of the parse and when writing the
>> output to HDFS we need to filter / normalise them.
>>
>> I have seen problems on large crawls with ridiculously large URLs which put
>> the normalisation in disarray with the symptoms you described. You can add
>> a trace in the log before normalising to see what the URLs look like and
>> add a custom normaliser which prevents large URLs from being processed.
>>
>> As usual jstack is your friend and will confirm that this is where the
>> problem is.
>>
>> HTH
>>
>> Julien
>>
>> On 30 August 2011 23:39, Markus Jelsma <ma...@openindex.io> wrote:
>>> I should add that I sometimes see a URL filter exception written to the
>>> reduce log. I don't understand why this is the case; all the
>>> ParseSegment.reduce() code does is collect key/value data.
>>>
>>> I should also point out that most reducers finish in reasonable time and
>>> it's always one task stalling the job to excessive lengths. The cluster is
>>> homogeneous, this is not an assumption (I know the fallacies of distributed
>>> computing ;) ). A server stalling the process is identical to all the others
>>> and the replication factor is only 2 for all files except the crawl db.
>>>
>>> Please enlighten me.
>>>
>>>> Hi,
>>>>
>>>> Any idea why the reducer of the parse job is as slow as a snail taking
>>>> a detour? There is no processing in the reducer; all it does is copy the
>>>> keys and values.
>>>>
>>>> The reduce step (meaning the last 33% of the reducer) is even slower
>>>> than the whole parsing done in the mapper! It is even slower than the
>>>> whole fetch job while it is the fetcher that produces the most output
>>>> (high I/O).
>>>>
>>>> A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total
>>>> amount) while the reducer has 7 times less data to write and no
>>>> processing! Yet it takes about 3 times longer to complete, stunning
>>>> figures!
>>>>
>>>> This excessive run time only became apparent when I significantly
>>>> increased the number of URLs to generate (topN). When the topN was
>>>> lower, the difference between run times of the fetch and parse jobs
>>>> was a lot smaller; usually it was the fetcher being slow because of
>>>> merging the spills.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks
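
(As an aside on Julien's point about ParseOutputFormat: conceptually, the
per-outlink work done while the parse output is written looks roughly like
the sketch below. The class and method names are made up and the loop is a
paraphrase of the behaviour described above, not code copied from Nutch.)

// Conceptual sketch: each outlink is normalised and filtered before it is
// written out, so a burst of pathological URLs lands in a single reduce task.
package org.example;

import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.parse.Outlink;

public class OutlinkFilteringSketch {

  public static int countSurvivingOutlinks(Outlink[] outlinks,
      URLNormalizers normalizers, URLFilters filters) {
    int kept = 0;
    for (Outlink outlink : outlinks) {
      String toUrl = outlink.getToUrl();
      try {
        // This is where a ridiculously long URL burns the time.
        toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
        toUrl = filters.filter(toUrl);
      } catch (Exception e) {
        toUrl = null;
      }
      if (toUrl != null) {
        kept++; // this outlink would be kept and written out
      }
    }
    return kept;
  }
}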

Re: Parse reduce slow as a snail

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, we hit another trap, an endless list of crap URLs on many hosts. 

Be very careful when you see www.museum-zeeaquarium.netspirit.nl popping up in 
the logs a few too many times: never as the host but always tailing the URL. It 
has 'infected' URLs of many different hosts.

http://<ANY_HOST>/<URI_SEGMENT>/<PAGE>.html;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/contact.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/contact.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/www.museum-
zeeaquarium.netspirit.nl
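
Along the lines of Julien's suggestion earlier in the thread (trace the URLs
in the log and stop the large ones before they hit the normalisers), a custom
normaliser might look roughly like the sketch below. Again the class name and
the 2000-character limit are made up, and whether a rejected URL is skipped or
merely logged depends on how the calling code handles the exception.

// Hypothetical normaliser: log and refuse over-long URLs so the expensive
// regex normalisers never see them.
package org.example.urlnormalizer;

import java.net.MalformedURLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLNormalizer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MaxLengthURLNormalizer implements URLNormalizer {

  private static final Logger LOG =
      LoggerFactory.getLogger(MaxLengthURLNormalizer.class);
  private static final int MAX_URL_LENGTH = 2000; // assumed limit

  private Configuration conf;

  public String normalize(String urlString, String scope)
      throws MalformedURLException {
    if (urlString != null && urlString.length() > MAX_URL_LENGTH) {
      // Trace a prefix of the offending URL so patterns like the one above
      // show up in the task logs.
      LOG.warn("Rejecting " + urlString.length() + "-char URL in scope "
          + scope + ": " + urlString.substring(0, 200) + "...");
      throw new MalformedURLException(
          "URL exceeds " + MAX_URL_LENGTH + " characters");
    }
    return urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

To help with the parse reduce, such a plugin would have to run early in the
normaliser chain (the urlnormalizer.order property controls that) and, like
any plugin, be wired up with a plugin.xml and listed in plugin.includes.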


> Just like to add that we too have seen extremely slow tasks as a result of
> ridiculously long URLs. Adding a URL filter that filters out URLs longer
> than 2000 characters (or something like that) is pretty much mandatory for
> any serious internet crawl.