Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/08/31 00:29:55 UTC

Parse reduce slow as a snail

Hi,

Any idea why the reducer of the parse job is as slow as a snail taking a 
detour? There is no processing in the reducer; all it does is copy the keys and 
values.

The reduce step (meaning the last 33% of the reducer) is even slower than the 
whole parsing done in the mapper! It is even slower than the whole fetch job, 
even though it is the fetcher that produces the most output (high I/O).

A running cycle has the fetcher writing 70GiB and 10GiB to HDFS (total amounts) 
while the reducer has 7 times less data to write and no processing! Yet it 
takes about 3 times longer to complete. Stunning figures!

This excessive run time became apparent only when I significantly increased the 
number of URLs to generate (topN). When the topN was lower, the difference 
between the run times of the fetch and parse jobs was a lot smaller; usually it 
was the fetcher being slow because of merging the spills.

Any thoughts? 

Thanks

Re: Parse reduce slow as a snail

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, we hit another trap: an endless list of crap URLs on many hosts. 

Be very careful when you see www.museum-zeeaquarium.netspirit.nl popping up in 
the logs a few too many times: never as the host but always tailing the URL. It 
has 'infected' URLs of many different hosts.

http://<ANY_HOST>/<URI_SEGMENT>/<PAGE>.html;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/contact.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/contact.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderlan
d;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/www.museum-
zeeaquarium.netspirit.nl


> Just like to add that we too have seen extremely slow tasks as a result of
> ridiculously long URLs. Adding a URL filter that filters out URLs longer
> than 2000 characters (or something like that) is pretty mandatory for
> any serious internet crawl.
> 
> On 08/31/2011 11:49 AM, Markus Jelsma wrote:
> > Thanks for shedding some light. I was already looking for
> > filters/normalizers in the step but couldn't find them. I forgot to think
> > about the job's output format. Makes sense indeed.
> > 
> > Cheers
> > 
> > On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
> >> Hi Markus,
> >> 
> >> You are right in thinking that the reduce step does not do much in
> >> itself. It is not so much the reduce step that is likely to be the
> >> source of your problem but the URL filtering / normalizing within
> >> ParseOutputFormat. Basically we get outlinks as a result of the parse
> >> and when writing the output to HDFS we need to filter / normalise them.
> >> 
> >> I have seen problems on large crawls with ridiculously large URLs which
> >> put the normalisation in disarray, with the symptoms you described. You
> >> can add a trace in the log before normalising to see what the URLs look
> >> like, and add a custom normaliser which prevents large URLs from being
> >> processed.
> >> 
> >> As usual jstack is your friend and will confirm that this is where the
> >> problem is.
> >> 
> >> HTH
> >> 
> >> Julien
> >> 
> >> On 30 August 2011 23:39, Markus Jelsma<ma...@openindex.io>  
wrote:
> >>> I should add that I sometimes see a URL filter exception written to
> >>> the reduce log. I don't understand why this is the case; all the
> >>> ParseSegment.reduce() code does is collect key/value data.
> >>> 
> >>> I also should point out that most reducers finish in reasonable time
> >>> and it's always one task stalling the job to excessive lengths. The
> >>> cluster is homogeneous; this is not an assumption (I know the fallacies
> >>> of distributed computing ;) ). A server stalling the process is
> >>> identical to all others, and the replication factor is only 2 for all
> >>> files except the crawl db.
> >>> 
> >>> Please enlighten me.
> >>> 
> >>>> Hi,
> >>>> 
> >>>> Any idea why the reducer of the parse job is as slow as a snail taking
> >>>> a detour? There is no processing in the reducer; all it does is copy
> >>>> the keys and values.
> >>>> 
> >>>> The reduce step (meaning the last 33% of the reducer) is even slower
> >>>> than the whole parsing done in the mapper! It is even slower than the
> >>>> whole fetch job, even though it is the fetcher that produces the most
> >>>> output (high I/O).
> >>>> 
> >>>> A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total
> >>>> amount) while the reducer has 7 times less data to write and no
> >>>> processing! Yet it takes about 3 times longer to complete, stunning
> >>>> figures!
> >>>> 
> >>>> This excessive run time became apparent only when I significantly
> >>>> increased the number of URLs to generate (topN). When the topN was
> >>>> lower, the difference between the run times of the fetch and parse
> >>>> jobs was a lot smaller; usually it was the fetcher being slow because
> >>>> of merging the spills.
> >>>> 
> >>>> Any thoughts?
> >>>> 
> >>>> Thanks

Re: Parse reduce slow as a snail

Posted by Ferdy Galema <fe...@kalooga.com>.
Just like to add that we too have seen extremely slow tasks as a result of 
ridiculously long URLs. Adding a URL filter that filters out URLs longer 
than 2000 characters (or something like that) is pretty mandatory for 
any serious internet crawl.
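
For what it's worth, assuming the stock regex-urlfilter plugin is enabled, such a rule can be sketched as a reject pattern in conf/regex-urlfilter.txt; the threshold below is only an illustration, and the rule must appear before the final accept-all line:

```
# Illustrative sketch: reject any URL longer than ~2000 characters.
# Filter rules are evaluated top-down; a leading '-' rejects on match.
-^.{2000,}

# ... the existing rules follow, ending with the usual catch-all:
# +.
```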

On 08/31/2011 11:49 AM, Markus Jelsma wrote:
> Thanks for shedding some light. I was already looking for filters/normalizers
> in the step but couldn't find them. I forgot to think about the job's output
> format. Makes sense indeed.
>
> Cheers
>
> On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
>> Hi Markus,
>>
>> You are right in thinking that the reduce step does not do much in itself.
>> It is not so much the reduce step that is likely to be the source of your
>> problem but the URL filtering / normalizing within ParseOutputFormat.
>> Basically we get outlinks as a result of the parse and when writing the
>> output to HDFS we need to filter / normalise them.
>>
>> I have seen problems on large crawls with ridiculously large URLs which put
>> the normalisation in disarray, with the symptoms you described. You can add
>> a trace in the log before normalising to see what the URLs look like, and
>> add a custom normaliser which prevents large URLs from being processed.
>>
>> As usual jstack is your friend and will confirm that this is where the
>> problem is.
>>
>> HTH
>>
>> Julien
>>
>> On 30 August 2011 23:39, Markus Jelsma<ma...@openindex.io>  wrote:
>>> I should add that I sometimes see a URL filter exception written to the
>>> reduce log. I don't understand why this is the case; all the
>>> ParseSegment.reduce() code does is collect key/value data.
>>>
>>> I also should point out that most reducers finish in reasonable time and
>>> it's always one task stalling the job to excessive lengths. The cluster is
>>> homogeneous; this is not an assumption (I know the fallacies of distributed
>>> computing ;) ). A server stalling the process is identical to all others,
>>> and the replication factor is only 2 for all files except the crawl db.
>>>
>>> Please enlighten me.
>>>
>>>> Hi,
>>>>
>>>> Any idea why the reducer of the parse job is as slow as a snail taking
>>>> a detour? There is no processing in the reducer; all it does is copy
>>>> the keys and values.
>>>>
>>>> The reduce step (meaning the last 33% of the reducer) is even slower
>>>> than the whole parsing done in the mapper! It is even slower than the
>>>> whole fetch job, even though it is the fetcher that produces the most
>>>> output (high I/O).
>>>>
>>>> A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total
>>>> amount) while the reducer has 7 times less data to write and no
>>>> processing! Yet it takes about 3 times longer to complete, stunning
>>>> figures!
>>>>
>>>> This excessive run time became apparent only when I significantly
>>>> increased the number of URLs to generate (topN). When the topN was
>>>> lower, the difference between the run times of the fetch and parse jobs
>>>> was a lot smaller; usually it was the fetcher being slow because of
>>>> merging the spills.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks

Re: Parse reduce slow as a snail

Posted by Markus Jelsma <ma...@openindex.io>.
Thanks for shedding some light. I was already looking for filters/normalizers 
in the step but couldn't find them. I forgot to think about the job's output 
format. Makes sense indeed.

Cheers

On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
> Hi Markus,
> 
> You are right in thinking that the reduce step does not do much in itself.
> It is not so much the reduce step that is likely to be the source of your
> problem but the URL filtering / normalizing within ParseOutputFormat.
> Basically we get outlinks as a result of the parse and when writing the
> output to HDFS we need to filter / normalise them.
> 
> I have seen problems on large crawls with ridiculously large URLs which put
> the normalisation in disarray, with the symptoms you described. You can add
> a trace in the log before normalising to see what the URLs look like, and
> add a custom normaliser which prevents large URLs from being processed.
> 
> As usual jstack is your friend and will confirm that this is where the
> problem is.
> 
> HTH
> 
> Julien
> 
> On 30 August 2011 23:39, Markus Jelsma <ma...@openindex.io> wrote:
> > I should add that I sometimes see a URL filter exception written to the
> > reduce log. I don't understand why this is the case; all the
> > ParseSegment.reduce() code does is collect key/value data.
> > 
> > I also should point out that most reducers finish in reasonable time and
> > it's always one task stalling the job to excessive lengths. The cluster
> > is homogeneous; this is not an assumption (I know the fallacies of
> > distributed computing ;) ). A server stalling the process is identical
> > to all others, and the replication factor is only 2 for all files except
> > the crawl db.
> > 
> > Please enlighten me.
> > 
> > > Hi,
> > > 
> > > Any idea why the reducer of the parse job is as slow as a snail taking
> > > a detour? There is no processing in the reducer; all it does is copy
> > > the keys and values.
> > > 
> > > The reduce step (meaning the last 33% of the reducer) is even slower
> > > than the whole parsing done in the mapper! It is even slower than the
> > > whole fetch job, even though it is the fetcher that produces the most
> > > output (high I/O).
> > > 
> > > A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total
> > > amount) while the reducer has 7 times less data to write and no
> > > processing! Yet it takes about 3 times longer to complete, stunning
> > > figures!
> > > 
> > > This excessive run time became apparent only when I significantly
> > > increased the number of URLs to generate (topN). When the topN was
> > > lower, the difference between the run times of the fetch and parse jobs
> > > was a lot smaller; usually it was the fetcher being slow because of
> > > merging the spills.
> > > 
> > > Any thoughts?
> > > 
> > > Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parse reduce slow as a snail

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,

You are right in thinking that the reduce step does not do much in itself.
It is not so much the reduce step that is likely to be the source of your
problem but the URL filtering / normalizing within ParseOutputFormat.
Basically we get outlinks as a result of the parse and when writing the
output to HDFS we need to filter / normalise them.

I have seen problems on large crawls with ridiculously large URLs which put
the normalisation in disarray, with the symptoms you described. You can add a
trace in the log before normalising to see what the URLs look like, and add a
custom normaliser which prevents large URLs from being processed.
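
As a rough sketch of such a guard: in a real Nutch plugin this logic would live in a class implementing org.apache.nutch.net.URLNormalizer, but here it is a plain static method so it stands alone; the class name and the 2000-character threshold are illustrative assumptions, not anything from Nutch itself.

```java
// Sketch of a length-guarding normalizer step. In an actual Nutch plugin
// this would be wrapped in a class implementing the URLNormalizer
// interface; here it is a standalone static method for illustration.
public class LengthGuardNormalizer {

    // Assumed threshold; tune it for your crawl.
    private static final int MAX_URL_LENGTH = 2000;

    // Returns null for over-long URLs so the caller can drop them
    // before running the (expensive) regex-based normalizers.
    public static String normalize(String urlString) {
        if (urlString == null || urlString.length() > MAX_URL_LENGTH) {
            return null; // discard instead of normalizing
        }
        return urlString; // hand over to the regular normalizers
    }

    public static void main(String[] args) {
        // Build a URL well past the limit, like the runaway ones above.
        StringBuilder longUrl = new StringBuilder("http://example.com/");
        while (longUrl.length() <= MAX_URL_LENGTH) {
            longUrl.append("/Museum_Gelderland");
        }
        System.out.println(normalize("http://example.com/") != null); // true
        System.out.println(normalize(longUrl.toString()) == null);    // true
    }
}
```

Dropping such URLs here also keeps them out of the outlink set entirely, so they never reach the crawl db.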

As usual jstack is your friend and will confirm that this is where the
problem is.

HTH

Julien

On 30 August 2011 23:39, Markus Jelsma <ma...@openindex.io> wrote:

> I should add that I sometimes see a URL filter exception written to the
> reduce log. I don't understand why this is the case; all the
> ParseSegment.reduce() code does is collect key/value data.
>
> I also should point out that most reducers finish in reasonable time and
> it's always one task stalling the job to excessive lengths. The cluster is
> homogeneous; this is not an assumption (I know the fallacies of distributed
> computing ;) ). A server stalling the process is identical to all others,
> and the replication factor is only 2 for all files except the crawl db.
>
> Please enlighten me.
>
> > Hi,
> >
> > Any idea why the reducer of the parse job is as slow as a snail taking a
> > detour? There is no processing in the reducer; all it does is copy the
> > keys and values.
> >
> > The reduce step (meaning the last 33% of the reducer) is even slower than
> > the whole parsing done in the mapper! It is even slower than the whole
> > fetch job, even though it is the fetcher that produces the most output
> > (high I/O).
> >
> > A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total
> > amount) while the reducer has 7 times less data to write and no
> > processing! Yet it takes about 3 times longer to complete, stunning
> > figures!
> >
> > This excessive run time became apparent only when I significantly
> > increased the number of URLs to generate (topN). When the topN was lower,
> > the difference between the run times of the fetch and parse jobs was a
> > lot smaller; usually it was the fetcher being slow because of merging
> > the spills.
> >
> > Any thoughts?
> >
> > Thanks
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Parse reduce slow as a snail

Posted by Markus Jelsma <ma...@openindex.io>.
I should add that I sometimes see a URL filter exception written to the 
reduce log. I don't understand why this is the case; all the 
ParseSegment.reduce() code does is collect key/value data.

I also should point out that most reducers finish in reasonable time and it's 
always one task stalling the job to excessive lengths. The cluster is 
homogeneous; this is not an assumption (I know the fallacies of distributed 
computing ;) ). A server stalling the process is identical to all others, and 
the replication factor is only 2 for all files except the crawl db.

Please enlighten me.

> Hi,
> 
> Any idea why the reducer of the parse job is as slow as a snail taking a
> detour? There is no processing in the reducer; all it does is copy the keys
> and values.
> 
> The reduce step (meaning the last 33% of the reducer) is even slower than
> the whole parsing done in the mapper! It is even slower than the whole
> fetch job, even though it is the fetcher that produces the most output
> (high I/O).
> 
> A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total
> amount) while the reducer has 7 times less data to write and no
> processing! Yet it takes about 3 times longer to complete, stunning
> figures!
> 
> This excessive run time became apparent only when I significantly increased
> the number of URLs to generate (topN). When the topN was lower, the
> difference between the run times of the fetch and parse jobs was a lot
> smaller; usually it was the fetcher being slow because of merging the
> spills.
> 
> Any thoughts?
> 
> Thanks