Posted to user@nutch.apache.org by Kevin MacDonald <ke...@hautesecure.com> on 2008/09/16 23:10:33 UTC

Possible Crawling bug

See the code snippet below from org.apache.nutch.crawl.Crawl. I think
parsing happens opposite to what the nutch-site.xml config file indicates.

public static void main(...) {
     ...

      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segment);    // parse it, if needed
      }

     ...
}
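
For reference, Fetcher.isParsing appears to be just a lookup of the
fetcher.parse boolean in the job configuration (which nutch-site.xml can
override). A minimal sketch of that check, assuming the 0.9-era
Configuration API:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class IsParsingCheck {
  public static void main(String[] args) {
    // The same lookup Fetcher.isParsing is believed to perform: defaults
    // to true, i.e. parse inline during fetching unless configured otherwise.
    Configuration conf = NutchConfiguration.create();
    System.out.println(conf.getBoolean("fetcher.parse", true));
  }
}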


Kevin

Re: Possible Crawling bug

Posted by Kevin MacDonald <ke...@hautesecure.com>.
That is odd, because I am finding that after the last fetch completes there
is a lengthy period of computation before the fetch/parse cycle is done.
Fetching itself happens at a rate of about 1000 URLs per minute, which seems
fine, but the additional processing makes the overall time rather slow. I
cranked up logging and saw a great deal of output like that shown below,
which seems to be taking up all the time. I'm wondering if there's something
I can do to optimize Nutch for a single-machine install.
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes read at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes written at 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$Counte
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output records at 1
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input bytes at 2
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output bytes at 3
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine input records at 4
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine output records at 5
mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(258)) - 2613 pages, 613 errors, 6.7 pages/s, 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
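
If the counter bookkeeping itself is what dominates the log, one thing I may
try is raising the log threshold for the Hadoop mapred classes before
starting the crawl. A hypothetical sketch, assuming the log4j 1.x API that a
stock Nutch install ships with:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietMapredLogs {
  public static void main(String[] args) {
    // Silence the per-counter DEBUG/INFO chatter from
    // org.apache.hadoop.mapred while keeping warnings and errors visible.
    Logger.getLogger("org.apache.hadoop.mapred").setLevel(Level.WARN);
  }
}

Of course that only hides the noise; whether it improves the wall-clock time
would need measuring.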

On Fri, Sep 19, 2008 at 2:27 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Kevin MacDonald wrote:
>
>> Which is better for overall performance? To parse during fetching or
>> afterward?
>>
>
> It's slightly faster to parse during fetching ... BUT if a parser crashes
> or hits an OOM exception, you are left without content and without parsed
> text, whereas if you fetch first and then parse, you at least already have
> the content and can re-run the parse job. Usually the process of getting
> content from remote sites is the bottleneck.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Possible Crawling bug

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kevin MacDonald wrote:
> Which is better for overall performance? To parse during fetching or
> afterward?

It's slightly faster to parse during fetching ... BUT if a parser crashes
or hits an OOM exception, you are left without content and without parsed
text, whereas if you fetch first and then parse, you at least already have
the content and can re-run the parse job. Usually the process of getting
content from remote sites is the bottleneck.
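
In other words, with fetcher.parse set to false you can always repeat just
the parse step over an existing segment. A minimal sketch, mirroring what
the Crawl class does with ParseSegment:

import org.apache.hadoop.fs.Path;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class ReParse {
  public static void main(String[] args) throws Exception {
    // Re-run parsing over an already-fetched segment, e.g. after a parser
    // crash, without re-fetching any content.
    ParseSegment parseSegment = new ParseSegment(NutchConfiguration.create());
    parseSegment.parse(new Path(args[0]));  // path to the segment directory
  }
}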


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Possible Crawling bug

Posted by Kevin MacDonald <ke...@hautesecure.com>.
Which is better for overall performance? To parse during fetching or
afterward?

On Thu, Sep 18, 2008 at 4:01 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Kevin MacDonald wrote:
>
>> I'm sure it's just my ignorance of some basics of Nutch. The way I read
>> that code, it said to me "if I'm not supposed to parse, go ahead and parse".
>>
>
> "If I'm not supposed to parse during fetching, go ahead and parse it after
> I'm done with fetching, because I only have unparsed content".
>
> You still need parsing in order to get the outlinks.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Possible Crawling bug

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kevin MacDonald wrote:
> I'm sure it's just my ignorance of some basics of Nutch. The way I read that
> code, it said to me "if I'm not supposed to parse, go ahead and parse".

"If I'm not supposed to parse during fetching, go ahead and parse it 
after I'm done with fetching, because I only have unparsed content".

You still need parsing in order to get the outlinks.
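
The outlinks end up in the segment's parse_data directory, which the
updatedb step then reads. Something like this sketch can dump them, assuming
a single-reducer local run (hence the hypothetical part-00000 name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.NutchConfiguration;

public class DumpOutlinks {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // parse_data is a MapFile of url -> ParseData written by the parse job.
    Path part = new Path(new Path(args[0], ParseData.DIR_NAME), "part-00000");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text url = new Text();
    ParseData data = new ParseData();
    while (reader.next(url, data)) {
      for (Outlink out : data.getOutlinks()) {
        System.out.println(url + " -> " + out.getToUrl());
      }
    }
    reader.close();
  }
}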


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Possible Crawling bug

Posted by Kevin MacDonald <ke...@hautesecure.com>.
I'm sure it's just my ignorance of some basics of Nutch. The way I read that
code, it said to me "if I'm not supposed to parse, go ahead and parse".

On Thu, Sep 18, 2008 at 2:33 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Kevin MacDonald wrote:
>
>> See the code snippet below from org.apache.nutch.crawl.Crawl. I think
>> parsing happens opposite to what the nutch-site.xml config file indicates.
>>
>> public static void main(...) {
>>     ...
>>
>>      if (!Fetcher.isParsing(job)) {
>>        parseSegment.parse(segment);    // parse it, if needed
>>      }
>>
>>     ...
>> }
>>
>
> What do you mean? This snippet simply shows that if you set the Fetcher to
> non-parsing mode, parsing needs to run as a separate, explicit step. In any
> case you need to parse the content in order to collect links and update
> the db.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Possible Crawling bug

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kevin MacDonald wrote:
> See the code snippet below from org.apache.nutch.crawl.Crawl. I think
> parsing happens opposite to what the nutch-site.xml config file indicates.
> 
> public static void main(...) {
>      ...
> 
>       if (!Fetcher.isParsing(job)) {
>         parseSegment.parse(segment);    // parse it, if needed
>       }
> 
>      ...
> }

What do you mean? This snippet simply shows that if you set the Fetcher
to non-parsing mode, parsing needs to run as a separate, explicit step.
In any case you need to parse the content in order to collect links and
update the db.
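
Put differently, the per-segment pipeline is fetch, then parse (inline or as
its own job), then updatedb. A sketch of the explicit variant, assuming the
0.9-era CrawlDb and ParseSegment signatures that Crawl itself uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentPipeline {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path crawlDb = new Path(args[0]);   // e.g. crawl/crawldb
    Path segment = new Path(args[1]);   // an already-fetched segment
    // If the fetcher ran in non-parsing mode, produce the parse data now...
    if (!Fetcher.isParsing(conf)) {
      new ParseSegment(conf).parse(segment);
    }
    // ...because the crawldb update needs the parsed outlinks either way.
    new CrawlDb(conf).update(crawlDb, new Path[] { segment }, true, true);
  }
}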



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com