Posted to user@nutch.apache.org by Karen Church <ka...@ucd.ie> on 2005/10/25 23:14:54 UTC

Outlinks?

Hi,

I have a strange question regarding outlinks. I crawled the same page on two consecutive days. On the first day the page has 10 outlinks, but on the second day no outlinks are generated or recorded, even though the content of the page hasn't changed. Can anyone suggest a reason for this? Am I doing something wrong?

Thanks,
Karen

Re: Outlinks?

Posted by Karen Church <ka...@ucd.ie>.
----- Original Message ----- 
From: "Andrzej Bialecki" <ab...@getopt.org>
To: <nu...@lucene.apache.org>
Sent: Monday, November 07, 2005 11:40 AM
Subject: Re: Outlinks?


> Karen Church wrote:
>
>> Hi Andrzej,
>>
>> Thanks for the reply. Regarding the outlink limit - I thought it was a 
>> limit of 100 outlinks per page by default? And in these cases the first 
>> 100 outlinks are stored. I have a few pages like this in the crawl 
>> database. The problem I'm having is the outlink object is empty for 
>> some pages when on previous days the outlink object wasn't empty and 
>> contained outlinks.
>
>
> Ok, it's clear now.
>
>>
>> At the moment I'm using the following code in my FOR loop while reading 
>> the segment to make sure that I ignore pages that couldn't be fetched and 
>> pages that could not be parsed....
>>
>> if(fetcherOutput.getStatus() != FetcherOutput.SUCCESS)
>> {
>>    continue;
>> }
>>
>> I've also checked the status of a couple of pages whose outlinks are 
>> missing and they all appear to have a SUCCESS status.
>
>
> My point was that there is another status (ParseData.status) which you 
> should check - the absence of outlinks indicates that there were problems 
> in parsing the page. Can you see things like page title, metadata etc. 
> under ParseData section in the segread output? Can you also see the page 
> content, to confirm that it was fetched properly?
>

I didn't realize there was a ParseData.status. At the moment I'm not 
checking the ParseData status, but I've just checked and, for the pages with 
missing outlinks, I can see the content (parsed text) and the metadata of 
the page, but the titles are blank when they previously were not. It 
definitely points to a parsing error; however, I'm using version 0.6 of 
Nutch, which doesn't support ParseData.status.

Also, this isn't a problem with the HTML parser provided with Nutch - this 
is a parser I wrote for WML pages, so it could well be a problem with that. 
It's just strange that the title and outlinks are present on one day and 
gone the next, even though the content and metadata remain untouched. This 
points to errors in my code - I'll have to look into it in more detail.
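For what it's worth, here's a minimal, self-contained sketch (hypothetical code, not my actual WML parser and not the Nutch API) of how a parser that swallows its own exceptions produces exactly this symptom: the fetch status stays SUCCESS, but the parse result comes back with a blank title and no outlinks.

```java
import java.util.Collections;
import java.util.List;

public class ParseSketch {
    // Minimal stand-in for a parse result: a title plus outlink URLs.
    static class ParseResult {
        final String title;
        final List<String> outlinks;
        ParseResult(String title, List<String> outlinks) {
            this.title = title;
            this.outlinks = outlinks;
        }
    }

    // A parser that catches its own exceptions and returns an empty
    // result instead of propagating the failure upward.
    static ParseResult parseWml(String content) {
        try {
            if (!content.startsWith("<?xml")) {
                throw new RuntimeException("not valid WML");
            }
            return new ParseResult("Example title",
                    List.of("http://example.com/a", "http://example.com/b"));
        } catch (RuntimeException e) {
            // The failure is silently converted into an empty parse --
            // exactly the symptom seen in the segment dump: blank title,
            // zero outlinks, fetch status still SUCCESS.
            return new ParseResult("", Collections.emptyList());
        }
    }

    public static void main(String[] args) {
        ParseResult ok = parseWml("<?xml version=\"1.0\"?><wml></wml>");
        ParseResult bad = parseWml("garbled bytes from a flaky fetch");
        System.out.println("ok:  title='" + ok.title + "' outlinks=" + ok.outlinks.size());
        System.out.println("bad: title='" + bad.title + "' outlinks=" + bad.outlinks.size());
    }
}
```

If the WML parser does anything like this, an intermittent upstream problem (e.g. occasionally garbled or truncated content) would explain the title and outlinks vanishing on some days only.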

Thanks and regards,
Karen

> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> 


Re: Outlinks?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Karen Church wrote:

> Hi Andrzej,
>
> Thanks for the reply. Regarding the outlink limit - I thought it was a 
> limit of 100 outlinks per page by default? And in these cases the 
> first 100 outlinks are stored. I have a few pages like this in the 
> crawl database. The problem I'm having is the outlink object is empty 
> for some pages when on previous days the outlink object wasn't empty 
> and contained outlinks.


Ok, it's clear now.

>
> At the moment I'm using the following code in my FOR loop while 
> reading the segment to make sure that I ignore pages that couldn't be 
> fetched and pages that could not be parsed....
>
> if(fetcherOutput.getStatus() != FetcherOutput.SUCCESS)
> {
>    continue;
> }
>
> I've also checked the status of a couple of pages whose outlinks are 
> missing and they all appear to have a SUCCESS status.


My point was that there is another status (ParseData.status) which you 
should check - the absence of outlinks indicates that there were 
problems in parsing the page. Can you see things like page title, 
metadata etc. under ParseData section in the segread output? Can you 
also see the page content, to confirm that it was fetched properly?
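As a rough illustration of the two-level check (stand-in types only, not the real Nutch classes), the loop should reject an entry unless both the fetch succeeded and the parse actually produced something; where no parse status is available, a completely empty parse result can serve as a proxy signal:

```java
import java.util.List;

public class SegmentFilter {
    static final int SUCCESS = 1; // stand-in for FetcherOutput.SUCCESS

    // Minimal stand-in for one segment entry.
    static class Entry {
        final int fetchStatus;
        final String title;
        final List<String> outlinks;
        Entry(int fetchStatus, String title, List<String> outlinks) {
            this.fetchStatus = fetchStatus;
            this.title = title;
            this.outlinks = outlinks;
        }
    }

    // True only when the entry was fetched AND produced a non-empty parse.
    static boolean fetchedAndParsed(Entry e) {
        if (e.fetchStatus != SUCCESS) {
            return false; // fetch failed outright
        }
        // Fetch succeeded, but a blank title together with zero outlinks
        // suggests the parser failed and returned an empty result.
        return !(e.title.isEmpty() && e.outlinks.isEmpty());
    }

    public static void main(String[] args) {
        Entry good = new Entry(SUCCESS, "Home", List.of("http://example.com/x"));
        Entry parseFailed = new Entry(SUCCESS, "", List.of());
        System.out.println(fetchedAndParsed(good));        // true
        System.out.println(fetchedAndParsed(parseFailed)); // false
    }
}
```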

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Outlinks?

Posted by Karen Church <ka...@ucd.ie>.
Hi Andrzej,

Thanks for the reply. Regarding the outlink limit - I thought it was a limit 
of 100 outlinks per page by default? And in these cases the first 100 
outlinks are stored. I have a few pages like this in the crawl database. The 
problem I'm having is the outlink object is empty for some pages when on 
previous days the outlink object wasn't empty and contained outlinks.

At the moment I'm using the following code in my FOR loop while reading the 
segment to make sure that I ignore pages that couldn't be fetched and pages 
that could not be parsed....

if(fetcherOutput.getStatus() != FetcherOutput.SUCCESS)
{
    continue;
}

I've also checked the status of a couple of pages whose outlinks are missing 
and they all appear to have a SUCCESS status.

Regards,
Karen

> Hello Karen,
>
> Outlinks should be stored in the segment, so that's the right place to 
> look for them.
>
> One common source of missing outlinks is if you hit a maximum number of 
> outlinks limit - but this is set to 100 by default. Another common issue 
> is if the content parser catches an exception, then you will get a 
> positive status for fetch, but an error in parsing, hence no outlinks. 
> Could you use the "segread" command on these two records, and check the 
> status both for the fetch and the parsing stages?
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> 


Re: Outlinks?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Karen Church wrote:

> I investigated the crawler output in more detail and discovered that, 
> for over 90% of the pages that have outlinks one day but not the next 
> (even though their content has not changed), I can account for the 
> outlinks somewhere else in that day's crawl, i.e. they either appear as 
> the outlinks of another page or as the URL of a fetched page, so it 
> looks like they aren't fetched again because they have already been 
> fetched that day.
>
> However, I'm still encountering some problems in understanding what 
> happened to the other 10%. I checked a few of the outlinks by hand and 
> some could not be crawled due to HTTP errors but can someone please 
> explain why the rest of the outlinks aren't stored? Are there some 
> standard things I can check for? Is this normal behavior? At the 
> moment I'm only looking in the resulting crawl segment for these 
> outlinks - should I be looking somewhere else?
>
> I'd really, really appreciate some help with this.


Hello Karen,

Outlinks should be stored in the segment, so that's the right place to 
look for them.

One common source of missing outlinks is if you hit a maximum number of 
outlinks limit - but this is set to 100 by default. Another common issue 
is if the content parser catches an exception, then you will get a 
positive status for fetch, but an error in parsing, hence no outlinks. 
Could you use the "segread" command on these two records, and check the 
status both for the fetch and the parsing stages?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Outlinks?

Posted by Karen Church <ka...@ucd.ie>.
I investigated the crawler output in more detail and discovered that, for 
over 90% of the pages that have outlinks one day but not the next (even 
though their content has not changed), I can account for the outlinks 
somewhere else in that day's crawl, i.e. they either appear as the outlinks 
of another page or as the URL of a fetched page, so it looks like they 
aren't fetched again because they have already been fetched that day.
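The cross-check I did can be sketched like this (all URLs hypothetical): collect every URL fetched during the day, then see which of a page's missing outlinks are simply accounted for elsewhere in the crawl, leaving only the genuinely unexplained ones.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class OutlinkAccounting {
    // Returns the missing outlinks that are NOT accounted for by any
    // URL fetched elsewhere in the same day's crawl.
    static List<String> unexplained(List<String> missingOutlinks,
                                    Set<String> fetchedToday) {
        return missingOutlinks.stream()
                .filter(url -> !fetchedToday.contains(url))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> fetchedToday = Set.of(
                "http://example.com/a", "http://example.com/b");
        List<String> missing = List.of(
                "http://example.com/a",  // fetched elsewhere -> accounted for
                "http://example.com/c"); // genuinely unexplained
        System.out.println(unexplained(missing, fetchedToday));
    }
}
```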

However, I'm still encountering some problems in understanding what happened 
to the other 10%. I checked a few of the outlinks by hand and some could not 
be crawled due to HTTP errors but can someone please explain why the rest of 
the outlinks aren't stored? Are there some standard things I can check for? 
Is this normal behavior? At the moment I'm only looking in the resulting 
crawl segment for these outlinks - should I be looking somewhere else?

I'd really, really appreciate some help with this.
Thanks,
Karen

----- Original Message ----- 
From: "Karen Church" <ka...@ucd.ie>
To: <nu...@lucene.apache.org>
Sent: Tuesday, October 25, 2005 9:14 PM
Subject: Outlinks?


> Hi,
>
> I have a strange question regarding outlinks. I crawled the same page 
> on two consecutive days. On the first day the page has 10 outlinks, but 
> on the second day no outlinks are generated or recorded, even though 
> the content of the page hasn't changed. Can anyone suggest a reason for 
> this? Am I doing something wrong?
>
> Thanks,
> Karen
>