Posted to user@nutch.apache.org by Karen Church <ka...@ucd.ie> on 2005/10/25 23:14:54 UTC
Outlinks?
Hi,
I have a strange question regarding outlinks. I have crawled the same page on two consecutive days. On the first day the page has 10 outlinks, but on the second day no outlinks are generated or recorded, even though the content of the page hasn't changed. Can anyone suggest a reason for this? Am I doing something wrong?
Thanks,
Karen
Re: Outlinks?
Posted by Karen Church <ka...@ucd.ie>.
----- Original Message -----
From: "Andrzej Bialecki" <ab...@getopt.org>
To: <nu...@lucene.apache.org>
Sent: Monday, November 07, 2005 11:40 AM
Subject: Re: Outlinks?
> Karen Church wrote:
>
>> Hi Andrzej,
>>
>> Thanks for the reply. Regarding the outlink limit - I thought it was a
>> limit of 100 outlinks per page by default? And in these cases the first
>> 100 outlinks are stored. I have a few pages like this in the crawl
>> database. The problem I'm having is that the outlink object is empty for
>> some pages when on previous days the outlink object wasn't empty and
>> contained outlinks.
>
>
> Ok, it's clear now.
>
>>
>> At the moment I'm using the following code in my FOR loop while reading
>> the segment to make sure that I ignore pages that couldn't be fetched and
>> pages that could not be parsed....
>>
>> if (fetcherOutput.getStatus() != FetcherOutput.SUCCESS) {
>>     continue;
>> }
>>
>> I've also checked the status of a couple of pages whose outlinks are
>> missing and they all appear to have a SUCCESS status.
>
>
> My point was that there is another status (ParseData.status) which you
> should check - the absence of outlinks indicates that there were problems
> in parsing the page. Can you see things like page title, metadata etc.
> under ParseData section in the segread output? Can you also see the page
> content, to confirm that it was fetched properly?
>
I didn't realize there was a ParseData.status. At the moment I'm not
checking the ParseData status, but I've just checked and, for the pages with
missing outlinks, I can see the content (parsed text) and the metadata of the
page, but the titles are blank when they previously were not. It definitely
points to a parsing error; however, I'm using version 0.6 of Nutch, which
doesn't support ParseData.status.
Also, this isn't a problem with the HTML parser provided with Nutch - this
is a parser I wrote for WML pages, so it could well be a problem with that.
It's just strange that the title and outlinks are present on one day and
gone the next, even though the content and metadata remain untouched. This
obviously points to errors in my code - I'll have to look into it in more
detail....
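The two-stage check being suggested can be sketched like this; the Record class and status constants below are hypothetical stand-ins for Nutch's FetcherOutput and ParseData objects, so this only illustrates the shape of the pattern, not the real Nutch API:

```java
// Illustration of the two-stage status check: a page can fetch
// successfully and still produce no outlinks if parsing failed.
// Record and the status constants are hypothetical stand-ins for
// Nutch's FetcherOutput / ParseData objects, not the real API.
import java.util.ArrayList;
import java.util.List;

public class OutlinkFilter {
    static final int SUCCESS = 1;
    static final int FAILED = 0;

    // Stand-in for one segment entry.
    static class Record {
        final int fetchStatus;        // analogous to FetcherOutput.getStatus()
        final int parseStatus;        // analogous to a ParseData-level status
        final List<String> outlinks;  // outlinks found by the parser

        Record(int fetchStatus, int parseStatus, List<String> outlinks) {
            this.fetchStatus = fetchStatus;
            this.parseStatus = parseStatus;
            this.outlinks = outlinks;
        }
    }

    // Keep only records where BOTH the fetch and the parse succeeded;
    // checking the fetch status alone lets parse failures through.
    static List<Record> usable(List<Record> segment) {
        List<Record> kept = new ArrayList<>();
        for (Record r : segment) {
            if (r.fetchStatus != SUCCESS) continue; // fetch failed
            if (r.parseStatus != SUCCESS) continue; // fetched OK, parse failed
            kept.add(r);
        }
        return kept;
    }
}
```

A record that fetched fine but parsed to a blank title and an empty outlink list would be dropped by the second check, which the fetch-status-only loop above lets through.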
Thanks and regards,
Karen
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
Re: Outlinks?
Posted by Andrzej Bialecki <ab...@getopt.org>.
Karen Church wrote:
> Hi Andrzej,
>
> Thanks for the reply. Regarding the outlink limit - I thought it was a
> limit of 100 outlinks per page by default? And in these cases the
> first 100 outlinks are stored. I have a few pages like this in the
> crawl database. The problem I'm having is the outlink object is empty
> for some pages when on previous days the outlink object wasn't empty
> and contained outlinks.
Ok, it's clear now.
>
> At the moment I'm using the following code in my FOR loop while
> reading the segment to make sure that I ignore pages that couldn't be
> fetched and pages that could not be parsed....
>
> if (fetcherOutput.getStatus() != FetcherOutput.SUCCESS) {
>     continue;
> }
>
> I've also checked the status of a couple of pages whose outlinks are
> missing and they all appear to have a SUCCESS status.
My point was that there is another status (ParseData.status) which you
should check - the absence of outlinks indicates that there were
problems in parsing the page. Can you see things like page title,
metadata etc. under ParseData section in the segread output? Can you
also see the page content, to confirm that it was fetched properly?
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com
Re: Outlinks?
Posted by Karen Church <ka...@ucd.ie>.
Hi Andrzej,
Thanks for the reply. Regarding the outlink limit - I thought it was a limit
of 100 outlinks per page by default? And in these cases the first 100
outlinks are stored. I have a few pages like this in the crawl database. The
problem I'm having is that the outlink object is empty for some pages when on
previous days the outlink object wasn't empty and contained outlinks.
At the moment I'm using the following code in my FOR loop while reading the
segment to make sure that I ignore pages that couldn't be fetched and pages
that could not be parsed....
if (fetcherOutput.getStatus() != FetcherOutput.SUCCESS) {
    continue;
}
I've also checked the status of a couple of pages whose outlinks are missing
and they all appear to have a SUCCESS status.
Regards,
Karen
> Hello Karen,
>
> Outlinks should be stored in the segment, so that's the right place to
> look for them.
>
> One common source of missing outlinks is if you hit a maximum number of
> outlinks limit - but this is set to 100 by default. Another common issue
> is if the content parser catches an exception, then you will get a
> positive status for fetch, but an error in parsing, hence no outlinks.
> Could you use the "segread" command on these two records, and check the
> status both for the fetch and the parsing stages?
>
> --
> Best regards,
> Andrzej Bialecki <><
> http://www.sigram.com Contact: info at sigram dot com
Re: Outlinks?
Posted by Andrzej Bialecki <ab...@getopt.org>.
Karen Church wrote:
> I investigated the crawler output in more detail and discovered that,
> for over 90% of the pages that have outlinks one day but not the next
> (even though their content has not changed), I can account for the
> missing outlinks somewhere else in that day's crawl, i.e. the outlinks
> either appear as the outlinks of another page or as the URL of a page,
> so it looks like they aren't fetched because they have already been
> fetched that day.
>
> However, I'm still having trouble understanding what happened to the
> other 10%. I checked a few of the outlinks by hand and some could not be
> crawled due to HTTP errors, but can someone please explain why the rest
> of the outlinks aren't stored? Are there some
> standard things I can check for? Is this normal behavior? At the
> moment I'm only looking in the resulting crawl segment for these
> outlinks - should I be looking somewhere else?
>
> I'd really, really appreciate some help with this.
Hello Karen,
Outlinks should be stored in the segment, so that's the right place to
look for them.
One common source of missing outlinks is if you hit a maximum number of
outlinks limit - but this is set to 100 by default. Another common issue
is if the content parser catches an exception, then you will get a
positive status for fetch, but an error in parsing, hence no outlinks.
Could you use the "segread" command on these two records, and check the
status both for the fetch and the parsing stages?
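As an aside, the 100-outlink cap is a configurable property. In later Nutch releases it appears in nutch-default.xml roughly as shown below; the exact property name may differ in the 0.6/0.7 era, so verify it against the nutch-default.xml shipped with your version before overriding it in nutch-site.xml:

```xml
<!-- Assumed property name (as in later Nutch releases); check the
     nutch-default.xml that ships with your version. In later releases
     a value of -1 means no limit. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>Maximum number of outlinks processed for a page.</description>
</property>
```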
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com
Re: Outlinks?
Posted by Karen Church <ka...@ucd.ie>.
I investigated the crawler output in more detail and discovered that, for
over 90% of the pages that have outlinks one day but not the next (even
though their content has not changed), I can account for the missing
outlinks somewhere else in that day's crawl, i.e. the outlinks either appear
as the outlinks of another page or as the URL of a page, so it looks like
they aren't fetched because they have already been fetched that day.
However, I'm still having trouble understanding what happened to the other
10%. I checked a few of the outlinks by hand and some could not be crawled
due to HTTP errors, but can someone please explain why the rest of
the outlinks aren't stored? Are there some standard things I can check for?
Is this normal behavior? At the moment I'm only looking in the resulting
crawl segment for these outlinks - should I be looking somewhere else?
I'd really, really appreciate some help with this.
Thanks,
Karen
----- Original Message -----
From: "Karen Church" <ka...@ucd.ie>
To: <nu...@lucene.apache.org>
Sent: Tuesday, October 25, 2005 9:14 PM
Subject: Outlinks?
> Hi,
>
> I have a strange question regarding outlinks. I have crawled the same page
> on two consecutive days. On the first day the page has 10 outlinks, but on
> the second day no outlinks are generated or recorded, even though the
> content of the page hasn't changed. Can anyone suggest a reason for this?
> Am I doing something wrong?
>
> Thanks,
> Karen
>