
Posted to user@nutch.apache.org by Peter Swoboda <pr...@gmx.de> on 2007/02/13 08:21:09 UTC

How does "ignore external links" work?

Hi,
we're using Nutch 0.8.
In default.xml, "ignore external links" is set to "true".
Can anybody tell me where we can find the code for this property?
Our problem is that there are now many "internal" pages that
aren't indexed.
That doesn't seem to make sense, because they are on the same server as
other indexed pages.
When we set "ignore external links" to "false", they are indexed.
What could be the problem?

Peter

Re: How does "ignore external links" work?

Posted by Peter Swoboda <pr...@gmx.de>.

Hi,
thanks for your answer.

It is definitely the same host.
I'll give you an example:

In crawl-urlfilter, the host is set to "uni-siegen.de".

http://www.uni-siegen.de/dept/fb05/dekanat/
is indexed, but
http://www.uni-siegen.de/~merk/
isn't indexed.
Any idea?
What about the code? I'd like to see how it works.



Re: How does "ignore external links" work?

Posted by Doğacan Güney <do...@agmlab.com>.

Hi,

Peter Swoboda wrote:
> Hi,
> we're using Nutch 0.8.
> In default.xml, "ignore external links" is set to "true".
> Can anybody tell me where we can find the code for this property?
> Our problem is that there are now many "internal" pages that
> aren't indexed.
> That doesn't seem to make sense, because they are on the same server as
> other indexed pages.
> When we set "ignore external links" to "false", they are indexed.
> What could be the problem?
>
Do you have different hosts on your server?

The ignore.external.links property, if set to true, ignores links whose
_host_ is different from that of the source page.

For example, assume the page www.bar.com/index.html contains a link to
foo.bar.com/page.html. If ignore.external.links is true, the host of the
source page (www.bar.com) and the host of the link (foo.bar.com) are
compared, and since they differ, the link is ignored, even though both
are probably on the same server.

So only links within the exact same host (in this case, www.bar.com) are
followed.
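
The host comparison described above can be illustrated with a small
standalone Java sketch. This is not the Nutch source; the class name
ExternalLinkCheck and the method isExternal are made up for the
illustration, and treating a URL without a host as external is my own
assumption. It only shows the idea of comparing the host part of the
source page with the host part of the link.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: mimics the idea of
// "a link is external when its host differs from the source page's host".
public class ExternalLinkCheck {

    // Returns true when the link's host differs from the source page's host.
    static boolean isExternal(String fromUrl, String toUrl) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // Assumption: URLs without a host (e.g. mailto:) count as external.
        if (fromHost == null || toHost == null) {
            return true;
        }
        // Host names are case-insensitive, so compare in lower case.
        return !fromHost.toLowerCase(Locale.ROOT)
                .equals(toHost.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Different hosts, so the link would be ignored.
        System.out.println(isExternal(
                "http://www.bar.com/index.html",
                "http://foo.bar.com/page.html"));
        // Same host, so the link would be followed.
        System.out.println(isExternal(
                "http://www.bar.com/index.html",
                "http://www.bar.com/other.html"));
    }
}
```

Note that the whole host string is compared, so www.bar.com and
foo.bar.com count as different hosts even though they share a domain.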

--
Doğacan Güney
