Posted to user@nutch.apache.org by Steve Kallestad <ka...@gmail.com> on 2007/02/08 11:50:18 UTC

Nutch Link Detection

I've discovered that Nutch follows links that aren't necessarily links.

In my MediaWiki installation, there is some out-of-the-box
JavaScript that contains:

var wgArticlePath = "/wiki/$1";

Nutch actually tries to fetch /wiki/$1.  I've eliminated this
particular problem by adding -[$] to my crawl-urlfilter.txt file, but
I can't imagine that this is the only time this kind of problem will
pop up.  I'm wondering whether there is a way to ensure that all links
start with one of:
href="
href = "
href= "
href ="

I'm a little shy about trying to implement such a filter without any
advice.  Does anyone have any thoughts on how to build such a filter
into nutch?
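
For reference, the -[$] rule mentioned above can be pictured with a small stand-alone sketch. It assumes the usual reading of a Nutch regex URL filter file: each line is a + or - sign followed by a regular expression, the first rule whose expression is found in the URL decides, and a URL matching no rule is dropped. The class name UrlFilterSketch and the example.org URLs are made up for illustration; this is not Nutch's own filter class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a regex URL filter; not a Nutch class. */
public class UrlFilterSketch {

    /** One rule line: '+' accepts, '-' rejects URLs containing a match. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and lines starting with '#' are skipped. */
    public UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first rule found in the URL decides; no matching rule means reject. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "-[$]" rejects any URL containing a dollar sign; "+." accepts the rest.
        UrlFilterSketch filter = new UrlFilterSketch("-[$]", "+.");
        System.out.println(filter.accepts("http://example.org/wiki/$1"));        // false
        System.out.println(filter.accepts("http://example.org/wiki/Main_Page")); // true
    }
}
```

Inside a character class the dollar sign is a literal, which is why [$] works without a backslash. Rule order matters in this sketch: placing "+." first would accept everything before the exclusion is ever consulted.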

Right now, I'm just doing site search, which means this isn't that big
a problem.  But I'm concerned about building a wider-ranging
search index without a resolution to this problem.  I'd hate
for my spider to be fetching a bunch of unlinked 404s.

Also, does Nutch follow rel="nofollow" links out of the box?

I imagine that it respects robots.txt, but I thought I'd ask about
that one too, just to be safe - I'm a newbie after all :)

Re: Nutch Link Detection

Posted by Renaud Richardet <re...@apache.org>.
Hi Steve,

We are all learning :-), and your question was not annoying at all. Keep 
asking!

Cheers,
Renaud




Re: Nutch Link Detection

Posted by Steve Kallestad <ka...@gmail.com>.
Thanks.  I'll do a bit more research on the subject.  Believe it or
not, I've never heard of DOMContentUtils.getOutlinks, but I'm learning
a bit more and more every day :).

Step 1 - install it
Step 2 - ask a bunch of annoying newbie questions on the mailing list
Step 3 - RTFM
Step 4 - fix my installation so that it works to my needs
Step 5 - answer annoying newbie questions from other people and help
spread the word.

I'm still on step 2, but I'm getting there :)

Thanks,
Steve
http://www.stevekallestad.com/


Re: Nutch Link Detection

Posted by Renaud Richardet <re...@apache.org>.
Hi Steve,

Steve Kallestad wrote:
> I've discovered that nutch follows links that aren't necessarily links -
>
> in my MediaWiki implementation, there is some out-of-the-box
> javascript that contains:
>
> var wgArticlePath = "/wiki/$1";
>
> Nutch actually tries to fetch /wiki/$1.  I've eliminated this
> particular problem by adding -[$] to my crawl-urlfilter.txt file,
You might also want to disable the parse-js plugin if you don't need it
(in your nutch-site.xml)...
> but
> I can't imagine that this is the only time this kind of problem will
> pop up.  I'm wondering if there isn't a way to ensure that all links
> start with one of:
> href="
> href = "
> href= "
> href ="
>
> I'm a little shy about trying to implement such a filter without any
> advice.  Does anyone have any thoughts on how to build such a filter
> into nutch?
DOMContentUtils.getOutlinks already takes care of this: it reads links
from DOM element attributes rather than matching strings in the raw text.
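
To illustrate the difference between reading DOM attributes and matching strings, here is a minimal sketch. It is not the DOMContentUtils code: the class DomLinkSketch and its hrefs method are made up for this example, and it uses the JDK's XML parser, so it assumes well-formed XHTML, whereas a crawler has to cope with messy real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration of DOM-based link extraction; not Nutch code. */
public class DomLinkSketch {

    /** Returns the href value of every <a> element, in document order. */
    public static List<String> hrefs(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            List<String> links = new ArrayList<>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                // Only a real href attribute counts; text inside <script> is never inspected.
                if (anchor.hasAttribute("href")) {
                    links.add(anchor.getAttribute("href"));
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page =
                "<html><head><script>var wgArticlePath = \"/wiki/$1\";</script></head>"
              + "<body><a href=\"/wiki/Main_Page\">Main</a> "
              + "<a href = \"/wiki/Help\">Help</a></body></html>";
        System.out.println(hrefs(page)); // [/wiki/Main_Page, /wiki/Help]
    }
}
```

Because the parser normalises the markup, the spacing around the equals sign makes no difference, and the "/wiki/$1" string inside the script element is never treated as a link.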
>
> Right now, I'm just doing site-search which means this isn't that big
> a problem.  But I'm concerned about implementing a wider ranging
> search index without having a resolution to this problem - I'd hate
> for my spider to be grabbing a bunch of unlinked 404's.
>
> Also - does nutch follow rel="nofollow" links out of the box?
>
> I imagine that it respects robots.txt, but I thought I'd ask about
> that one too, just to be safe - I'm a newbie after all :)
DOMContentUtils.getOutlinks ignores rel="nofollow" links
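
The nofollow check can be sketched on a single DOM element. Again this is an illustration rather than the Nutch source: NoFollowSketch, isNoFollow and the anchor helper are hypothetical names. Splitting the rel value into whitespace-separated tokens is a choice made for this sketch; the thread does not say how Nutch compares the value.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical illustration of a rel="nofollow" check; not Nutch code. */
public class NoFollowSketch {

    /** True if the element's rel attribute contains the token "nofollow". */
    public static boolean isNoFollow(Element link) {
        // getAttribute returns an empty string when the attribute is absent.
        for (String token : link.getAttribute("rel").trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** Test helper: builds an <a> element; pass null for rel to omit the attribute. */
    public static Element anchor(String href, String rel) {
        try {
            Document doc =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element a = doc.createElement("a");
            a.setAttribute("href", href);
            if (rel != null) {
                a.setAttribute("rel", rel);
            }
            return a;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isNoFollow(anchor("/wiki/Talk", "nofollow"))); // true
        System.out.println(isNoFollow(anchor("/wiki/Main_Page", null)));  // false
    }
}
```

An extractor built this way would simply skip any link for which isNoFollow returns true before adding it to the outlink list.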

HTH,
Renaud

-- 
Renaud Richardet                                      +1 617 230 9112
my email is my first name at apache.org      http://www.oslutions.com