Posted to user@nutch.apache.org by "Raphael A. Bauer" <ra...@charite.de> on 2007/08/06 18:02:47 UTC

Relative Links Problem

hi nutch-list,

i am currently doing a

"nutch crawl urls -dir crawl -depth 10"

- pretty much what is described in the tutorial. and in fact everything 
works.

the only problem is that relative links - say <a href="../XYZ">
are not crawled and cannot be searched, which is quite a problem for me.

is there an option i am missing - or any suggestions on how i can fix 
this issue?

thanks!

ra

Re: Relative Links Problem IS ALSO +escape(document.referrer)+

Posted by "Raphael A. Bauer" <ra...@charite.de>.
Doğacan Güney wrote:
....
> Nutch should crawl relative links (actually I don't think that there
> is an option to disable it :). However, by default, nutch only stores
> the first 100 links from each page. So it is possible that that
> particular link is, say, the 105th link on that page. Try increasing the
> max outlinks per page option (or set it to -1).
this was exactly the problem! thanks for the solution doğacan (kai just 
gave the same answer thanks!). so easy...
ra


Re: Relative Links Problem IS ALSO +escape(document.referrer)+

Posted by Doğacan Güney <do...@gmail.com>.
On 8/9/07, Raphael A. Bauer <ra...@charite.de> wrote:
> Raphael A. Bauer wrote:
> > i am currently doing a
> >
> > "nutch crawl urls -dir crawl -depth 10"
> >
> > - pretty much what is described in the tutorial. and in fact everything
> > works.
> >
> > the only problem is that relative links - say <a href="../XYZ">
> > are not crawled and cannot be searched, which is quite a problem for me.
> >
> > is there an option i am missing - or any suggestions on how i can fix
> > this issue?
> hi,
>
> just to bring the question up again. i am still searching for a solution
> to my problem that the nutch crawl tool does not crawl relative links.
>
> it states:
> fetching http://url/+escape(document.referrer)+ and does not look
> into that html page any further.
>
> so - maybe my question is way too stupid (RTFM - arg.. i read it ;) ),
> or the solution is too simple to tell - in either case i really would
> appreciate any statement regarding my problem. is there a switch to
> enable this?  something i've missed?

Nutch should crawl relative links (actually I don't think that there
is an option to disable it :). However, by default, nutch only stores
the first 100 links from each page. So it is possible that that
particular link is, say, the 105th link on that page. Try increasing the
max outlinks per page option (or set it to -1).
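
A minimal sketch of that change, assuming the option meant here is the
db.max.outlinks.per.page property (default 100 in nutch-default.xml) and
that local overrides go into conf/nutch-site.xml. Both the property name
and the file location are assumptions and should be checked against the
nutch-default.xml shipped with your version:

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <!-- assumed name of the "max outlinks per page" option -->
    <name>db.max.outlinks.per.page</name>
    <!-- a negative value is meant to lift the limit; the default is 100 -->
    <value>-1</value>
  </property>
</configuration>
```

The pages will most likely have to be fetched and parsed again before the
additional links show up in the crawl.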

>
> there is no problem reimplementing the fetch code - but i don't want to
> write the code twice.
>
> thanks again!
>
> ra


-- 
Doğacan Güney

Re: Relative Links Problem IS ALSO +escape(document.referrer)+

Posted by "Raphael A. Bauer" <ra...@charite.de>.
Raphael A. Bauer wrote:
> i am currently doing a
> 
> "nutch crawl urls -dir crawl -depth 10"
> 
> - pretty much what is described in the tutorial. and in fact everything 
> works.
> 
> the only problem is that relative links - say <a href="../XYZ">
> are not crawled and cannot be searched, which is quite a problem for me.
> 
> is there an option i am missing - or any suggestions on how i can fix 
> this issue?
hi,

just to bring the question up again. i am still searching for a solution 
to my problem that the nutch crawl tool does not crawl relative links.

it states:
fetching http://url/+escape(document.referrer)+ and does not look 
into that html page any further.

so - maybe my question is way too stupid (RTFM - arg.. i read it ;) ), 
or the solution is too simple to tell - in either case i really would 
appreciate any statement regarding my problem. is there a switch to 
enable this?  something i've missed?

there is no problem reimplementing the fetch code - but i don't want to 
write the code twice.

thanks again!

ra