Posted to user@nutch.apache.org by al...@aim.com on 2009/02/28 09:50:37 UTC
urls with ? and & symbols
Hello,
I use nutch-0.9 and am trying to index URLs with ? and & symbols. I have commented out the line -[?*!@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter and conf/regex-urlfilter.txt files.
However, nutch still ignores these URLs.
Does anyone know how this can be fixed?
Thanks in advance.
A.
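For reference, the change described above would look roughly like this in conf/regex-urlfilter.txt (and similarly in the other filter files). This is a sketch based on the stock nutch-0.9 filter layout; the exact neighboring lines may differ in your copy:

```
# skip URLs containing certain characters, usually queries or anchors
# (commented out so that URLs with ? and & are no longer rejected)
# -[?*!@=]

# accept anything else
+.
```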
Re: urls with ? and & symbols
Posted by al...@aim.com.
I was trying to fetch one specific URL with a ? symbol and nutch was refusing to fetch it. But if I fetch the domain itself, nutch fetches links with the ? symbol as well. Now I have noticed that nutch did not fetch all files on this given domain. But if I point nutch at an unfetched file's URL, it fetches it. I used this command: bin/nutch crawl urls -dir crawl -depth 6. If I specify -topN 50, nutch does not fetch my files at all.
So, my question is: how do I make nutch fetch all files under a given domain?
Thanks.
A.
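The commands in question, with the role of -depth and -topN spelled out; the per-round behavior described in the comments is my understanding of the nutch-0.9 crawl tool, not something verified against this particular crawl:

```
# Each of the 6 rounds generates a fetch list from the URLs discovered so far.
bin/nutch crawl urls -dir crawl -depth 6

# -topN caps each round's fetch list at the N highest-scoring URLs, so a
# small value like 50 can mean many discovered pages are never fetched.
# Raising it (or omitting it) lets deeper pages through.
bin/nutch crawl urls -dir crawl -depth 6 -topN 1000
```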
Re: urls with ? and & symbols
Posted by al...@aim.com.
Hello,
I have one specific domain. I tested further and it looks like nutch fetches this domain's other links, but not the ones with ?. Nutch also fetches other domains' URLs with the ? symbol.
How can I tell whether robots.txt on this domain blocks these specific links from being fetched?
Thanks.
A.
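One way to answer the robots.txt question is simply to open http://yourdomain/robots.txt in a browser and look for Disallow rules that match the missing URLs. It can also be checked programmatically; here is a minimal sketch using Python's standard urllib.robotparser, where the domain and rules are made-up placeholders, not the poster's actual site:

```python
from urllib.robotparser import RobotFileParser

# Against a live site you would use:
#   rp.set_url("http://example.com/robots.txt"); rp.read()
# Here we parse a made-up robots.txt inline to show the check itself.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /search
""".splitlines())

print(rp.can_fetch("*", "http://example.com/search?q=x"))  # False: blocked
print(rp.can_fetch("*", "http://example.com/page?id=1"))   # True: allowed
```

If can_fetch returns False for the URLs nutch is skipping, robots.txt is the culprit rather than the URL filters.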
Re: urls with ? and & symbols
Posted by Bartosz Gadzimski <ba...@o2.pl>.
Hi,
If you commented out those lines it should be fine; that part is correct, so the problem is somewhere else.
You need to give us more information:
- does your nutch crawl and index "normal" URLs (without ? and &)?
- are you crawling domains that are NOT blocked in crawl-urlfilter?
- does robots.txt on this domain block your URLs?
- are you talking about one specific domain or many different ones?
Thanks,
Bartosz
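Bartosz's second checklist item can be illustrated with a small sketch of how first-match-wins URL filter rules behave; the rules and domain below are hypothetical stand-ins, not the poster's actual configuration:

```python
import re

# Rules are tried top to bottom; the first match wins
# ("-" rejects the URL, "+" accepts it).
rules = [
    # ("-", re.compile(r"[?*!@=]")),  # the line the poster commented out
    ("+", re.compile(r"^http://([a-z0-9]*\.)*example\.com/")),
    ("-", re.compile(r".")),          # reject everything else
]

def accepts(url):
    for sign, rx in rules:
        if rx.search(url):
            return sign == "+"
    return False

print(accepts("http://example.com/page?id=1&x=2"))  # True: domain allowed
print(accepts("http://other.org/page"))             # False: caught by "-."
```

If the domain-accept pattern does not cover the URLs in question, they are dropped before fetching regardless of the commented-out [?*!@=] line.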