Posted to user@nutch.apache.org by al...@aim.com on 2009/02/28 09:50:37 UTC

urls with ? and & symbols

Hello,

I use nutch-0.9 and am trying to index URLs with ? and & symbols. I have commented out the line -[?*!@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter.txt, and conf/regex-urlfilter.txt files.
However, nutch still ignores these URLs.
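
For reference, the entry being commented out looks roughly like this in a stock conf/regex-urlfilter.txt (crawl-urlfilter.txt carries the same rule):

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]          <-- commented out so query-string URLs pass through

    # accept anything else
    +.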

Does anyone know how this can be fixed?

Thanks in advance.
A.

Re: urls with ? and & symbols

Posted by al...@aim.com.
 


I was trying to fetch one specific URL with a ? symbol, and nutch was refusing to fetch it. But if I fetch the domain itself, nutch fetches links with the ? symbol as well. Now I have noticed that nutch did not fetch all files on the given domain, yet if I point nutch directly at an unfetched file's URL, it fetches it. I used this command: "bin/nutch crawl urls -dir crawl -depth 6". If I specify -topN 50, nutch does not fetch my files at all.
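
A note on the command above: -topN caps each generate/fetch round at the N top-scoring URLs, so with -topN 50 anything beyond the top 50 per round is postponed. A minimal sketch of a fuller crawl (the depth value is illustrative):

    # no -topN: each round may fetch every eligible URL;
    # a larger -depth lets the crawl follow links buried deeper in the site
    bin/nutch crawl urls -dir crawl -depth 10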

So my question is: how do I make nutch fetch all the files under a given domain?


Thanks.
A.


Re: urls with ? and & symbols

Posted by al...@aim.com.
Hello,

I have one specific domain. I tested further, and it looks like nutch fetches this domain's other links but not the ones with ?. Nutch also fetches URLs with the ? symbol on other domains.

 
How can I tell whether robots.txt on this domain blocks these specific links from being fetched?
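
One quick check, as a sketch (www.example.com stands in for the actual domain): fetch the file and look for a Disallow rule whose path is a prefix of the blocked links:

    wget -q -O - http://www.example.com/robots.txt

    # a section like this in the output would block every URL under
    # /search for all crawlers, nutch included:
    User-agent: *
    Disallow: /search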

Thanks.
A.



Re: urls with ? and & symbols

Posted by Bartosz Gadzimski <ba...@o2.pl>.
Hi,

If you commented out those lines, it should be fine. That part is correct,
so the problem is somewhere else.

You must give us more information, like:
- does your nutch crawl and index "normal" URLs (without ? and &)?
- are you crawling domains that are NOT blocked in crawl-urlfilter (see the snippet below)?
- does robots.txt on this domain block your URLs?
- are you talking about one specific domain or many different ones?
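
On the crawl-urlfilter point, the stock conf/crawl-urlfilter.txt restricts the crawl with a pair of rules like these (MY.DOMAIN.NAME is the file's own placeholder, to be replaced with the domain being crawled):

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # skip everything else
    -.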

Thanks,
Bartosz