Posted to user@nutch.apache.org by abhayd <aj...@hotmail.com> on 2011/08/17 15:01:41 UTC

nutch redirect treatment

Hi,
I have seen similar posts in this forum but am still not able to understand how
redirects are handled.

I am trying to crawl http://developer.att.com/developer/ . After a successful
crawl I dump the crawldb using readdb and see entries like the ones below. What
do they mean? Has Nutch crawled the redirected page, and is it in the index?

I tried the readseg command on all the segments under the crawl/segments
directory, but I could not find the URL
http://developer.att.com/developer/tier1page.jsp?passedItemId=100006&_requestid=35037

Here is my crawl/segments directory listing:
20110817001833  20110817002117  20110817003028  20110817003930  20110817004202
20110817001844  20110817002556  20110817003532  20110817004105

Any idea why the redirected page is not crawled?

http://developer.att.com/developer/     Version: 7
Status: 4 (db_redir_temp)
Fetch time: Fri Sep 16 00:18:36 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
http://developer.att.com/developer/tier1page.jsp?passedItemId=100006&_requestid=35037

http://developer.att.com/developer/100006       Version: 7
Status: 5 (db_redir_perm)
Fetch time: Fri Sep 16 00:43:33 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata: _pst_: moved(12), lastModified=0:
http://developer.att.com/developer/forward.jsp?passedItemId=100006
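
For readers of such a dump: a record with a db_redir_* status describes the URL that answered with a redirect, and the redirect target appears only in that record's metadata. Below is a minimal Python sketch for pulling both out of one record. It assumes the text layout shown above, including the mail-wrapped Metadata line. parse_record is a hypothetical helper for illustration, not part of Nutch.

```python
import re

# One record copied from the readdb dump above, wrapped as in the mail.
SAMPLE = """\
http://developer.att.com/developer/     Version: 7
Status: 4 (db_redir_temp)
Fetch time: Fri Sep 16 00:18:36 CDT 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
http://developer.att.com/developer/tier1page.jsp?passedItemId=100006&_requestid=35037
"""


def parse_record(text):
    """Return (source_url, status_name, redirect_target) for one dump record.

    Hypothetical helper: it relies only on the layout visible in this thread.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    source_url = lines[0].split()[0]  # first token of the header line
    status = None
    target = None
    for i, ln in enumerate(lines):
        m = re.match(r"Status: \d+ \((\w+)\)", ln)
        if m:
            status = m.group(1)
        if ln.startswith("Metadata:"):
            # The target may sit on the Metadata line or on the wrapped next line.
            found = re.search(r"https?://\S+", " ".join(lines[i:]))
            target = found.group(0) if found else None
    return source_url, status, target


source_url, status, target = parse_record(SAMPLE)
print(source_url, status, target)
```

The record says nothing about whether the target was fetched. That has to be checked under the target URL's own crawldb entry.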



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-redirect-treatment-tp3261546p3261546.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch redirect treatment

Posted by Dinçer Kavraal <dk...@gmail.com>.
I once had a similar issue and tracked it down with an HTTP sniffer.

I suggest you check the HTTP headers of these transfers, from start to
end, and share them with us; I have no tools at hand at the moment.

In my case, this is what happened:
1. load Page A
2. Page A redirects to Page B
3. Page B sets a cookie and redirects back to Page A
4. Page A is left unfetched, because its URL is already on the list.

Check whether you are in the same situation. If so, change your page control
settings so that seeing the same URL again does not mean it was already crawled
(see values in nutch-default.xml such as db.signature.class and
db.fetch.schedule.class).
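
The four steps above can be modelled with a toy Python sketch. It is not Nutch's code: follow_redirects, the redirects mapping and max_hops are hypothetical names, and the "already on the list" check is reduced to a plain set of seen URLs.

```python
def follow_redirects(start, redirects, max_hops=5):
    """Walk a redirect chain and report why it stopped.

    redirects maps a URL to the URL it redirects to; URLs missing from
    the mapping are treated as pages that return content.
    """
    seen = {start}          # URLs already "on the list"
    url = start
    hops = []
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None:
            return hops, "fetched"      # reached a page with content
        hops.append(target)
        if target in seen:
            return hops, "skipped"      # target is already known, so it is dropped
        seen.add(target)
        url = target
    return hops, "gave_up"              # chain longer than max_hops


# Page A redirects to Page B, which sets a cookie and redirects back to A.
loop = {"A": "B", "B": "A"}
print(follow_redirects("A", loop))      # (['B', 'A'], 'skipped')

# A plain A -> B redirect ends on a fetchable page.
print(follow_redirects("A", {"A": "B"}))  # (['B'], 'fetched')
```

In the looping case no page content is ever retrieved, which matches the "unfetched" outcome in step 4.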

Best,
Dincer



Re: nutch redirect treatment

Posted by al...@aim.com.
As far as I understand, redirected URLs are scored 0, which is why the fetcher does not pick them up at the earlier depths. They may be crawled starting from depth 4, depending on the size of the seed list.

 

 


Re: nutch redirect treatment

Posted by abhayd <aj...@hotmail.com>.
Thanks for the response.

But my issue is that after the redirect the new URL is not being crawled. It is
not a scoring issue.


Re: nutch redirect treatment

Posted by al...@aim.com.
https://issues.apache.org/jira/browse/NUTCH-1044

RE: nutch redirect treatment

Posted by abhayd <aj...@hotmail.com>.

That was it... I did not modify that.

Thanks!!!
Date: Thu, 18 Aug 2011 08:06:44 -0700
Subject: Re: nutch redirect treatment

> Did you modify the URL filtering rules to allow URLs with ? & etc...? By
> default such URLs will be filtered out

Re: nutch redirect treatment

Posted by Julien Nioche <li...@gmail.com>.
Did you modify the URL filtering rules to allow URLs with ?, & etc.? By
default such URLs are filtered out.
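
To illustrate the point, here is a minimal Python sketch of first-match URL filtering. It assumes the default conf/regex-urlfilter.txt contains a rule like -[?*!@=] followed by a catch-all +. rule; those two rules are quoted from memory, so check your own copy of the file. The accepts function is a hypothetical stand-in, not Nutch's filter implementation.

```python
import re

# Rules modelled on the default conf/regex-urlfilter.txt (assumed, from memory).
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),  # skip URLs that look like queries
    ("+", r"."),        # accept anything else
]

# The same rules with the query-character rule removed.
RELAXED_RULES = [
    ("+", r"."),
]


def accepts(url, rules):
    """First matching rule decides; a URL matching no rule is rejected."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


target = "http://developer.att.com/developer/forward.jsp?passedItemId=100006"
print(accepts("http://developer.att.com/developer/", DEFAULT_RULES))  # True
print(accepts(target, DEFAULT_RULES))                                  # False
print(accepts(target, RELAXED_RULES))                                  # True
```

Under these assumed rules the seed URL passes while both redirect targets from the dump are dropped, because each contains ? and =.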



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com