Posted to user@nutch.apache.org by devang pandey <de...@gmail.com> on 2013/07/12 10:48:09 UTC

nutch redirection behaviour issue

Hello,

I am using Nutch 1.4 to crawl a URL. After crawling, the content of the segment
is:
Status: 1 (db_unfetched)
Fetch time: Fri Jul 12 13:43:43 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1373616835706

Content::
Version: -1
url: http://farmer.gov.in/COLD_STROAGE_Link.aspx
base: http://farmer.gov.in/COLD_STROAGE_Link.aspx
contentType: text/html
metadata: X-AspNet-Version=4.0.30319 Date=Fri, 12 Jul 2013 08:19:30 GMT
Content-Length=170 nutch.crawl.score=1.0
Location=/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx _fst_=35
nutch.segment.name=20130712134358 Content-Type=text/html; charset=utf-8
Connection=close Server=Microsoft-IIS/7.5
X-Powered-By=ASP.NET Cache-Control=private
Content:
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a
href="/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx">here</a>.</h2>
</body></html>

CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Fri Jul 12 13:44:03 IST 2013
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1373616835706_pst_: temp_moved(13), lastModified=0:
http://farmer.gov.in/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx

The thing is, my seed URL is redirected to a different URL. The problem is that
the content of the redirected URL is not fetched by Nutch. I have changed
redirect.max to 5, but the content is still not fetched. Please help.

Re: nutch redirection behaviour issue

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

if you are able to extract content via parsechecker
you should be able to crawl the content.

For all _3_ URLs in the redirect chain

1. check whether they pass URL filters and normalizers

2. check whether "http.redirect.max" is set appropriately

3. run crawl. Ideally, set the URL to be checked as seed URL
   and choose small values for depth and topN. That makes
   analysis simpler. If "http.redirect.max" >= 3 you can even
   set depth and topN to 1.

4. check your logs for all _3_ URLs. You should see "fetching ..."
   3 times (3 URLs)

5. then check the CrawlDb for all URLs
   % bin/nutch readdb .../crawldb -url URL

6. check content of segment(s) for all URLs

Sorry, there is no tool which does all the steps automatically.
You have to do it by hand.
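
As an illustration of step 1 only, here is a minimal Python sketch that runs the three URLs of the redirect chain through rules written in regex-urlfilter.txt syntax. The rules below are illustrative and not necessarily what your conf/regex-urlfilter.txt contains, and the helper names (load_rules, accepts) are made up for this sketch. It assumes that the first matching rule decides and that a URL matching no rule is rejected. Nutch uses Java regular expressions, so Python's re module is only an approximation for simple patterns.

```python
import re

# Illustrative rules in regex-urlfilter.txt syntax (assumption: replace
# them with the contents of your own conf/regex-urlfilter.txt).
RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept anything else
+.
"""

# The three URLs of the redirect chain seen in this thread.
CHAIN = [
    "http://farmer.gov.in/COLD_STROAGE_Link.aspx",
    "http://farmer.gov.in/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx",
    "http://farmer.gov.in/(S(knqavhgccae51czzxdb100zr))/COLD_STROAGE_Link.aspx",
]


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = load_rules(RULES)
for url in CHAIN:
    print("+" if accepts(rules, url) else "-", url)
```

A "-" in front of any of the three URLs would mean that URL is dropped by the filter, so the chain cannot be followed to its end. URL normalizers are not covered by this sketch.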

Good luck,
Sebastian

On 07/15/2013 06:39 AM, devang pandey wrote:
> Hello Sebastian, thank you for your response. The thing is that my task is
> to crawl this URL. Using the parsechecker command I am able to see the content
> of the page, but I am not able to crawl it. Please help me with the crawling
> aspect as well.
> 
> 


Re: nutch redirection behaviour issue

Posted by devang pandey <de...@gmail.com>.
Hello Sebastian, thank you for your response. The thing is that my task is
to crawl this URL. Using the parsechecker command I am able to see the content
of the page, but I am not able to crawl it. Please help me with the crawling
aspect as well.


Re: nutch redirection behaviour issue

Posted by Sebastian Nagel <wa...@googlemail.com>.
> The thing is, my seed URL is redirected to a different URL.
Yes, it is.

> The problem is that the content of the redirected URL is not fetched by
> Nutch. I have changed redirect.max to 5, but the content is still not
> fetched.
I assume you set "http.redirect.max" to 5.
If the property is spelled correctly, you will find the content
under the final target of a redirect chain (up to five hops).
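
To check the spelling, a small Python sketch like the following can read a Hadoop-style configuration file such as conf/nutch-site.xml and report whether "http.redirect.max" is really set. The inline SAMPLE document and the helper names (read_properties, check_redirect_max) are assumptions made for this illustration; in practice you would pass the text of your own nutch-site.xml.

```python
import xml.etree.ElementTree as ET

EXPECTED = "http.redirect.max"

# Hypothetical stand-in for the text of conf/nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


def check_redirect_max(props):
    """Report the configured value, or point at a likely misspelling."""
    if EXPECTED in props:
        return f"{EXPECTED} = {props[EXPECTED]}"
    # Crude look-alike test: any other name mentioning both "red" and "max".
    lookalikes = sorted(n for n in props if "red" in n and "max" in n)
    if lookalikes:
        return f"{EXPECTED} not set; possible misspelling: {', '.join(lookalikes)}"
    return f"{EXPECTED} not set; the default applies"


print(check_redirect_max(read_properties(SAMPLE)))
```

A property named "rediect.max" or "redirect.max" would be reported as a possible misspelling, because only the exact name "http.redirect.max" is looked up here.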

You can also follow the redirect chain manually:

% bin/nutch parsechecker 'http://farmer.gov.in/COLD_STROAGE_Link.aspx'
fetching: http://farmer.gov.in/COLD_STROAGE_Link.aspx
Fetch failed with protocol status: temp_moved(13), lastModified=0:
http://farmer.gov.in/(S(frbcuppdu1rmeu30yisifam5))/COLD_STROAGE_Link.aspx

% bin/nutch parsechecker 'http://farmer.gov.in/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx'
fetching: http://farmer.gov.in/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx
Fetch failed with protocol status: temp_moved(13), lastModified=0:
http://farmer.gov.in/(S(knqavhgccae51czzxdb100zr))/COLD_STROAGE_Link.aspx

% bin/nutch parsechecker 'http://farmer.gov.in/(S(knqavhgccae51czzxdb100zr))/COLD_STROAGE_Link.aspx'
fetching: http://farmer.gov.in/(S(knqavhgccae51czzxdb100zr))/COLD_STROAGE_Link.aspx
...
---------
ParseText
---------

State wise list of Cold-Storage ANDHRA PRADESH Warehouse Project Descriptio
...

Now you get content.
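
The manual walk above can be modelled in a few lines of Python. This is a simplified sketch, not Nutch's fetcher code: the REDIRECTS table is a stand-in for the server (the real server hands out a fresh session id on each request, as the first parsechecker run shows), the function name follow is made up, and the exact boundary at which Nutch stops following redirects may differ from the hop counting used here. The relative Location values are resolved against the current URL, as in the segment dump.

```python
from urllib.parse import urljoin

# Stand-in for the server: URL -> Location header value, or None when the
# page is served directly. Session ids are the ones quoted in this thread.
REDIRECTS = {
    "http://farmer.gov.in/COLD_STROAGE_Link.aspx":
        "/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx",
    "http://farmer.gov.in/(S(ulnaaubb1l0bku22vik2tzjt))/COLD_STROAGE_Link.aspx":
        "/(S(knqavhgccae51czzxdb100zr))/COLD_STROAGE_Link.aspx",
    "http://farmer.gov.in/(S(knqavhgccae51czzxdb100zr))/COLD_STROAGE_Link.aspx":
        None,
}


def follow(url, redirect_max):
    """Follow redirects for at most redirect_max hops.

    Returns (urls_fetched, final_url); final_url is None when the hop
    limit is reached before any page with content is found.
    """
    fetched = [url]
    hops = 0
    while True:
        location = REDIRECTS[url]
        if location is None:
            return fetched, url       # content found at this URL
        if hops >= redirect_max:
            return fetched, None      # limit reached, target left unfetched
        url = urljoin(url, location)  # resolve the relative Location
        fetched.append(url)
        hops += 1


seed = "http://farmer.gov.in/COLD_STROAGE_Link.aspx"
for limit in (0, 5):
    fetched, final = follow(seed, limit)
    print(limit, len(fetched), final)
```

In this model a limit of 0 stops at the seed URL with only the redirect recorded, which resembles the fetch_redir_temp status in the original segment dump, while a limit of 5 reaches the third URL, where the content is.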

Finally, don't forget to check your URL filters.

Cheers,
Sebastian


