Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2018/03/09 19:39:19 UTC

dealing with redirects from http to https

I am having a problem crawling some sites that seem to be transitioning to https. All of their links contain http URLs, and the fetcher gets response code 301 with content that says "the document has moved", because the actual content is accessible only via https. This has been happening for a few days with my news crawler.
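
For reference, the redirect is visible with a plain header request (a sketch; the URL is a stand-in for one of the affected pages):

# -I requests headers only; a transitioning site answers with 301 and a
# Location header pointing at the https version of the same page
curl -sI http://m.sfgate.com/some/article.php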

What is the best way to handle this, in general? I am thinking of specifying http.redirect.max=1 (rather than the default 0) in nutch-site.xml, because I want it to fetch these pages right away rather than waiting until the next cycle.
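
In nutch-site.xml that would look like this (a sketch; the description text is my paraphrase):

<property>
  <name>http.redirect.max</name>
  <value>1</value>
  <description>Maximum number of redirects the fetcher follows at fetch
  time. With the default of 0, redirect targets are only recorded for a
  later fetch cycle.</description>
</property>
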
I think I want the redirection target to get stored in the crawldb, but I don't know how to achieve that. In fact, I thought that would be the default behavior, and I am surprised to see that it is not.

Are there any other settings I should change, and is there any drawback to using http.redirect.max for this purpose?


Re: dealing with redirects from http to https

Posted by Sebastian Nagel <wa...@googlemail.com>.
> Another problem is that they have fetch_time well into the future,
> I guess because retry_interval is applied.

Correct. Fetch time means two different things:
- for a CrawlDatum in the CrawlDb, it is the time when the URL is due to be fetched next
- for an entry in a segment's crawl_fetch folder, it is the time when the fetch actually happened
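
Both are easy to check with the reader tools (a sketch; paths and URL are placeholders):

# CrawlDb entry: "Fetch time" here is the next scheduled fetch
bin/nutch readdb crawl/crawldb -url https://m.sfgate.com/some/article.php

# crawl_fetch entry: here the fetch time is when the fetch happened
bin/nutch readseg -get crawl/segments/20180309213000 https://m.sfgate.com/some/article.php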



Re: dealing with redirects from http to https

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Thanks for the suggestion. On closer inspection, I see that redirection targets do show up in the crawldb.
One problem is that the target URLs all have scores equal to zero, because no other pages point to them. Another problem is that they have fetch_time well into the future, I guess because retry_interval is applied.
Interestingly, the target URLs do sometimes show up in a segment. When I dump the segment after attempted fetching, they show responseCode 301 (even for the redirection targets), nutchStatus 67, and empty content. I imagine this is just the fetcher recording that it noticed the redirection, and this is how it communicates with updatedb.
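
For the record, the dump itself (paths are placeholders for my own layout):

# dump one segment, including content, to a local directory
bin/nutch readseg -dump crawl/segments/20180309213000 seg_dump
less seg_dump/dump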
Here are some example URLs (the http and https versions are the same, except for the "s"):
https://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php
http://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php

https://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php
http://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php

https://m.sfgate.com/business/article/Tesla-s-enormous-battery-in-Australia-just-weeks-12455377.php
http://m.sfgate.com/business/article/Tesla-s-enormous-battery-in-Australia-just-weeks-12455377.php

In case anybody wants to replicate this, here are the key parts of my regex-urlfilter.txt:
# reject certain sfgate urls
-blog\.sfgate\.com
-findnsave\.sfgate\.com
-homeguides\.sfgate\.com
-healthyeating\.sfgate\.com
-cars\.sfgate\.com
-marketing\.sfgate\.com
-insidescoopsf\.sfgate\.com
-reviews\.sfgate\.com
-stats\.sfgate\.com
-video\.sfgate\.com

# accept other mobile sfgate urls
+/m\.sfgate\.com
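
Note that the accept rule is unanchored, so it already matches both schemes ("/m\.sfgate\.com" occurs in http://m.sfgate.com/... and https://m.sfgate.com/... alike). An anchored equivalent (a sketch, not what I actually run) would be:

# accept mobile sfgate urls on either scheme
+^https?://m\.sfgate\.com/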



Re: dealing with redirects from http to https

Posted by Sebastian Nagel <wa...@googlemail.com>.
> What is the best way to handle this, in general? I am thinking of specifying http.redirect.max=1
> (rather than the default 0) in nutch-site.xml because I want it to fetch these pages right away,
> rather than waiting until the next cycle.

Of course, you can do this. But keep in mind: if both the http and the https URLs are in the
CrawlDb, this may lead to duplicates. Fetcher redirect targets are not checked against the CrawlDb.
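
If duplicates do end up in the CrawlDb, the dedup job can mark the lower-scoring copy later on, once both URLs have been fetched and carry identical signatures (a sketch; the path is a placeholder):

bin/nutch dedup crawl/crawldb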

> I think I want the redirection target to get stored in the crawldb

That's done by the updatedb command, independent of the value of http.redirect.max.
Is there a URL filter that might cause the redirect targets to be filtered out?
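
A quick way to test that is to pipe the redirect targets through the filter chain (a sketch; here via the checker class directly, since the command alias differs between Nutch versions):

echo "https://m.sfgate.com/some/article.php" \
  | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# prints the URL prefixed with '+' if accepted, '-' if filtered out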
