Posted to user@nutch.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2015/06/03 04:13:56 UTC

Re: Can Nutch crawl shortened URLs?

Since you're using Nutch 1.9, you should check [1]: there is a bug with the http.redirect.max setting (fixed in 1.10). A workaround is to set http.redirect.max = 0 and follow the redirect in the next cycle. This advice doesn't take into account that you're dealing with shortened URLs, but they should behave like any normal redirect.
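
To make the two modes concrete, here is a small self-contained Java model of the behaviour as I understand it from the http.redirect.max description: at 0 (or below) the redirect target is only recorded for a later cycle, while a positive value follows up to that many hops within the same fetch. This is a sketch, not Nutch code: RedirectModel, Result, fetch() and sampleChain() are hypothetical names, the in-memory table stands in for real HTTP responses, and the .example URLs are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are hypothetical names, not Nutch classes,
// and the redirect table replaces real HTTP requests.
public class RedirectModel {

    /** Outcome of one simulated fetch. */
    public static final class Result {
        public final String fetched;         // URL whose content was fetched, or null
        public final List<String> deferred;  // redirect targets left for a later cycle

        Result(String fetched, List<String> deferred) {
            this.fetched = fetched;
            this.deferred = deferred;
        }
    }

    /** Simulates fetching url, where redirects maps a URL to its redirect target. */
    public static Result fetch(String url, Map<String, String> redirects, int redirectMax) {
        String current = url;
        int hops = 0;
        while (redirects.containsKey(current)) {
            String target = redirects.get(current);
            if (redirectMax <= 0) {
                // Do not follow now; record the target so a later cycle can fetch it.
                List<String> deferred = new ArrayList<>();
                deferred.add(target);
                return new Result(null, deferred);
            }
            if (hops >= redirectMax) {
                // Hop limit reached before any content was found: give up on this URL.
                return new Result(null, new ArrayList<>());
            }
            current = target;
            hops++;
        }
        return new Result(current, new ArrayList<>());
    }

    /** Placeholder chain with two hops, like a shortener pointing at another shortener. */
    public static Map<String, String> sampleChain() {
        Map<String, String> redirects = new LinkedHashMap<>();
        redirects.put("http://short.example/a", "http://shorter.example/b");
        redirects.put("http://shorter.example/b", "http://site.example/article");
        return redirects;
    }

    public static void main(String[] args) {
        Map<String, String> chain = sampleChain();
        Result followed = fetch("http://short.example/a", chain, 5);
        System.out.println("redirectMax=5 fetched: " + followed.fetched);
        Result recorded = fetch("http://short.example/a", chain, 0);
        System.out.println("redirectMax=0 deferred: " + recorded.deferred);
    }
}
```

In this model a two-hop chain needs three cycles when the limit is 0, one per hop plus one for the final page, which is why the workaround costs extra crawl rounds.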

Regards,

[1] https://issues.apache.org/jira/browse/NUTCH-1939

----- Original Message -----
From: "Ankit Goel" <an...@gmail.com>
To: user@nutch.apache.org
Sent: Tuesday, June 2, 2015 9:59:40 PM
Subject: [MASSMAIL]Can Nutch crawling shortened url?

Hi,
I was playing around with Nutch 1.9 when I came across some Twitter t.co
links. When I ran one through parsechecker, the fetch failed with protocol
status moved(12). I have set http.redirect.max to 5 (and experimented
with 10), which works for other links, but the t.co redirect was not
followed. I did get the forwarding link, though. For example:

bin/nutch parsechecker http://t.co/FcpZhY9FrL

Fetch failed with protocol status: moved(12), lastModified=0:
http://timesofindia.indiatimes.com/city/nashik/1st-seaplane-service-from-Ozar-to-Pune-begins-from-June-15/articleshow/47522116.cms?utm_source=twitter.com&utm_medium=referral&utm_campaign=timesofindia

Running the forwarding link separately works fine. I've tried this with a
bitly link that forwarded twice, first to goo.gl and then to the final
site, but each time I had to crawl the forwarding link in a separate command.

My regex filter has these rules to allow t.co:

+^http://t.co

+^http://t.co/[a-z0-9]*
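
For anyone checking rules like these, the following Java sketch shows what the two patterns accept. It assumes the filter tests each rule with an unanchored search (Matcher.find()) on java.util.regex patterns, which is how I understand the regex URL filter to work; the class UrlRuleCheck, the STRICT pattern and the t.company.example host are made up for illustration and are not part of the original configuration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying out the rules; not part of Nutch.
public class UrlRuleCheck {
    // The two rules from the message, without the leading '+' that marks an accept rule.
    public static final Pattern RULE_A = Pattern.compile("^http://t.co");
    public static final Pattern RULE_B = Pattern.compile("^http://t.co/[a-z0-9]*");
    // Assumed tighter variant: escaped dot, both letter cases, http or https, anchored end.
    public static final Pattern STRICT = Pattern.compile("^https?://t\\.co/[A-Za-z0-9]+$");

    /** True if the rule is found anywhere in the URL (unanchored search, an assumption). */
    public static boolean matches(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://t.co/FcpZhY9FrL",          // the link from the message
            "https://t.co/FcpZhY9FrL",         // same link over https
            "http://t.company.example/page"    // unrelated host sharing the prefix
        };
        for (String url : urls) {
            System.out.println(url
                + " A=" + matches(RULE_A, url)
                + " B=" + matches(RULE_B, url)
                + " STRICT=" + matches(STRICT, url));
        }
    }
}
```

Under that assumption both original rules accept the t.co link, but the second one adds nothing: `[a-z0-9]*` may match zero characters, so the uppercase letters in the path are never checked. The unescaped dot and the missing path separator in the first rule also let unrelated hosts that start with the same characters through, and neither rule covers https.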

Is there a way to crawl shortened URLs seamlessly in Nutch?

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Can Nutch crawl shortened URLs?

Posted by Ankit Goel <an...@gmail.com>.
I tried it with 1.10, but the shortened URLs still don't get followed
through. I think there's another issue here, maybe with how shortened URLs
work. Not to mention that the crawl command for 1.10 is different from
previous versions and completely undocumented, which became another task in
itself. The tutorial and wiki only talk about the previous bin/crawl
command structure. Thanks though.

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel