Posted to user@nutch.apache.org by xuyuanme <xu...@gmail.com> on 2012/02/23 05:08:27 UTC

Re: http.redirect.max

Thanks for the information. But I found that the wiki page
http://wiki.apache.org/nutch/RedirectHandling still doesn't have much
content about Nutch redirects.

I found that even with http.redirect.max=2 and
db.ignore.external.links=false, the crawler still can't fetch redirected
pages. After some further digging, I found that the lib-http plugin (in
Nutch 1.1) contains the following code:

Java file: org.apache.nutch.protocol.http.api.HttpBase

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ......
    // the followRedirects argument is hard-coded to false here
    response = getResponse(u, datum, false); // make a request
    ......
  }

  protected abstract Response getResponse(URL url,
                                          CrawlDatum datum,
                                          boolean followRedirects)
    throws ProtocolException, IOException;

After I changed the call to getResponse(u, datum, true) and recompiled
the plugin, the crawler fetched redirected pages as expected.

So is this a bug in the lib-http library, or have I misunderstood how
redirects work?
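
To make the question concrete, here is a minimal Java sketch of how I
would expect http.redirect.max to behave if it is the fetcher, rather
than the protocol plugin, that follows redirects. This is a toy model
built on my own assumptions, not Nutch's code: the class name, the
redirect table and the example URLs are all hypothetical, and deferring
the target once the limit is used up is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of fetcher-side redirect handling; hypothetical, not Nutch code. */
final class RedirectSketch {

    /** What one fetch attempt did: URLs requested now, plus any postponed URL. */
    static final class Outcome {
        final List<String> fetchedNow = new ArrayList<>();
        String deferred; // stays null when nothing was postponed
    }

    /**
     * @param url         URL to fetch
     * @param redirects   hypothetical redirect table: source URL -> target URL
     * @param maxRedirect stand-in for the http.redirect.max setting
     */
    static Outcome fetch(String url, Map<String, String> redirects, int maxRedirect) {
        Outcome out = new Outcome();
        String current = url;
        int hops = 0;
        while (true) {
            // One request; the protocol layer itself does not follow redirects.
            out.fetchedNow.add(current);
            String target = redirects.get(current);
            if (target == null) {
                return out; // not a redirect: this is the final page
            }
            if (maxRedirect <= 0 || hops >= maxRedirect) {
                out.deferred = target; // assumption: recorded for a later round
                return out;
            }
            hops++;
            current = target; // follow the redirect within the same fetch
        }
    }

    public static void main(String[] args) {
        Map<String, String> redirects =
                Map.of("http://a.example/", "http://b.example/home");
        Outcome immediate = fetch("http://a.example/", redirects, 2);
        Outcome postponed = fetch("http://a.example/", redirects, 0);
        System.out.println(immediate.fetchedNow + " deferred=" + immediate.deferred);
        System.out.println(postponed.fetchedNow + " deferred=" + postponed.deferred);
    }
}
```

If this model is roughly right, a hard-coded false in getResponse() would
not by itself mean the setting is ignored, since the redirect could still
be followed one level up.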

Thanks!

lewis john mcgibbney wrote
> 
> Hi Rafael,
> 
> The page we are talking about will be added on the link below.
> 
> http://wiki.apache.org/nutch/InternalDocumentation
> 
> and will be available here
> 
> http://wiki.apache.org/nutch/RedirectHandling
> 
> 


--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alex,

Can you please have a look at NUTCH-1042?

Might it be the case that your redirect possibly has a crawl-delay which
then falls into the boundary case we witness in the issue above?

You may want to change your log properties to debug for a while and run
some small crawls on your problem URLs. Maybe try adding some LOG.debug
statements to see which conditions are being satisfied around the
fetcher areas mentioned in NUTCH-1042.

hth

On Thu, Mar 1, 2012 at 8:09 PM, <al...@aim.com> wrote:

>
>  Hello,
>
> I tried 1, 2, -1 for the config http.redirect.max, but nutch still
> postpones redirected urls to later depths.
> What is the correct config  setting to have nutch crawl redirected urls
> immediately. I need it because I have restriction on depth be at most 2.
>
> Thanks.
> Alex.
>
>
>
>
>
> -----Original Message-----
> From: xuyuanme <xu...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Feb 24, 2012 1:31 am
> Subject: Re: http.redirect.max
>
>
> The config file is used for some proof of concept testing so the content
> might be confusing, please ignore some incorrect part.
>
> Yes from my end I can see the crawl for website http://www.scotland.gov.uk
> is redirected as expected.
>
> However the website I tried to crawl is a bit more tricky.
>
> Here's what I want to do:
>
> 1. Set
>
> http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
> as the seed page
>
> 2. And try to crawl one of the link
> (
> http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT
> )
> as a test
>
> If you click the link, you'll find the website use redirect and cookie to
> control page navigation. So I used protocol-httpclient plugin instead of
> protocol-http to handle the cookie.
>
> However, the redirect does not happen as expected. The only way I can fetch
> second link is to manually change "response = getResponse(u, datum,
> *false*)" call to "response = getResponse(u, datum, *true*)" in
> org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
> lib-http plugin.
>
> So my issue is related to this specific site
>
> http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
>
>
> lewis john mcgibbney wrote
> >
> > I've checked working with redirects and everything seems to work fine for
> > me.
> >
> > The site I checked on
> >
> > http://www.scotland.gov.uk
> >
> > temp redirect to
> >
> > http://home.scotland.gov.uk/home
> >
> > Nutch gets this fine when I do some tweaking with nutch-site.xml
> >
> > redirects property -1 (just to demonstrate, I would usually not set it
> so)
> >
> > Lewis
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>


-- 
*Lewis*

Re: http.redirect.max

Posted by al...@aim.com.
 Hello,

I tried 1, 2, and -1 for http.redirect.max, but Nutch still postpones redirected URLs to later depths.
What is the correct setting to have Nutch crawl redirected URLs immediately? I need this because my crawl depth is restricted to at most 2.
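
To show why this matters with a depth limit, here is a back-of-the-envelope Java sketch of how I am counting rounds. It assumes that each depth level is one generate/fetch round and that a postponed redirect costs one extra round; the class and method names are made up for illustration and none of this is Nutch code.

```java
/** Back-of-the-envelope model of crawl rounds; hypothetical, not Nutch code. */
final class DepthSketch {

    /**
     * @param linkHops     links followed from the seed (0 means the seed itself)
     * @param redirectHops redirects met on the way to the final page
     * @param followNow    true if redirects are followed within the same fetch
     * @return the 1-based round in which the final page gets fetched
     */
    static int roundsNeeded(int linkHops, int redirectHops, boolean followNow) {
        int rounds = 1 + linkHops;  // the seed is fetched in round 1
        if (!followNow) {
            rounds += redirectHops; // assumption: each postponed redirect waits a round
        }
        return rounds;
    }

    /** True if the final page is fetched within the given depth limit. */
    static boolean reachable(int depthLimit, int linkHops, int redirectHops, boolean followNow) {
        return roundsNeeded(linkHops, redirectHops, followNow) <= depthLimit;
    }

    public static void main(String[] args) {
        // A page one link away from the seed, behind one redirect, depth limit 2.
        System.out.println(reachable(2, 1, 1, true));  // true: fetched in round 2
        System.out.println(reachable(2, 1, 1, false)); // false: would need round 3
    }
}
```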

Thanks.
Alex.

 

 

-----Original Message-----
From: xuyuanme <xu...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max


The config file is used for some proof of concept testing so the content
might be confusing, please ignore some incorrect part.

Yes from my end I can see the crawl for website http://www.scotland.gov.uk
is redirected as expected.

However the website I tried to crawl is a bit more tricky.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
as the seed page

2. And try to crawl one of the link
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT)
as a test

If you click the link, you'll find the website use redirect and cookie to
control page navigation. So I used protocol-httpclient plugin instead of
protocol-http to handle the cookie.

However, the redirect does not happen as expected. The only way I can fetch
second link is to manually change "response = getResponse(u, datum,
*false*)" call to "response = getResponse(u, datum, *true*)" in
org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B


lewis john mcgibbney wrote
> 
> I've checked working with redirects and everything seems to work fine for
> me.
> 
> The site I checked on
> 
> http://www.scotland.gov.uk
> 
> temp redirect to
> 
> http://home.scotland.gov.uk/home
> 
> Nutch gets this fine when I do some tweaking with nutch-site.xml
> 
> redirects property -1 (just to demonstrate, I would usually not set it so)
> 
> Lewis
> 

--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 

Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
The config file is used for some proof-of-concept testing, so its
contents might be confusing; please ignore any incorrect parts.

Yes, on my end I can see that the crawl of http://www.scotland.gov.uk
is redirected as expected.

However, the website I am trying to crawl is a bit trickier.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
as the seed page

2. Try to crawl one of its links
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT)
as a test

If you click the link, you'll find that the website uses redirects and
cookies to control page navigation. So I used the protocol-httpclient
plugin instead of protocol-http to handle the cookies.

However, the redirect does not happen as expected. The only way I can fetch
the second link is to manually change the call "response = getResponse(u,
datum, false)" to "response = getResponse(u, datum, true)" in
org.apache.nutch.protocol.http.api.HttpBase.java and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B


lewis john mcgibbney wrote
> 
> I've checked working with redirects and everything seems to work fine for
> me.
> 
> The site I checked on
> 
> http://www.scotland.gov.uk
> 
> temp redirect to
> 
> http://home.scotland.gov.uk/home
> 
> Nutch gets this fine when I do some tweaking with nutch-site.xml
> 
> redirects property -1 (just to demonstrate, I would usually not set it so)
> 
> Lewis
> 

--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I've checked working with redirects and everything seems to work fine for
me.

The site I checked,

http://www.scotland.gov.uk

temporarily redirects to

http://home.scotland.gov.uk/home

Nutch gets this fine when I do some tweaking in nutch-site.xml, setting
the redirects property to -1 (just to demonstrate; I would not usually
set it so).

Lewis

On Thu, Feb 23, 2012 at 3:18 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Additionally in your nutch-site.xml we don't maintain any query-(plugins),
> and there is no parse-text plugin either.
>
>
> On Thu, Feb 23, 2012 at 3:13 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> OK, for starters we don't use crawl-urlfilter.txt anymore, this is
>> deprecated as of Nutch 1.2 iirc.
>>
>> Secondly, what are you trying to achieve here? Your url filter includes
>> +^http://www
>> \.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_Browse&DrugInitial=B$
>> +^http://www
>> \.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.Overview&DrugName=BACIGUENT$
>>
>> Your seed urls are also not exactly what I would expect for a seed list.
>>
>> One last thing, your fetcher.threads.per.host is pretty aggressive, I
>> wouldn't personally set it this high unless it was my own server I was
>> communicating with.
>>
>> So what exactly is it that you are having problems with?
>>
>> Lewis
>>
>>
>>
>>
>> On Thu, Feb 23, 2012 at 12:11 PM, xuyuanme <xu...@gmail.com> wrote:
>>
>>> Thanks! The config file can be get here:
>>> http://dl.dropbox.com/u/6614015/temp/config.zip
>>> http://dl.dropbox.com/u/6614015/temp/config.zip
>>>
>>>
>>> lewis john mcgibbney wrote
>>> >
>>> > Hi,
>>> >
>>> > Can you post your nutch-site.xml and I will give it a spin.
>>> >
>>> > Thank you
>>> >
>>> > Lewis
>>> >
>>> > On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme &lt;xuyuanme@&gt; wrote:
>>> >
>>> >> Just checked the latest code in 1.4 but it's the same. See code line
>>> 138
>>> >> in
>>> >> below link:
>>> >>
>>> >>
>>> >>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>>> >>
>>> >>
>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>>> >>
>>> >> The method just call getResponse() and set followRedirects parameter
>>> to
>>> >> *false*.
>>> >>
>>> >> So I guess the http.redirect.max setting is not working on it?
>>> >>
>>> >>
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3769491.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>
>>
>>
>> --
>> *Lewis*
>>
>>
>
>
> --
> *Lewis*
>
>


-- 
*Lewis*

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Additionally, regarding your nutch-site.xml: we don't maintain any
query-(plugins), and there is no parse-text plugin either.

On Thu, Feb 23, 2012 at 3:13 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> OK, for starters we don't use crawl-urlfilter.txt anymore, this is
> deprecated as of Nutch 1.2 iirc.
>
> Secondly, what are you trying to achieve here? Your url filter includes
> +^http://www
> \.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_Browse&DrugInitial=B$
> +^http://www
> \.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.Overview&DrugName=BACIGUENT$
>
> Your seed urls are also not exactly what I would expect for a seed list.
>
> One last thing, your fetcher.threads.per.host is pretty aggressive, I
> wouldn't personally set it this high unless it was my own server I was
> communicating with.
>
> So what exactly is it that you are having problems with?
>
> Lewis
>
>
>
>
> On Thu, Feb 23, 2012 at 12:11 PM, xuyuanme <xu...@gmail.com> wrote:
>
>> Thanks! The config file can be get here:
>> http://dl.dropbox.com/u/6614015/temp/config.zip
>> http://dl.dropbox.com/u/6614015/temp/config.zip
>>
>>
>> lewis john mcgibbney wrote
>> >
>> > Hi,
>> >
>> > Can you post your nutch-site.xml and I will give it a spin.
>> >
>> > Thank you
>> >
>> > Lewis
>> >
>> > On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme &lt;xuyuanme@&gt; wrote:
>> >
>> >> Just checked the latest code in 1.4 but it's the same. See code line
>> 138
>> >> in
>> >> below link:
>> >>
>> >>
>> >>
>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>> >>
>> >>
>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>> >>
>> >> The method just call getResponse() and set followRedirects parameter to
>> >> *false*.
>> >>
>> >> So I guess the http.redirect.max setting is not working on it?
>> >>
>> >>
>> >
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3769491.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Lewis*
>
>


-- 
*Lewis*

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
OK, for starters, we don't use crawl-urlfilter.txt anymore; it has been
deprecated since Nutch 1.2, IIRC.

Secondly, what are you trying to achieve here? Your url filter includes
+^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_Browse&DrugInitial=B$
+^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.Overview&DrugName=BACIGUENT$
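
As a rough illustration of how narrow these two rules are, here is a
minimal Java sketch. It assumes a first-match-wins filter in which a URL
that matches no rule is rejected; the class is a hypothetical stand-in,
not the actual urlfilter plugin, and the third URL in main() is an
invented redirect target used only for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Toy first-match-wins URL filter; hypothetical, not Nutch's implementation. */
final class UrlFilterSketch {

    /** One rule line: a '+' (accept) or '-' (reject) sign followed by a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    static final String BASE =
            "http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm";

    /** The two rules quoted in this thread, written as Java string literals. */
    static List<Rule> rulesFromThread() {
        String prefix = "+^http://www\\.accessdata\\.fda\\.gov/scripts/cder/drugsatfda"
                + "/index\\.cfm\\?fuseaction=";
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(prefix + "Search\\.SearchResults_Browse&DrugInitial=B$"));
        rules.add(new Rule(prefix + "Search\\.Overview&DrugName=BACIGUENT$"));
        return rules;
    }

    /** The first matching rule decides; a URL matching no rule is rejected. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = rulesFromThread();
        // The seed page and the test link from this thread both pass.
        System.out.println(accepts(rules, BASE + "?fuseaction=Search.SearchResults_Browse&DrugInitial=B"));
        System.out.println(accepts(rules, BASE + "?fuseaction=Search.Overview&DrugName=BACIGUENT"));
        // Invented redirect target: any other URL on the site is rejected.
        System.out.println(accepts(rules, BASE + "?fuseaction=Search.DrugDetails"));
    }
}
```

If redirect targets are passed through the same filter (worth checking
for your setup), a target whose URL differs from these two exact strings
would be dropped before it is ever fetched.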

Your seed URLs are also not exactly what I would expect for a seed list.

One last thing: your fetcher.threads.per.host is pretty aggressive. I
wouldn't personally set it this high unless it was my own server I was
communicating with.

So what exactly is it that you are having problems with?

Lewis



On Thu, Feb 23, 2012 at 12:11 PM, xuyuanme <xu...@gmail.com> wrote:

> Thanks! The config file can be get here:
> http://dl.dropbox.com/u/6614015/temp/config.zip
> http://dl.dropbox.com/u/6614015/temp/config.zip
>
>
> lewis john mcgibbney wrote
> >
> > Hi,
> >
> > Can you post your nutch-site.xml and I will give it a spin.
> >
> > Thank you
> >
> > Lewis
> >
> > On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme &lt;xuyuanme@&gt; wrote:
> >
> >> Just checked the latest code in 1.4 but it's the same. See code line 138
> >> in
> >> below link:
> >>
> >>
> >>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
> >>
> >>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
> >>
> >> The method just call getResponse() and set followRedirects parameter to
> >> *false*.
> >>
> >> So I guess the http.redirect.max setting is not working on it?
> >>
> >>
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3769491.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
Thanks! The config file can be downloaded here:
http://dl.dropbox.com/u/6614015/temp/config.zip
 

lewis john mcgibbney wrote
> 
> Hi,
> 
> Can you post your nutch-site.xml and I will give it a spin.
> 
> Thank you
> 
> Lewis
> 
> On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme &lt;xuyuanme@&gt; wrote:
> 
>> Just checked the latest code in 1.4 but it's the same. See code line 138
>> in
>> below link:
>>
>>
>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>>
>> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>>
>> The method just call getResponse() and set followRedirects parameter to
>> *false*.
>>
>> So I guess the http.redirect.max setting is not working on it?
>>
>>
> 

--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3769491.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Can you post your nutch-site.xml and I will give it a spin.

Thank you

Lewis

On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme <xu...@gmail.com> wrote:

> Just checked the latest code in 1.4 but it's the same. See code line 138 in
> below link:
>
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>
> The method just call getResponse() and set followRedirects parameter to
> *false*.
>
> So I guess the http.redirect.max setting is not working on it?
>
>
> remi tassing wrote
> >
> > Would you give Nucth-1.4 a try? Maybe this bug is already solved?
> >
> > Remi
> >
> > On Thursday, February 23, 2012, xuyuanme &lt;xuyuanme@&gt; wrote:
> >> Thanks for the information. But I found the wiki page
> >> http://wiki.apache.org/nutch/RedirectHandling
> >> http://wiki.apache.org/nutch/RedirectHandling  still doesn't have too
> >> much
> >> content about Nutch redirects.
> >>
> >> I found even if I set http.redirect.max=2 and
> >> db.ignore.external.links=false, the crawler still can't get redirect
> > pages.
> >> And with further digging, I found the plugin lib-http (in Nutch 1.1)
> >> contains following code:
> >>
> >> Java file: org.apache.nutch.protocol.http.api.HttpBase
> >>
> >>  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
> >> ......
> >>        response = getResponse(u, datum, */false/*); // make a request
> >> ......
> >>  }
> >>
> >>  protected abstract Response getResponse(URL url,
> >>                                          CrawlDatum datum,
> >>                                          boolean followRedirects)
> >>    throws ProtocolException, IOException;
> >>
> >> After I changed the call to getResponse(u, datum, */true/*) and
> recompile
> >> the plugin, the crawler fetches redirected pages as expected.
> >>
> >> So is this a bug in lib-http library or I had some misunderstanding on
> >> how
> >> redirect works?
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768744.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
I just checked the latest code in 1.4, but it's the same. See line 138 at
the link below:

http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup

The method just calls getResponse() with the followRedirects parameter set
to false.

So I guess the http.redirect.max setting has no effect on it?


remi tassing wrote
> 
> Would you give Nucth-1.4 a try? Maybe this bug is already solved?
> 
> Remi
> 
> On Thursday, February 23, 2012, xuyuanme &lt;xuyuanme@&gt; wrote:
>> Thanks for the information. But I found the wiki page
>> http://wiki.apache.org/nutch/RedirectHandling
>> http://wiki.apache.org/nutch/RedirectHandling  still doesn't have too
>> much
>> content about Nutch redirects.
>>
>> I found even if I set http.redirect.max=2 and
>> db.ignore.external.links=false, the crawler still can't get redirect
> pages.
>> And with further digging, I found the plugin lib-http (in Nutch 1.1)
>> contains following code:
>>
>> Java file: org.apache.nutch.protocol.http.api.HttpBase
>>
>>  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
>> ......
>>        response = getResponse(u, datum, */false/*); // make a request
>> ......
>>  }
>>
>>  protected abstract Response getResponse(URL url,
>>                                          CrawlDatum datum,
>>                                          boolean followRedirects)
>>    throws ProtocolException, IOException;
>>
>> After I changed the call to getResponse(u, datum, */true/*) and recompile
>> the plugin, the crawler fetches redirected pages as expected.
>>
>> So is this a bug in lib-http library or I had some misunderstanding on
>> how
>> redirect works?
> 

--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768744.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by remi tassing <ta...@gmail.com>.
Would you give Nutch 1.4 a try? Maybe this bug has already been fixed.

Remi

On Thursday, February 23, 2012, xuyuanme <xu...@gmail.com> wrote:
> Thanks for the information. But I found the wiki page
> http://wiki.apache.org/nutch/RedirectHandling
> http://wiki.apache.org/nutch/RedirectHandling  still doesn't have too much
> content about Nutch redirects.
>
> I found even if I set http.redirect.max=2 and
> db.ignore.external.links=false, the crawler still can't get redirect
pages.
> And with further digging, I found the plugin lib-http (in Nutch 1.1)
> contains following code:
>
> Java file: org.apache.nutch.protocol.http.api.HttpBase
>
>  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
> ......
>        response = getResponse(u, datum, */false/*); // make a request
> ......
>  }
>
>  protected abstract Response getResponse(URL url,
>                                          CrawlDatum datum,
>                                          boolean followRedirects)
>    throws ProtocolException, IOException;
>
> After I changed the call to getResponse(u, datum, */true/*) and recompile
> the plugin, the crawler fetches redirected pages as expected.
>
> So is this a bug in lib-http library or I had some misunderstanding on how
> redirect works?
>
> Thanks!
>
> lewis john mcgibbney wrote
>>
>> Hi Rafael,
>>
>> The page we are talking about will be added on the link below.
>>
>> http://wiki.apache.org/nutch/InternalDocumentation
>>
>> and will be available here
>>
>> http://wiki.apache.org/nutch/RedirectHandling
>>
>>
>
>
> --
> View this message in context:
http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>