Posted to user@nutch.apache.org by Rafael Pappert <rp...@fwpsystems.com> on 2011/11/16 20:17:09 UTC

http.redirect.max

Hello List,

Is it possible to follow HTTP 301 redirects immediately?

I tried setting http.redirect.max to 3, but the page is
still not indexed. readdb still shows 1 page as
unfetched / db_redir_perm, and I can't find the
redirect target in the crawldb.

How does nutch handle redirects?

Thanks in advance,
Rafael.





Re: http.redirect.max

Posted by Rafael Pappert <rp...@fwpsystems.com>.
Hi Alex,

this is not really a bug; it's an "undocumented" feature.
db.ignore.external.links prevents the fetcher from breaking
out of your set of domains, which is exactly what you need if you
don't want to crawl the whole web.

Best regards,
Rafael.
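
The two properties discussed above live in conf/nutch-site.xml. A minimal sketch; the values are illustrative, not recommendations:

```xml
<!-- conf/nutch-site.xml: illustrative values only -->
<property>
  <name>http.redirect.max</name>
  <!-- greater than 0: follow up to N redirects immediately;
       0 (the default): record the target and fetch it in a later round -->
  <value>3</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <!-- true keeps the crawl inside the seed hosts, which also blocks
       redirects that cross hosts (e.g. domain.com -> www.domain.com) -->
  <value>true</value>
</property>
```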




Re: http.redirect.max

Posted by al...@aim.com.
 Hi,

Is this issue resolved in https://issues.apache.org/jira/browse/NUTCH-1044
for the case when 
db.ignore.external.links set to true
?

Thanks.
Alex.


 

 


 

Re: http.redirect.max

Posted by Ferdy Galema <fe...@kalooga.com>.
Thanks for updating the list.


Re: http.redirect.max

Posted by Rafael Pappert <rp...@fwpsystems.com>.
Hi,

after some investigation I found the problem.
I had db.ignore.external.links set to true, which is why
the fetcher wasn't following the redirect from domain.com to
www.domain.com.

Rafael.




Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alex,

Can you please have a look at NUTCH-1042?

Might it be the case that your redirect possibly has a crawl-delay which
then falls into the boundary case we witness in the issue above?

You may want to change your log properties to debug for a while and run
some small crawls on your problem URLs; maybe try adding some LOG.debug
statements to see what kind of conditions are being satisfied around the
fetcher areas mentioned in NUTCH-1042.

hth



-- 
*Lewis*

Re: http.redirect.max

Posted by al...@aim.com.
 Hello,

I tried 1, 2, and -1 for the config http.redirect.max, but Nutch still postpones redirected URLs to later depths.
What is the correct config setting to have Nutch crawl redirected URLs immediately? I need it because I have a restriction that depth be at most 2.

Thanks.
Alex.

 

 


 

Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
The config file was used for some proof-of-concept testing, so the content
might be confusing; please ignore the incorrect parts.

Yes, from my end I can see that the crawl for the website http://www.scotland.gov.uk
is redirected as expected.

However, the website I tried to crawl is a bit trickier.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
as the seed page

2. And try to crawl one of the link
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT)
as a test

If you click the link, you'll find the website uses a redirect and a cookie to
control page navigation. So I used the protocol-httpclient plugin instead of
protocol-http to handle the cookie.

However, the redirect does not happen as expected. The only way I can fetch
the second link is to manually change the "response = getResponse(u, datum,
false)" call to "response = getResponse(u, datum, true)" in the
org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
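
For reference, switching from protocol-http to protocol-httpclient is done through plugin.includes in nutch-site.xml. A sketch; the other plugin names follow common Nutch 1.x defaults and may differ in your installation:

```xml
<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient replaces protocol-http to get cookie support;
       the remaining entries mirror common Nutch 1.x defaults -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```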



--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I've checked working with redirects and everything seems to work fine for
me.

The site I checked:

http://www.scotland.gov.uk

temporarily redirects to

http://home.scotland.gov.uk/home

Nutch gets this fine after some tweaking in nutch-site.xml:

the redirects property set to -1 (just to demonstrate; I would usually not set it so)

Lewis



-- 
*Lewis*

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Additionally in your nutch-site.xml we don't maintain any query-(plugins),
and there is no parse-text plugin either.



-- 
*Lewis*

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
OK, for starters, we don't use crawl-urlfilter.txt anymore; it has been
deprecated since Nutch 1.2, IIRC.

Secondly, what are you trying to achieve here? Your url filter includes
+^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_Browse&DrugInitial=B$
+^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.Overview&DrugName=BACIGUENT$

Your seed urls are also not exactly what I would expect for a seed list.

One last thing: your fetcher.threads.per.host is pretty aggressive; I
wouldn't personally set it this high unless it was my own server I was
communicating with.

So what exactly is it that you are having problems with?

Lewis






-- 
*Lewis*

Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
Thanks! The config file can be downloaded here:
http://dl.dropbox.com/u/6614015/temp/config.zip
 


--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3769491.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

Can you post your nutch-site.xml and I will give it a spin.

Thank you

Lewis




-- 
*Lewis*

Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
Just checked the latest code in 1.4, but it's the same. See code line 138
at the link below:

http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup

The method just calls getResponse() and sets the followRedirects parameter
to false.

So I guess the http.redirect.max setting has no effect there?



--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768744.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by remi tassing <ta...@gmail.com>.
Would you give Nutch-1.4 a try? Maybe this bug is already solved?

Remi


Re: http.redirect.max

Posted by xuyuanme <xu...@gmail.com>.
Thanks for the information. But I found the wiki page
http://wiki.apache.org/nutch/RedirectHandling
still doesn't have much content about Nutch redirects.

I found that even if I set http.redirect.max=2 and
db.ignore.external.links=false, the crawler still can't fetch redirected
pages. With further digging, I found that the lib-http plugin (in Nutch 1.1)
contains the following code:

Java file: org.apache.nutch.protocol.http.api.HttpBase

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ......
    response = getResponse(u, datum, false); // make a request
    ......
  }

  protected abstract Response getResponse(URL url,
                                          CrawlDatum datum,
                                          boolean followRedirects)
    throws ProtocolException, IOException;

After I changed the call to getResponse(u, datum, true) and recompiled the
plugin, the crawler fetches redirected pages as expected.

So is this a bug in the lib-http library, or do I have a misunderstanding of
how redirects work?

Thanks!
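
In diff form, the workaround described above amounts to this one-line change in HttpBase.getProtocolOutput() (the exact line position varies by version):

```diff
- response = getResponse(u, datum, false); // make a request
+ response = getResponse(u, datum, true);  // follow redirects in the protocol layer
```

Note this hard-codes redirect following in the protocol layer regardless of http.redirect.max, so it is a local patch rather than a supported configuration.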


--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3768657.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Rafael,

The page we are talking about will be added on the link below.

http://wiki.apache.org/nutch/InternalDocumentation

and will be available here

http://wiki.apache.org/nutch/RedirectHandling


> I guess the poor documentation of nutch/hadoop is the biggest problem for
> beginners like me. I started with Nutch ~4-6 months ago (not full time, but
> several hours every week). At first I wrote some plugins (parser/indexer).
> This was a bit tricky because I had to learn directly from the source,
> since most of the tutorials/documents were outdated (<1.0) or simply wrong.
>

Please note we are trying to remove as much duplicated documentation
regarding Nutch & Hadoop as possible. The Nutch wiki has been updated
recently and this is ongoing work, so hopefully we can improve it more in
the near future. As Nutch focuses purely on web crawling, the Hadoop
material can be viewed directly in the Hadoop wiki. I've added a link to
this on our wiki's Nutch Hadoop Tutorial.


> My crawler is now running and I need to scale it up. The current version
> runs in local mode but thats not really fast. So I started to setup a
> hadoop
> cluster (4 Nodes) to run nutch in the deploy mode. This is were I'm today
> and
> my current questions are:
>
> - I will buy some new hardware for the Hadoop cluster, but I'm not sure
> about the configuration. Is Nutch I/O- or CPU-heavy?
>

On a brand new hardware configuration I have not heard of anyone blowing
gaskets or anything similar. If there is something wrong, it can usually be
fixed by improving the configuration.


>
> - what is the difference between protocol-httpclient and protocol-http?
> Just
> ssl and authentication? What about performance?
>

protocol-httpclient is broken, please see the jira issue that has been
filed. You will also need to have a look at the code for this as I am by no
means an expert with the protocol-httpclient material.

>
> - what is a good value for the following configuration parameter:
>        - fetcher.threads.fetch
>        - fetcher.threads.per.queue
>        - mapred.tasktracker.map.tasks.maximum
>        - mapred.tasktracker.reduce.tasks.maximum
>        - mapred.map.tasks
>        - mapred.reduce.tasks
>

Impossible to say; this varies significantly with the crawl, the network,
the nature of the crawl data, etc. You simply need to experiment and read as
much existing documentation as possible. Sorry about this one.

>
>        My current hardware is a 4 Node Cluster  of  dual CPU (quad core
> xeon), 32GB RAM, 2*2TB SATA HDD.
>        I know it's impossible to define the "always right" value. But a
> rule of the thumb, to use as start value, would be very a great thing
>        and would save me a lot of "try-and-error" investigation.
>

Unfortunately, this is open source software you are using. Maybe Cloudera or
some of the other commercially motivated experts can help you with this
stuff. It is outwith my experience. Try here:
http://wiki.apache.org/nutch/Support


> - what's the difference between fetcher.threads.fetch in the configuration
> and the -threads option of the crawl command?
>
This depends on how you wish to monitor/schedule your Nutch crawls. As you
know, running individual commands gives you more flexibility/control over
how Nutch does the work for you.

>
> - is it possible to follow external links only on 301 redirects?
>
Not got a clue, but I will definitely include this type of material in the
wiki page I created above. Maybe you can do a bit of investigation and help
me out when I get round to writing up on this stuff.


>
> - what is happening if a page is marked as db_redir_temp / db_redir_perm?
>        Refetch after db.fetch.interval.default?
>
Again we will need to work together to get our heads around this; if you
have a look at the code then maybe we can get something written up in due
course.
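On the redirect questions specifically: as noted elsewhere in this thread, http.redirect.max alone is not enough when db.ignore.external.links is true, because a 301 from domain.com to www.domain.com is treated as an external link and dropped. A sketch of the relevant nutch-site.xml overrides (values illustrative):

```xml
<!-- nutch-site.xml: sketch based on this thread; values illustrative -->
<property>
  <name>http.redirect.max</name>
  <!-- a value > 0 makes the fetcher follow redirects immediately, up to
       this depth; 0 (the default) records the target for later fetching -->
  <value>3</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <!-- when true, redirect targets outside the seed domain are dropped,
       e.g. a 301 from domain.com to www.domain.com -->
  <value>false</value>
</property>
```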

Sorry about the vague answers, however it's a pretty large task to answer
everything fully considering there are ~5-10 questions all in. I'm sure
there must be some material on the user@ archives so please have a look
there as well.

hth

Lewis

Re: http.redirect.max

Posted by Rafael Pappert <rp...@fwpsystems.com>.
Hi Lewis,
> 
> The honest truth is that there needs to be comprehensive documentation on
> the wiki for the way that Nutch handles redirects. This is a question that
> has gone fully unanswered for some time.

That's true.

>  In the meantime, can you advise if there is anything over
> and above the files in nutch-default.xml and o.a.n.protocol package which
> you would like to see documented?

I guess the poor documentation of nutch/hadoop is the biggest problem for
beginners like me. I started with nutch ~4-6 months ago (not full time, but several
hours every week). At first I wrote some plugins (parser/indexer). This was
a bit tricky, because I had to learn directly from the source; most of
the tutorials/documents were outdated (<1.0) or simply wrong.

My crawler is now running and I need to scale it up. The current version
runs in local mode, but that's not really fast. So I started to set up a hadoop
cluster (4 nodes) to run nutch in deploy mode. This is where I am today, and
my current questions are:

- I will buy some new hardware for the hadoop cluster, but I'm not sure about
the configuration. Is nutch I/O- or CPU-heavy?

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

- what is the difference between protocol-httpclient and protocol-http? Just
SSL and authentication? What about performance?

- what is a good value for the following configuration parameter:
	- fetcher.threads.fetch
	- fetcher.threads.per.queue
	- mapred.tasktracker.map.tasks.maximum
	- mapred.tasktracker.reduce.tasks.maximum
	- mapred.map.tasks
	- mapred.reduce.tasks

	My current hardware is a 4 node cluster of dual CPU (quad core xeon), 32GB RAM, 2*2TB SATA HDD.
	I know it's impossible to define the "always right" value. But a rule of thumb, to use as a start value, would be a great thing
	and would save me a lot of "trial-and-error" investigation.

- what's the difference between fetcher.threads.fetch from the configuration and the -threads option from the crawl
command?

- is it possible to follow external links only on 301 redirects?

- what is happening if a page is marked as db_redir_temp / db_redir_perm? 
	Refetch after db.fetch.interval.default?


I found loads of tutorials and all of them have the "same" content, only the
very basics (how to do your first crawl). I guess comprehensive documentation
would be a big step for the amazing nutch/hadoop project.

Thanks in advance,
Rafael.


> 
> Thanks
> 
> On Wed, Nov 16, 2011 at 7:17 PM, Rafael Pappert <rp...@fwpsystems.com> wrote:
> 
>> Hello List,
>> 
>> is it possible to follow http 301 redirects immediately?
>> 
>> I tried to set http.redirect.max to 3 but the page is
>> still not indexed. readdb is still showing 1 page is
>> unfetched / db_redir_perm. And I can't find the
>> redirection target in the crawldb.
>> 
>> How does nutch handle redirects?
>> 
>> Thanks in advance,
>> Rafael.
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> *Lewis*


Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Rafael,

The honest truth is that there needs to be comprehensive documentation on
the wiki for the way that Nutch handles redirects. This is a question that
has gone fully unanswered for some time. That's just the way it is, I suppose.
I'll get my head around everything and try to get some wiki page up and
running ASAP. In the meantime, can you advise if there is anything over
and above the files in nutch-default.xml and o.a.n.protocol package which
you would like to see documented?

Thanks

On Wed, Nov 16, 2011 at 7:17 PM, Rafael Pappert <rp...@fwpsystems.com> wrote:

> Hello List,
>
> is it possible to follow http 301 redirects immediately?
>
> I tried to set http.redirect.max to 3 but the page is
> still not indexed. readdb is still showing 1 page is
> unfetched / db_redir_perm. And I can't find the
> redirection target in the crawldb.
>
> How does nutch handle redirects?
>
> Thanks in advance,
> Rafael.
>
>
>
>
>


-- 
*Lewis*