Posted to user@nutch.apache.org by al...@aim.com on 2012/03/01 21:09:42 UTC

Re: http.redirect.max

 Hello,

I tried 1, 2, and -1 for http.redirect.max, but Nutch still postpones redirected URLs to later depths.
What is the correct config setting to have Nutch crawl redirected URLs immediately? I need this because my crawl depth is restricted to at most 2.
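
For reference, http.redirect.max is overridden in conf/nutch-site.xml. Below is a minimal sketch of such an override, assuming the value 2 (one of the values tried above). The description text paraphrases nutch-default.xml from memory and may differ between Nutch versions:

```xml
<!-- conf/nutch-site.xml: overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- Assumed value; 1, 2 and -1 were the values tried above. -->
    <value>2</value>
    <description>Maximum number of redirects the fetcher follows when
    fetching a page. Per nutch-default.xml, if the value is 0 or negative
    the fetcher does not follow redirects immediately and records the
    redirect targets for later fetching instead.</description>
  </property>
</configuration>
```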

Thanks.
Alex.

 

 

-----Original Message-----
From: xuyuanme <xu...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max


The config file is used for some proof-of-concept testing, so its content
might be confusing; please ignore the incorrect parts.

Yes, from my end I can see that the crawl for http://www.scotland.gov.uk
is redirected as expected.

However, the website I am trying to crawl is a bit trickier.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
as the seed page

2. And try to crawl one of the links
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT)
as a test

If you click the link, you'll find that the website uses redirects and
cookies to control page navigation. So I used the protocol-httpclient
plugin instead of protocol-http to handle the cookies.

However, the redirect does not happen as expected. The only way I can fetch
the second link is to manually change the "response = getResponse(u, datum,
*false*)" call to "response = getResponse(u, datum, *true*)" in
org.apache.nutch.protocol.http.api.HttpBase.java and recompile the
lib-http plugin.

So my issue is related to this specific site
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B


lewis john mcgibbney wrote
> 
> I've checked working with redirects and everything seems to work fine for
> me.
> 
> The site I checked on
> 
> http://www.scotland.gov.uk
> 
> temp redirect to
> 
> http://home.scotland.gov.uk/home
> 
> Nutch gets this fine when I do some tweaking with nutch-site.xml
> 
> redirects property -1 (just to demonstrate, I would usually not set it so)
> 
> Lewis
> 

--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 

Re: http.redirect.max

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alex,

Can you please have a look at NUTCH-1042?

Might it be the case that your redirect possibly has a crawl-delay which
then falls into the boundary case we witness in the issue above?

You may want to change your log properties to debug for a while and run
some small crawls on your problem URLs. Maybe also try adding some LOG.debug
statements to see which conditions are being satisfied around the
fetcher areas mentioned in NUTCH-1042.

hth


-- 
*Lewis*