Posted to user@nutch.apache.org by Mark Stephenson <ms...@us.ibm.com> on 2010/09/02 02:45:49 UTC
Nutch redirects.
Hi,
I am new to Nutch and I'm trying to understand how it handles
redirects. Let's say I want to fetch the following article from the
New York Times:
http://www.nytimes.com/2010/08/30/opinion/30mon1.html
That is the only URL I put in my 'urls' directory. Then I issue the
following command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
I set http.redirect.max to 4 in my nutch-site.xml file so that the
fetcher follows redirects immediately (I get the same problem when I
fetch in stages).
Nutch seems to stumble on the redirects while fetching that URL, and I
can't figure out why. Nutch receives a 301 response from the server,
indicating that the URL has moved permanently. It successfully
determines the new location and performs another GET there. So far so
good. But then the server responds with a 302, which indicates that the
file has moved temporarily. This is where things break down: the
location set in this response is wrong. Instead of
"http://www.nytimes.com/2010/08/30/opinion/30mon1.html?_r=1", which is
the correct location, the server responds with the location set to
"http://www.nytimes.com/www.nytimes.com/2010/08/30/opinion/30mon1.html?_r=1"
(and this page does not exist). This does not appear to be a cookies
issue either, because wget follows the redirects successfully even when
I disable cookies. In fact, when I used Wireshark to trace the
connections, the only major protocol differences I could see between
wget and Nutch were that wget leaves the connection open, and the
erroneous response for the temporary redirect.
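The hop-by-hop behaviour described above can be modelled in a few lines. This is only a sketch, not Nutch's fetcher code: `follow_redirects`, the injected `fetch` callable and the example.com URLs are all hypothetical, and `max_redirects` merely plays the role of http.redirect.max. The canned responses replay the 301, 302 and "not found" sequence reported here without touching the network.

```python
from typing import Callable, List, Optional, Tuple

# One HTTP response, reduced to (status code, Location header or None).
Response = Tuple[int, Optional[str]]


def follow_redirects(fetch: Callable[[str], Response], url: str,
                     max_redirects: int) -> List[Tuple[str, int]]:
    """Follow 3xx responses for at most max_redirects hops.

    Returns every (url, status) pair visited so each hop can be inspected.
    """
    hops = []
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in (301, 302, 303, 307, 308) or location is None:
            break  # final response, or a redirect without a target
        url = location
    return hops


# Hypothetical canned responses (placeholder host, not the real site).
CANNED = {
    "http://example.com/article": (301, "http://example.com/moved"),
    "http://example.com/moved": (302, "http://example.com/example.com/article?_r=1"),
}


def fake_fetch(url: str) -> Response:
    # Anything not in the table is treated as "not found".
    return CANNED.get(url, (404, None))


for hop_url, hop_status in follow_redirects(fake_fetch, "http://example.com/article", 4):
    print(hop_status, hop_url)
```

Logging every hop this way makes it easy to compare the Location a client was given with the URL it actually requested next.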
Has anyone else seen this sort of behavior before? I would appreciate
any guidance on getting this working.
Thanks a lot,
Mark
Re: Nutch redirects.
Posted by Mark Stephenson <ms...@us.ibm.com>.
Thanks so much Volli! I just verified that I am able to index that
URL (and other similar URLs) with your settings.
Thanks again,
Mark
Re: Nutch redirects.
Posted by Volli <il...@web.de>.
Whenever urlnormalizer-basic is involved:
==>
"Generator: 0 records selected for fetching, exiting ..."

--------------------- workout -----
search string: "Mr. Feinberg and the Gulf Settlement"

fetcher.threads.fetch: 5 (not relevant)
fetcher.threads.per.host: 5 (not relevant)
http.redirect.max: 25 (never set to 0!)

urlnormalizer-(pass|regex|basic):
NO: "Generator: 0 records selected for fetching, exiting ..."

[none]:
YES: ? search hits

urlnormalizer-(pass):
YES: 5 search hits

urlnormalizer-(regex):
YES: 5 search hits

urlnormalizer-(basic):
NO: "Generator: 0 records selected for fetching, exiting ..."

urlnormalizer-(pass|regex):
YES: 5 search hits
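To double-check which normalizers a given plugin.includes value switches on before running a crawl, a quick regex test helps. This is a sketch under one assumption: that Nutch matches the expression against each whole plugin id. `enabled_normalizers` is a hypothetical helper, not a Nutch API; the base expression is the plugin.includes value from the working crawl.

```python
import re

# plugin.includes value from the crawl that worked (no urlnormalizer-* at all).
BASE = ("protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|"
        "index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|"
        "summary-basic|scoring-opic")

NORMALIZERS = ["urlnormalizer-pass", "urlnormalizer-regex", "urlnormalizer-basic"]


def enabled_normalizers(plugin_includes: str) -> list:
    """Return the normalizer plugin ids that the expression matches in full.

    Assumption: a plugin is enabled only if its whole id matches.
    """
    pattern = re.compile(plugin_includes)
    return [pid for pid in NORMALIZERS if pattern.fullmatch(pid)]


print(enabled_normalizers(BASE))
# -> []
print(enabled_normalizers(BASE + "|urlnormalizer-(pass|regex)"))
# -> ['urlnormalizer-pass', 'urlnormalizer-regex']
```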
Re: Nutch redirects.
Posted by Volli <il...@web.de>.
After hours of playing around with threads, redirections,
removing plugins, changing protocol-x, adding seeds, filters
and things I can't remember, I wanted to confirm that there's
no way (for me) to get rid of this redirection:
http://www.nytimes.com/www.nytimes.com/2010/08/30/...

While I was wiping away my tears, a last crawl was running...
successfully. I could search and find "Mr. Feinberg and the
Gulf Settlement".

Don't ask why. These were my hardcore settings (don't ask what):

-----------------
plugin.includes:
protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic

(yes, without any urlnormalizer-()). My next step would be
to find out which one of urlnormalizer-(pass|regex|basic) is responsible.

-----------------
http.redirect.max:
25

-----------------
fetcher.threads.fetch:
1

-----------------
fetcher.threads.per.host:
1

-----------------
Seeds (urls-testing):
http://www.nytimes.com/2010/08/30/opinion/30mon1.html

-----------------
crawl-urlfilter.txt:
+^http://www\.nytimes\.com/

(just 1 line; all others commented out.)

--------bash---------
removing all logs. removing crawl dir.

bin/nutch crawl urls-testing -dir crawl-testing -depth 5 -topN 10

(you know that -topN is just a limiter for testing?)
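For anyone wondering what that single crawl-urlfilter.txt line lets through, here is a rough model of a regex URL filter. It assumes the usual reading of such files ('+' accepts, '-' rejects, the first rule whose pattern is found in the URL decides, and a URL matching no rule is dropped); `parse_rules` and `accepts` are hypothetical names, not Nutch code, and the commented-out rule is made up for illustration.

```python
import re
from typing import List, Pattern, Tuple

RULES_TEXT = r"""
# -^http://example\.com/   (hypothetical rule, commented out and so ignored)
+^http://www\.nytimes\.com/
"""


def parse_rules(text: str) -> List[Tuple[bool, Pattern[str]]]:
    """Turn filter lines into (accept?, compiled pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, regex = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(regex)))
    return rules


def accepts(rules: List[Tuple[bool, Pattern[str]]], url: str) -> bool:
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(RULES_TEXT)
SEED = "http://www.nytimes.com/2010/08/30/opinion/30mon1.html"
DOUBLED = "http://www.nytimes.com/www.nytimes.com/2010/08/30/opinion/30mon1.html?_r=1"

print(accepts(RULES, SEED))                   # True
print(accepts(RULES, DOUBLED))                # True
print(accepts(RULES, "http://example.com/"))  # False
```

Both the seed and the doubled-host redirect target pass the rule, so this filter by itself would not keep the bad target out of the crawl.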
Re: Nutch redirects.
Posted by Andrzej Bialecki <ab...@getopt.org>.
You can use tcpdump (or Wireshark on Windows) to capture this interaction,
to at least verify the actual bits exchanged between Nutch and
the server...
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Nutch redirects.
Posted by Mark Stephenson <ms...@us.ibm.com>.
Thanks a lot for the reply Andrzej. I was not aware of the difference
between protocol-http and protocol-httpclient. I was running with
protocol-http, but unfortunately I am getting the same result with the
protocol-httpclient plugin. With either protocol, Nutch isn't even
being redirected to a landing page. When it follows the temporary
redirect, it receives a 404 Not Found response.
I am going to try to get to the bottom of this. In the meantime, I
welcome any and all suggestions.
Thanks again,
Mark
Re: Nutch redirects.
Posted by Andrzej Bialecki <ab...@getopt.org>.
I can reproduce this situation in a browser when I disable cookies. The
page redirects 5 times before it arrives at the landing page (which is a
login popup). So it looks like it's a cookies issue after all.

Are you using the protocol-http or the protocol-httpclient plugin? Only the
latter supports cookies.
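To illustrate why cookie support can matter for redirect chains, here is a toy model with no networking involved. Everything in it is an assumption made for illustration: `toy_server` is a hypothetical server that keeps redirecting until the client sends back the cookie it was handed, which is one way a site could produce the repeated redirects seen with cookies disabled. It says nothing about how the real site is implemented.

```python
from typing import Dict, List, Optional, Tuple


def toy_server(path: str, cookies: Dict[str, str]
               ) -> Tuple[int, Optional[str], Dict[str, str]]:
    """Hypothetical server: redirect until the client returns its cookie.

    Returns (status, Location or None, cookies to set).
    """
    if cookies.get("seen") == "1":
        return 200, None, {}
    return 302, path, {"seen": "1"}  # bounce back to the same path, set a cookie


def fetch(path: str, keep_cookies: bool, max_redirects: int = 5) -> List[int]:
    """Follow redirects and return the status code of every hop."""
    jar: Dict[str, str] = {}
    statuses = []
    for _ in range(max_redirects + 1):
        status, location, set_cookies = toy_server(path, jar)
        statuses.append(status)
        if keep_cookies:
            jar.update(set_cookies)  # a cookie-aware client remembers them
        if status != 302 or location is None:
            break
        path = location
    return statuses


print(fetch("/article", keep_cookies=True))   # [302, 200]
print(fetch("/article", keep_cookies=False))  # [302, 302, 302, 302, 302, 302]
```

A client that stores cookies gets through after one bounce, while one that discards them burns through its whole redirect budget.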