Posted to user@nutch.apache.org by Ar...@csiro.au on 2015/10/28 08:57:48 UTC

Bug: redirected URLs lost on indexing stage?

Hi,

I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question mark in the subject because I work with a Nutch modification called Arch (see http://www.atnf.csiro.au/computing/software/arch/). This is why I am only 99% sure that the same bug would occur in the original Nutch 1.9.

In my experience, Nutch follows redirects OK (after NUTCH-2124 is applied), fetches the target content, parses and saves it, but loses it at the indexing stage. This happens because the db datum is mapped with the original URL as the key, while the fetch and parse data and parse text are mapped with the final URL in IndexerMapReduce. Therefore, when this condition is checked

if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
  return;                                     // only have inlinks
}

both sets get ignored because each one is incomplete.
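
To illustrate, here is a minimal, self-contained sketch of the mismatch. It is not Nutch code: the URLs are made up, and the strings just stand in for the dbDatum/fetchDatum/parseData/parseText values that IndexerMapReduce would group per key.

import java.util.*;

// Simulates the reduce-side grouping described above: the db datum
// arrives under the original URL, while the fetch/parse values arrive
// under the final (redirect target) URL, so neither key ever holds a
// complete record and both fail the null check.
public class RedirectJoinSketch {
  public static void main(String[] args) {
    Map<String, Set<String>> valuesByUrl = new TreeMap<>();
    valuesByUrl.put("http://example.com/old",
        new HashSet<>(Arrays.asList("dbDatum")));
    valuesByUrl.put("http://example.com/new",
        new HashSet<>(Arrays.asList("fetchDatum", "parseData", "parseText")));

    List<String> required =
        Arrays.asList("dbDatum", "fetchDatum", "parseData", "parseText");
    for (Map.Entry<String, Set<String>> e : valuesByUrl.entrySet()) {
      boolean complete = e.getValue().containsAll(required);
      System.out.println(e.getKey() + " -> "
          + (complete ? "indexed" : "skipped (incomplete)"));
    }
  }
}

Both keys come out as "skipped (incomplete)", which matches what I see.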

I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. This is because I am not completely sure that this is a bug in Nutch (see above), and also because what will work for Arch may not work for Nutch. They differ in how they use the crawl db.

Regards,

Arkadi



Re: Bug: redirected URLs lost on indexing stage?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> I meant #1 and used http.redirect.max == 3.

In this case you definitely have to apply the fix for
NUTCH-2124 / NUTCH-1939 and rebuild your 1.9 package.
Or use 1.10, where NUTCH-1939 is fixed and has not yet
reappeared as NUTCH-2124 :)

Alternatively, use http.redirect.max == 0 and crawl
a sufficient number of rounds.

Cheers,
Sebastian


On 11/06/2015 05:09 AM, Arkadi.Kosmynin@csiro.au wrote:
> Hi Sebastian,
> 
> I meant #1 and used http.redirect.max == 3.
> 
> Thanks,
> Arkadi
> 
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>> Sent: Tuesday, 3 November 2015 6:13 PM
>> To: user@nutch.apache.org
>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>
>> Hi Arkadi,
>>
>>> Example: use http://www.atnf.csiro.au/observers/  as seed and set
>>> depth to 1. It will be redirected to
>>> http://www.atnf.csiro.au/observers/index.html, fetched and parsed
>> successfully and then lost. If you set depth to 2, it will get indexed.
>>
>> Just to be sure we use the same terminology: What does "depth" mean?
>> 1 number of rounds: number of generate-fetch-update cycles when running
>> nutch,
>>   see command-line help of bin/crawl
>> 2 value of property http.redirect.max
>> 3 value of property scoring.depth.max (used by plugin scoring-depth)
>>
>> If it's about #1 and if http.redirect.max == 0 (the default):
>> you need at least two rounds to index a redirected page.
>> During the first round the redirect is fetched and the redirect target is
>> recorded. The second round will fetch, parse and index the redirect target.
>>
>> If http.redirect.max is set to a value > 0, the fetcher will follow redirects
>> immediately in the current round. But there are some drawbacks, and that's
>> why this isn't the default:
>> - no deduplication if multiple pages are redirected
>>   to the same target, e.g., an error page.
>>   This means you'll spend extra network bandwidth
>>   to fetch the same content multiple times.
>>   Nutch will keep only one instance of the page anyway.
>> - by setting http.redirect.max to a high value you
>>   may get lost in round-trip redirects
>> - if http.redirect.max is too low, longer redirect
>>   chains are cut off. Nutch will not follow these
>>   redirects.
>>
>> Cheers,
>> Sebastian
>>
>>
>> On 11/03/2015 01:21 AM, Arkadi.Kosmynin@csiro.au wrote:
>>> Hi Sebastian,
>>>
>>> Thank you for your very quick and detailed response. I've checked again
>>> and found that redirected URLs get lost if they were injected in the last
>>> iteration.
>>>
>>> Example: use http://www.atnf.csiro.au/observers/  as seed and set depth
>> to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html,
>> fetched and parsed successfully and then lost. If you set depth to 2, it will get
>> indexed.
>>>
>>> If you use http://www.atnf.csiro.au/observers/index.html as seed, it will
>> be fetched, parsed and indexed successfully even if you set depth to 1.
>>>
>>>  Regards,
>>> Arkadi
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>>>> Sent: Thursday, 29 October 2015 7:23 AM
>>>> To: user@nutch.apache.org
>>>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>>>
>>>> Hi Arkadi,
>>>>
>>>>> In my experience, Nutch follows redirects OK (after NUTCH-2124
>>>>> applied),
>>>>
>>>> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
>>>>
>>>>
>>>>> fetches target content, parses and saves it, but loses it at the
>>>>> indexing stage.
>>>>
>>>> Can you give a concrete example?
>>>>
>>>> While testing NUTCH-2124, I've verified that redirect targets get indexed.
>>>>
>>>>
>>>>> Therefore, when this condition is checked
>>>>>
>>>>> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
>>>>>   return;                                     // only have inlinks
>>>>> }
>>>>>
>>>>> both sets get ignored because each one is incomplete.
>>>>
>>>> This code snippet is correct: a redirect is pretty much the same as a
>>>> link, in that the crawler follows it. There are many differences, but the
>>>> central point is the same: a link itself does not get indexed, only the
>>>> link target. And that's the same for redirects. There are always at
>>>> least 2 URLs:
>>>> - the source of the redirect
>>>> - and the target of the redirection
>>>> Only the latter gets indexed, after it has been fetched and provided it
>>>> is not a redirect itself.
>>>>
>>>> The source has no parseText and parseData, and that's why it cannot be
>>>> indexed.
>>>>
>>>> If the target does not make it into the index:
>>>> - first, check whether it passes URL filters and is not changed by
>>>> normalizers
>>>> - was it successfully fetched and parsed?
>>>> - not excluded by robots=noindex?
>>>>
>>>> You should check the CrawlDb and the segments for this URL.
>>>>
>>>> If you could provide a concrete example, I'm happy to have a detailed
>>>> look at it.
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>>
>>>> On 10/28/2015 08:57 AM, Arkadi.Kosmynin@csiro.au wrote:
>>>>> Hi,
>>>>>
>>>>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a
>>>>> question mark in the subject because I work with a Nutch modification
>>>>> called Arch (see http://www.atnf.csiro.au/computing/software/arch/).
>>>>> This is why I am only 99% sure that the same bug would occur in the
>>>>> original Nutch 1.9.
>>>>>
>>>>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
>>>>> applied), fetches the target content, parses and saves it, but loses it
>>>>> at the indexing stage. This happens because the db datum is mapped
>>>>> with the original URL as the key, while the fetch and parse data and
>>>>> parse text are mapped with the final URL in IndexerMapReduce.
>>>>> Therefore, when this condition is checked
>>>>>
>>>>> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
>>>>>   return;                                     // only have inlinks
>>>>> }
>>>>>
>>>>> both sets get ignored because each one is incomplete.
>>>>>
>>>>> I am going to fix this for Arch, but can't offer a patch for Nutch,
>>>>> sorry. This is because I am not completely sure that this is a bug in
>>>>> Nutch (see above), and also because what will work for Arch may not
>>>>> work for Nutch. They differ in how they use the crawl db.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Arkadi
>>>>>
>>>>>
>>>>>
>>>
> 


RE: Bug: redirected URLs lost on indexing stage?

Posted by Ar...@csiro.au.
Hi Sebastian,

I meant #1 and used http.redirect.max == 3.

Thanks,
Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Tuesday, 3 November 2015 6:13 PM
> To: user@nutch.apache.org
> Subject: Re: Bug: redirected URLs lost on indexing stage?
> 
> Hi Arkadi,
> 
> > Example: use http://www.atnf.csiro.au/observers/  as seed and set
> > depth to 1. It will be redirected to
> > http://www.atnf.csiro.au/observers/index.html, fetched and parsed
> successfully and then lost. If you set depth to 2, it will get indexed.
> 
> Just to be sure we use the same terminology: What does "depth" mean?
> 1 number of rounds: number of generate-fetch-update cycles when running
> nutch,
>   see command-line help of bin/crawl
> 2 value of property http.redirect.max
> 3 value of property scoring.depth.max (used by plugin scoring-depth)
> 
> If it's about #1 and if http.redirect.max == 0 (the default):
> you need at least two rounds to index a redirected page.
> During the first round the redirect is fetched and the redirect target is
> recorded. The second round will fetch, parse and index the redirect target.
> 
> If http.redirect.max is set to a value > 0, the fetcher will follow redirects
> immediately in the current round. But there are some drawbacks, and that's
> why this isn't the default:
> - no deduplication if multiple pages are redirected
>   to the same target, e.g., an error page.
>   This means you'll spend extra network bandwidth
>   to fetch the same content multiple times.
>   Nutch will keep only one instance of the page anyway.
> - by setting http.redirect.max to a high value you
>   may get lost in round-trip redirects
> - if http.redirect.max is too low, longer redirect
>   chains are cut off. Nutch will not follow these
>   redirects.
> 
> Cheers,
> Sebastian
> 
> 
> On 11/03/2015 01:21 AM, Arkadi.Kosmynin@csiro.au wrote:
> > Hi Sebastian,
> >
> > Thank you for your very quick and detailed response. I've checked again
> > and found that redirected URLs get lost if they were injected in the last
> > iteration.
> >
> > Example: use http://www.atnf.csiro.au/observers/  as seed and set depth
> to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html,
> fetched and parsed successfully and then lost. If you set depth to 2, it will get
> indexed.
> >
> > If you use http://www.atnf.csiro.au/observers/index.html as seed, it will
> be fetched, parsed and indexed successfully even if you set depth to 1.
> >
> >  Regards,
> > Arkadi
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> >> Sent: Thursday, 29 October 2015 7:23 AM
> >> To: user@nutch.apache.org
> >> Subject: Re: Bug: redirected URLs lost on indexing stage?
> >>
> >> Hi Arkadi,
> >>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124
> >>> applied),
> >>
> >> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> >>
> >>
> >>> fetches target content, parses and saves it, but loses it at the
> >>> indexing stage.
> >>
> >> Can you give a concrete example?
> >>
> >> While testing NUTCH-2124, I've verified that redirect targets get indexed.
> >>
> >>
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> >>>   return;                                     // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>
> >> This code snippet is correct: a redirect is pretty much the same as a
> >> link, in that the crawler follows it. There are many differences, but the
> >> central point is the same: a link itself does not get indexed, only the
> >> link target. And that's the same for redirects. There are always at
> >> least 2 URLs:
> >> - the source of the redirect
> >> - and the target of the redirection
> >> Only the latter gets indexed, after it has been fetched and provided it
> >> is not a redirect itself.
> >>
> >> The source has no parseText and parseData, and that's why it cannot be
> >> indexed.
> >>
> >> If the target does not make it into the index:
> >> - first, check whether it passes URL filters and is not changed by
> >> normalizers
> >> - was it successfully fetched and parsed?
> >> - not excluded by robots=noindex?
> >>
> >> You should check the CrawlDb and the segments for this URL.
> >>
> >> If you could provide a concrete example, I'm happy to have a detailed
> >> look at it.
> >>
> >> Cheers,
> >> Sebastian
> >>
> >>
> >> On 10/28/2015 08:57 AM, Arkadi.Kosmynin@csiro.au wrote:
> >>> Hi,
> >>>
> >>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a
> >>> question mark in the subject because I work with a Nutch modification
> >>> called Arch (see http://www.atnf.csiro.au/computing/software/arch/).
> >>> This is why I am only 99% sure that the same bug would occur in the
> >>> original Nutch 1.9.
> >>>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> >>> applied), fetches the target content, parses and saves it, but loses it
> >>> at the indexing stage. This happens because the db datum is mapped
> >>> with the original URL as the key, while the fetch and parse data and
> >>> parse text are mapped with the final URL in IndexerMapReduce.
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> >>>   return;                                     // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>>
> >>> I am going to fix this for Arch, but can't offer a patch for Nutch,
> >>> sorry. This is because I am not completely sure that this is a bug in
> >>> Nutch (see above), and also because what will work for Arch may not
> >>> work for Nutch. They differ in how they use the crawl db.
> >>>
> >>> Regards,
> >>>
> >>> Arkadi
> >>>
> >>>
> >>>
> >


Re: Bug: redirected URLs lost on indexing stage?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arkadi,

> Example: use http://www.atnf.csiro.au/observers/  as seed and set depth to 1. It will be
> redirected to http://www.atnf.csiro.au/observers/index.html, fetched and parsed successfully and
> then lost. If you set depth to 2, it will get indexed.

Just to be sure we use the same terminology: What does "depth" mean?
1 number of rounds: number of generate-fetch-update cycles when running nutch,
  see command-line help of bin/crawl
2 value of property http.redirect.max
3 value of property scoring.depth.max (used by plugin scoring-depth)

If it's about #1 and if http.redirect.max == 0 (the default):
you need at least two rounds to index a redirected page.
During the first round the redirect is fetched and the
redirect target is recorded. The second round will fetch,
parse and index the redirect target.
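
As an illustration (status names as printed by readdb; whether the first
URL ends up as db_redir_perm or db_redir_temp depends on the HTTP status
code of the actual redirect):

  round 1: http://www.atnf.csiro.au/observers/            -> db_redir_perm
           http://www.atnf.csiro.au/observers/index.html  -> db_unfetched
                                                             (target recorded)
  round 2: http://www.atnf.csiro.au/observers/index.html  -> db_fetched,
                                                             parsed, indexed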

If http.redirect.max is set to a value > 0,
the fetcher will follow redirects immediately
in the current round. But there are some drawbacks,
and that's why this isn't the default:
- no deduplication if multiple pages are redirected
  to the same target, e.g., an error page.
  This means you'll spend extra network bandwidth
  to fetch the same content multiple times.
  Nutch will keep only one instance of the page anyway.
- by setting http.redirect.max to a high value you
  may get lost in round-trip redirects
- if http.redirect.max is too low, longer redirect
  chains are cut off. Nutch will not follow these
  redirects (see the sketch below).
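
To make the cut-off concrete, here is a small sketch using plain java.net
(not the Nutch fetcher; the exact semantics are my reading of the limit):
the initial URL is fetched, up to maxRedirects redirects are followed, and
a longer chain yields no final target.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectFollowSketch {

  // Returns the final (non-redirect) URL, or null if the chain is
  // longer than maxRedirects, i.e. the chain is cut off.
  static String follow(String url, int maxRedirects) throws IOException {
    for (int i = 0; i <= maxRedirects; i++) {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url).openConnection();
      conn.setInstanceFollowRedirects(false);
      int code = conn.getResponseCode();
      if (code < 300 || code >= 400) {
        return url;                      // reached a non-redirect response
      }
      String location = conn.getHeaderField("Location");
      if (location == null) {
        return url;                      // malformed redirect, stop here
      }
      url = new URL(new URL(url), location).toString(); // resolve relative
    }
    return null;                         // chain longer than the limit
  }

  public static void main(String[] args) throws IOException {
    // With maxRedirects == 0 this returns null for a redirecting seed,
    // which is why another crawl round is needed to reach the target.
    System.out.println(follow("http://www.atnf.csiro.au/observers/", 3));
  }
}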

Cheers,
Sebastian


On 11/03/2015 01:21 AM, Arkadi.Kosmynin@csiro.au wrote:
> Hi Sebastian,
> 
> Thank you for your very quick and detailed response. I've checked again and found that redirected URLs get lost if they were injected in the last iteration.
> 
> Example: use http://www.atnf.csiro.au/observers/  as seed and set depth to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html, fetched and parsed successfully and then lost. If you set depth to 2, it will get indexed.
> 
> If you use http://www.atnf.csiro.au/observers/index.html as seed, it will be fetched, parsed and indexed successfully even if you set depth to 1.
> 
>  Regards,
> Arkadi
> 
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>> Sent: Thursday, 29 October 2015 7:23 AM
>> To: user@nutch.apache.org
>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>
>> Hi Arkadi,
>>
>>> In my experience, Nutch follows redirects OK (after NUTCH-2124
>>> applied),
>>
>> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
>>
>>
>>> fetches target content, parses and saves it, but loses it at the indexing stage.
>>
>> Can you give a concrete example?
>>
>> While testing NUTCH-2124, I've verified that redirect targets get indexed.
>>
>>
>>> Therefore, when this condition is checked
>>>
>>> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
>>>   return;                                     // only have inlinks
>>> }
>>>
>>> both sets get ignored because each one is incomplete.
>>
>> This code snippet is correct: a redirect is pretty much the same as a link,
>> in that the crawler follows it. There are many differences, but the central
>> point is the same: a link itself does not get indexed, only the link
>> target. And that's the same for redirects. There are always at least 2 URLs:
>> - the source of the redirect
>> - and the target of the redirection
>> Only the latter gets indexed, after it has been fetched and provided it is
>> not a redirect itself.
>>
>> The source has no parseText and parseData, and that's why it cannot be
>> indexed.
>>
>> If the target does not make it into the index:
>> - first, check whether it passes URL filters and is not changed by normalizers
>> - was it successfully fetched and parsed?
>> - not excluded by robots=noindex?
>>
>> You should check the CrawlDb and the segments for this URL.
>>
>> If you could provide a concrete example, I'm happy to have a detailed look
>> at it.
>>
>> Cheers,
>> Sebastian
>>
>>
>> On 10/28/2015 08:57 AM, Arkadi.Kosmynin@csiro.au wrote:
>>> Hi,
>>>
>>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question
>>> mark in the subject because I work with a Nutch modification called Arch
>>> (see http://www.atnf.csiro.au/computing/software/arch/). This is why I am
>>> only 99% sure that the same bug would occur in the original Nutch 1.9.
>>>
>>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
>>> applied), fetches the target content, parses and saves it, but loses it at
>>> the indexing stage. This happens because the db datum is mapped
>>> with the original URL as the key, while the fetch and parse data and
>>> parse text are mapped with the final URL in IndexerMapReduce.
>>> Therefore, when this condition is checked
>>>
>>> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
>>>   return;                                     // only have inlinks
>>> }
>>>
>>> both sets get ignored because each one is incomplete.
>>>
>>> I am going to fix this for Arch, but can't offer a patch for Nutch, sorry.
>>> This is because I am not completely sure that this is a bug in Nutch (see
>>> above), and also because what will work for Arch may not work for Nutch.
>>> They differ in how they use the crawl db.
>>>
>>> Regards,
>>>
>>> Arkadi
>>>
>>>
>>>
> 


RE: Bug: redirected URLs lost on indexing stage?

Posted by Ar...@csiro.au.
Hi Sebastian,

Thank you for your very quick and detailed response. I've checked again and found that redirected URLs get lost if they were injected in the last iteration.

Example: use http://www.atnf.csiro.au/observers/  as seed and set depth to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html, fetched and parsed successfully and then lost. If you set depth to 2, it will get indexed.

If you use http://www.atnf.csiro.au/observers/index.html as seed, it will be fetched, parsed and indexed successfully even if you set depth to 1.

 Regards,
Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Thursday, 29 October 2015 7:23 AM
> To: user@nutch.apache.org
> Subject: Re: Bug: redirected URLs lost on indexing stage?
> 
> Hi Arkadi,
> 
> > In my experience, Nutch follows redirects OK (after NUTCH-2124
> > applied),
> 
> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> 
> 
> > fetches target content, parses and saves it, but loses it at the indexing stage.
> 
> Can you give a concrete example?
> 
> While testing NUTCH-2124, I've verified that redirect targets get indexed.
> 
> 
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> >   return;                                     // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> 
> This code snippet is correct: a redirect is pretty much the same as a link,
> in that the crawler follows it. There are many differences, but the central
> point is the same: a link itself does not get indexed, only the link
> target. And that's the same for redirects. There are always at least 2 URLs:
> - the source of the redirect
> - and the target of the redirection
> Only the latter gets indexed, after it has been fetched and provided it is
> not a redirect itself.
> 
> The source has no parseText and parseData, and that's why it cannot be
> indexed.
> 
> If the target does not make it into the index:
> - first, check whether it passes URL filters and is not changed by normalizers
> - was it successfully fetched and parsed?
> - not excluded by robots=noindex?
> 
> You should check the CrawlDb and the segments for this URL.
> 
> If you could provide a concrete example, I'm happy to have a detailed look
> at it.
> 
> Cheers,
> Sebastian
> 
> 
> On 10/28/2015 08:57 AM, Arkadi.Kosmynin@csiro.au wrote:
> > Hi,
> >
> > I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question
> > mark in the subject because I work with a Nutch modification called Arch
> > (see http://www.atnf.csiro.au/computing/software/arch/). This is why I am
> > only 99% sure that the same bug would occur in the original Nutch 1.9.
> >
> > In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> > applied), fetches the target content, parses and saves it, but loses it at
> > the indexing stage. This happens because the db datum is mapped
> > with the original URL as the key, while the fetch and parse data and
> > parse text are mapped with the final URL in IndexerMapReduce.
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> >   return;                                     // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> >
> > I am going to fix this for Arch, but can't offer a patch for Nutch, sorry.
> > This is because I am not completely sure that this is a bug in Nutch (see
> > above), and also because what will work for Arch may not work for Nutch.
> > They differ in how they use the crawl db.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> >


Re: Bug: redirected URLs lost on indexing stage?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arkadi,

> In my experience, Nutch follows redirects OK (after NUTCH-2124 applied),

Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0


> fetches target content, parses and saves it, but loses it at the indexing stage.

Can you give a concrete example?

While testing NUTCH-2124, I've verified that redirect targets get indexed.


> Therefore, when this condition is checked
>
> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
>   return;                                     // only have inlinks
> }
>
> both sets get ignored because each one is incomplete.

This code snippet is correct: a redirect is pretty much the
same as a link, in that the crawler follows it. There are many
differences, but the central point is the same: a link itself
does not get indexed, only the link target. And that's the same
for redirects. There are always at least 2 URLs:
- the source of the redirect
- and the target of the redirection
Only the latter gets indexed, after it has been fetched
and provided it is not a redirect itself.

The source has no parseText and parseData, and that's
why it cannot be indexed.

If the target does not make it into the index:
- first, check whether it passes URL filters and is not changed by normalizers
- was it successfully fetched and parsed?
- not excluded by robots=noindex?

You should check the CrawlDb and the segments for this URL.
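
If it helps, the CrawlDb can also be inspected programmatically. A rough
sketch (the crawl/crawldb path and the part file name are assumptions about
a local 1.x layout; bin/nutch readdb <crawldb> -url <url> gives you the same
information from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbPeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The CrawlDb part files are MapFiles; their "data" file is a
    // SequenceFile of (Text url, CrawlDatum) pairs.
    Path data = new Path("crawl/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      if (url.toString().contains("observers")) {
        System.out.println(url + " -> "
            + CrawlDatum.getStatusName(datum.getStatus()));
      }
    }
    reader.close();
  }
}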

If you could provide a concrete example, I'm happy to have
a detailed look at it.

Cheers,
Sebastian


On 10/28/2015 08:57 AM, Arkadi.Kosmynin@csiro.au wrote:
> Hi,
> 
> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question mark in the subject because I work with a Nutch modification called Arch (see http://www.atnf.csiro.au/computing/software/arch/). This is why I am only 99% sure that the same bug would occur in the original Nutch 1.9.
> 
> In my experience, Nutch follows redirects OK (after NUTCH-2124 is applied), fetches the target content, parses and saves it, but loses it at the indexing stage. This happens because the db datum is mapped with the original URL as the key, while the fetch and parse data and parse text are mapped with the final URL in IndexerMapReduce. Therefore, when this condition is checked
> 
> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
>   return;                                     // only have inlinks
> }
> 
> both sets get ignored because each one is incomplete.
> 
> I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. This is because I am not completely sure that this is a bug in Nutch (see above), and also because what will work for Arch may not work for Nutch. They differ in how they use the crawl db.
> 
> Regards,
> 
> Arkadi
> 
> 
>