Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2007/08/15 01:55:10 UTC

Redirects and alias handling (LONG)

Hi all,

I'm going to create a JIRA issue out of this discussion, but I think 
it's more convenient to first exchange our initial ideas here ...

Redirect handling is a difficult subject for all search engines, but the 
way it's currently done in Nutch could use some improvement. The same 
goes for handling aliases, i.e. the same site being accessible via 
slightly different non-canonical URLs (not mirrors, but the same 
site), which cannot be easily handled by URL normalizers.

A. Problem description
======================

1. "Aliases" problem
---------------------------------------
This is a case where the same content is available from the same site 
under several equivalent URLs. Example:

    http://example.com/
    http://example.org/
    http://example.net/
    http://example.com/index.html
    http://www.example.com/
    http://www.example.com/index.html

These URLs yield the same page (there are no redirects involved here). 
For a human user it's obvious that they should be treated as one page. 
Another example would be sites that use farms of servers with 
round-robin DNS (e.g. IBM), so that there may be dozens or hundreds of 
different URLs like www-120.ibm.com/software/..., 
www-306.ibm.com/software/..., etc, to which users are redirected from 
http://www.ibm.com/, and which contain exactly the same content.

Currently Nutch addresses this issue only at the deduplication stage, 
selecting the shortest URL (which may or may not be the right choice), 
i.e. in the end we get http://example.com/ as the only remaining URL in 
the searchable index. IMHO users would expect that 
http://www.example.com/ would be the remaining one ... ? Also, we get 
six different URLs, each with a different status (e.g. fetch time), in 
CrawlDb, which is not good.
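
Just to make "the right choice" more concrete: once we have 
established (e.g. via identical page signatures) that a set of URLs 
really are aliases, picking the canonical one could be a pluggable 
comparator along the lines of the sketch below. This is illustration 
only - none of these names exist in Nutch, and the preference order is 
of course debatable:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Comparator;

    // Orders candidate alias URLs so that the first one in sort order
    // is picked as canonical: prefer a "www." host over a bare domain,
    // then a bare "/" path over /index.html & co., then the shorter URL.
    public class CanonicalUrlComparator implements Comparator<String> {
      public int compare(String a, String b) {
        try {
          URL ua = new URL(a), ub = new URL(b);
          int cmp = hostRank(ub.getHost()) - hostRank(ua.getHost());
          if (cmp != 0) return cmp;
          cmp = pathRank(ua.getPath()) - pathRank(ub.getPath());
          if (cmp != 0) return cmp;
          return a.length() - b.length();
        } catch (MalformedURLException e) {
          return a.compareTo(b);
        }
      }
      private int hostRank(String host) {
        return host.startsWith("www.") ? 1 : 0;   // 1 = preferred
      }
      private int pathRank(String path) {
        return (path.length() == 0 || "/".equals(path)) ? 0 : 1;  // 0 = preferred
      }
    }

The canonical URL would then simply be 
Collections.min(aliases, new CanonicalUrlComparator()).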

Unfortunately, we cannot blindly assume that www.example.com and 
example.com are equivalent aliases - a landmark example of this is 
http://internic.net/ versus http://www.internic.net/, which give two 
different pages.

Probably this dilemma can be resolved by analyzing the close webgraph 
neighbourhood of the duplicate pages. Google improves its results with 
manual intervention by site owners:

http://www.google.com/support/webmasters/bin/answer.py?answer=44232

This addresses only the www.example.com versus example.com issue; 
apparently the issue of / vs. /index.html vs. /index.htm vs. 
/default.asp is handled through some other means.

Finally, a few interesting queries to run that show how Google treats 
such aliases:

http://www.google.com/search?q=site:example.com&hl=en&filter=0
http://www.google.com/search?q=site:www.example.com&hl=en&filter=0

http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org
(especially interesting is the result under this URL: 
http://lists.w3.org/Archives/Public/w3c-dist-auth/1999OctDec/0180.html)
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org%2Findex.html
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org%2Findex.html
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org

2. Redirected pages
-------------------

First, the standard defines pretty clearly the meaning of 301 versus 302 
redirects, on the HTTP protocol level:

http://www.ietf.org/rfc/rfc2616.txt

This is reflected in the common practice of major search engines:

http://www.google.com/support/webmasters/bin/answer.py?answer=40151
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html

JavaScript redirects are treated as a separate case, e.g. Yahoo handles 
them differently depending on the delay between redirects. One scenario 
not described there is how to treat cross-protocol redirects to the 
same domain or the same URL path. Example: http://www.example.com/secure -> 
https://www.example.com/

Recent versions of Nutch introduced specific status codes for pages 
redirected permanently or temporarily, and target URLs are stored in 
CrawlDatum.metadata. However, this information is not used anywhere at 
the moment.

I propose to make the necessary modifications to follow the algorithm 
described on Yahoo's pages referenced above. Note that when that page 
says "Yahoo indexes the 'source' URL" it really means that it processes 
the content of the target page, but puts it under the source URL.
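
In code, the per-hop essence of that policy is roughly the following - 
a sketch only, assuming we already know the redirect type; none of 
these names exist in Nutch today:

    // Decides under which URL the content fetched from the redirect
    // target should be indexed, per the 301/302 semantics above.
    public class RedirectPolicy {

      public enum Kind { PERMANENT, TEMPORARY }   // 301 vs. 302

      public static String indexUrl(String source, String target, Kind kind) {
        switch (kind) {
          case PERMANENT:
            // 301: the source is obsolete; content lives under the target.
            return target;
          case TEMPORARY:
          default:
            // 302: "index the source URL", i.e. process the content of
            // the target page but file it under the source URL.
            return source;
        }
      }
    }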

And this brings another interesting topic ...

3. Link and anchor information for aliases and redirects.
---------------------------------------------------------
This issue has been briefly discussed in NUTCH-353. Inlink information 
should be "merged" so that all link information from all "aliases" is 
aggregated and assigned to a selected canonical target URL.

See also above sample queries from Google.


B. Design and implementation
============================

In order to select the correct "canonical" URL at each stage in 
redirection handling we should keep the accumulated "redirection path", 
which includes source URLs and redirection methods (temporary/permanent, 
protocol or content-level redirect, redirect delay). This way, when we 
arrive at the final page in the redirection path, we should be able to 
select the canonical URL.
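
As a strawman, one hop of such a path could be a small Writable along 
these lines (all field names invented; whether the list of hops lives 
in CrawlDatum.metadata or elsewhere is an open question):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // One hop of the accumulated "redirection path"; the full path
    // would be a list of these, carried along with the page.
    public class RedirectHop implements Writable {
      public Text sourceUrl = new Text();
      public boolean permanent;      // permanent (301) vs. temporary (302)
      public boolean protocolLevel;  // HTTP-level vs. content-level redirect
      public int delaySeconds;       // redirect delay, for Yahoo-style rules

      public void write(DataOutput out) throws IOException {
        sourceUrl.write(out);
        out.writeBoolean(permanent);
        out.writeBoolean(protocolLevel);
        out.writeInt(delaySeconds);
      }

      public void readFields(DataInput in) throws IOException {
        sourceUrl.readFields(in);
        permanent = in.readBoolean();
        protocolLevel = in.readBoolean();
        delaySeconds = in.readInt();
      }
    }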

We should also specify which intermediate URL we accept as the current 
"canonical" URL in case we haven't yet reached the end of redirections 
(e.g. when we don't follow redirects immediately, but only record them 
to be used in the next cycle).

We should introduce an "alias" status in CrawlDb and LinkDb, which 
indicates that a given URL is a non-canonical alias of another URL. In 
CrawlDb, we should copy all accumulated metadata and put it into the 
target canonical CrawlDatum. In LinkDb, we should merge all inlinks 
pointing to non-canonical URLs so that they are assigned to the 
canonical URL. In both cases we should still keep the non-canonical URLs 
in CrawlDb and LinkDb - however we could decide not to keep any of the 
metadata / inlinks there, just an "alias" flag and a pointer to the 
canonical URL where all aggregated data is stored. CrawlDb and 
LinkDbReader may or may not hide this fact from their users - I think it 
would be more efficient if users of this API got the final 
aggregated data right away, perhaps with an indicator that it was 
obtained using a non-canonical URL ...
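
A toy illustration of the LinkDb part, with plain collections standing 
in for the real MapReduce job and the Inlinks types:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Re-assigns inlinks recorded under alias URLs to the canonical
    // URL. The alias entries themselves (which would keep just a flag
    // and a pointer in the real design) are omitted for brevity.
    public class AliasInlinkMerge {
      public static Map<String, Set<String>> merge(
          Map<String, Set<String>> inlinks,        // url -> inlink sources
          Map<String, String> aliasToCanonical) {  // alias -> canonical
        Map<String, Set<String>> merged = new HashMap<String, Set<String>>();
        for (Map.Entry<String, Set<String>> e : inlinks.entrySet()) {
          String canonical = aliasToCanonical.get(e.getKey());
          String key = (canonical != null) ? canonical : e.getKey();
          Set<String> set = merged.get(key);
          if (set == null) {
            set = new HashSet<String>();
            merged.put(key, set);
          }
          set.addAll(e.getValue());
        }
        return merged;
      }
    }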

Regarding Lucene indexes - we could either duplicate all data for each 
non-canonical URL, i.e. create as many full-blown Lucene documents as 
there are aliases, or we could create special "redirect" documents 
that would point to a URL which contains the full data ...
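
The second option could be as cheap as a two-field stub document per 
alias, e.g. (a sketch - field names are made up, and the real url 
field would of course go through the usual analysis rather than being 
UN_TOKENIZED):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // A stub indexed for each non-canonical URL: it carries no content,
    // only a stored pointer to the canonical document with the data.
    public class RedirectDoc {
      public static Document stub(String aliasUrl, String canonicalUrl) {
        Document doc = new Document();
        doc.add(new Field("url", aliasUrl,
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("canonical", canonicalUrl,
            Field.Store.YES, Field.Index.NO));
        return doc;
      }
    }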


That's it for now ... Any comments or suggestions to the above are welcome!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Redirects and alias handling (LONG)

Posted by Ken Krugler <kk...@transpac.com>.
>Ken Krugler wrote:
>
>>>>common case. Thus it could be somewhat computationally expensive 
>>>>(e.g. a winnowing ala 
>>>>http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
>>>
>>>Interesting paper, thanks for the pointer - I always wondered what 
>>>criteria to use to reduce the number of shingles, and this 
>>>winnowing is a simple enough recipe for creating page signatures. 
>>>I may be tempted to implement it ;)
>>
>>I took a quick scan through the public code and didn't find 
>>anything that looked appropriate for this. One more potentially 
>>useful paper is here:
>>
>>http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
>
>This URL looks similar to the one you mentioned before ... probably 
>a case of near-duplicate *chuckle* ...

Sorry about that - I can't really claim I was checking your manual 
dedup support. The real URL is:

http://www1.cs.columbia.edu/~cs6998/final_reports/ca2269-report.pdf

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Redirects and alias handling (LONG)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ken Krugler wrote:

>>> common case. Thus it could be somewhat computationally expensive 
>>> (e.g. a winnowing ala 
>>> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
>>
>> Interesting paper, thanks for the pointer - I always wondered what 
>> criteria to use to reduce the number of shingles, and this winnowing 
>> is a simple enough recipe for creating page signatures. I may be 
>> tempted to implement it ;)
> 
> I took a quick scan through the public code and didn't find anything 
> that looked appropriate for this. One more potentially useful paper is 
> here:
> 
> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

This URL looks similar to the one you mentioned before ... probably a 
case of near-duplicate *chuckle* ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Redirects and alias handling (LONG)

Posted by Ken Krugler <kk...@transpac.com>.
Hi Andrzej,

>>And even with deduping, we run into problems, especially for top-level pages.
>>
>>These often change slightly between crawls, so if 
>>http://example.com is found during one pass, and a different 
>>http://www.example.com is found at a later crawl, you wind up with 
>>two hits for a result. What's worse is that typically the summary 
>>is exactly the same (from the body of the page), so to a user it's 
>>painfully obvious that there are (near) duplicates in the index.
>>
>>To solve this, I think a near duplicate detector would need to be 
>>used when collapsing similar URLs. If you did this only when two 
>>URLs appear to be the same, I think it would be OK, as that's the 
>>most common case. Thus it could be somewhat computationally 
>>expensive (e.g. a winnowing ala 
>>http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
>
>Interesting paper, thanks for the pointer - I always wondered what 
>criteria to use to reduce the number of shingles, and this winnowing 
>is a simple enough recipe for creating page signatures. I may be 
>tempted to implement it ;)

I took a quick scan through the public code and didn't find anything 
that looked appropriate for this. One more potentially useful paper 
is here:

http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

>There is a Signature implementation in Nutch that allows for small 
>differences in text (TextProfileSignature), but I guess it's not 
>sufficient in your case?

I thought we were using that, but I just double-checked and we're 
not. So I'll try to switch over to that for the next crawl/index, to 
see how well it works.

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Redirects and alias handling (LONG)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ken Krugler wrote:

[..]

> And even with deduping, we run into problems, especially for top-level 
> pages.
> 
> These often change slightly between crawls, so if http://example.com is 
> found during one pass, and a different http://www.example.com is found 
> at a later crawl, you wind up with two hits for a result. What's worse 
> is that typically the summary is exactly the same (from the body of the 
> page), so to a user it's painfully obvious that there are (near) 
> duplicates in the index.
> 
> To solve this, I think a near duplicate detector would need to be used 
> when collapsing similar URLs. If you did this only when two URLs appear 
> to be the same, I think it would be OK, as that's the most common case. 
> Thus it could be somewhat computationally expensive (e.g. a winnowing 
> ala http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).

Interesting paper, thanks for the pointer - I always wondered what 
criteria to use to reduce the number of shingles, and this winnowing is 
a simple enough recipe for creating page signatures. I may be tempted to 
implement it ;)
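
For the archives, the core of the algorithm is pleasantly small - 
something like this naive sketch (a real implementation would use a 
rolling hash instead of substring()/hashCode(), and would record 
positions as well):

    import java.util.ArrayList;
    import java.util.List;

    // Winnowing as described in the paper: hash all k-grams, slide a
    // window of w hashes over them, and keep the minimum hash of each
    // window (the rightmost one on ties). The kept hashes form the
    // page signature.
    public class Winnow {
      public static List<Integer> fingerprint(String text, int k, int w) {
        List<Integer> selected = new ArrayList<Integer>();
        int n = text.length() - k + 1;   // number of k-grams
        if (n < w) return selected;
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++)
          hashes[i] = text.substring(i, i + k).hashCode();
        int last = -1;   // index of the most recently selected hash
        for (int i = 0; i + w <= n; i++) {
          int min = i;
          for (int j = i + 1; j < i + w; j++)
            if (hashes[j] <= hashes[min]) min = j;   // rightmost minimum
          if (min != last) {
            selected.add(hashes[min]);
            last = min;
          }
        }
        return selected;
      }
    }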

There is a Signature implementation in Nutch that allows for small 
differences in text (TextProfileSignature), but I guess it's not 
sufficient in your case?
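
If you want to try it: unless I'm misremembering the property name 
from nutch-default.xml, switching implementations is a one-property 
override in nutch-site.xml:

    <property>
      <name>db.signature.class</name>
      <value>org.apache.nutch.crawl.TextProfileSignature</value>
    </property>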


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Redirects and alias handling (LONG)

Posted by Ken Krugler <kk...@transpac.com>.
Hi Andrzej,

Thanks for writing this up!

One small comment below...

>I'm going to create a JIRA issue out of this discussion, but I think 
>it's more convenient to first exchange our initial ideas here ...

[snip]

>1. "Aliases" problem
>---------------------------------------
>This is a case where the same content is available from the same 
>site under several equivalent URLs. Example:
>
>    http://example.com/
>    http://example.org/
>    http://example.net/
>    http://example.com/index.html
>    http://www.example.com/
>    http://www.example.com/index.html
>
>These URLs yield the same page (there are no redirects involved 
>here). For a human user it's obvious that they should be treated as 
>one page. Another example would be sites that use farms of servers 
>with round-robin DNS (e.g. IBM), so that there may be dozens or 
>hundreds different URLs like www-120.ibm.com/software/..., 
>www-306.ibm.com/software/..., etc, to which users are redirected 
>from http://www.ibm.com/, and which contain exactly the same content.
>
>Currently Nutch addresses this issue only at the deduplication 
>stage, selecting the shortest URL (which may or may not be the right 
>choice), i.e. in the end we get http://example.com/ as the only 
>remaining URL in the searchable index. IMHO users would expect that 
>http://www.example.com/ would be the remaining one ... ? Also, we 
>get 4 different URLs with 4 different statuses (e.g. fetch times) in 
>CrawlDb, which is not good.

And even with deduping, we run into problems, especially for top-level pages.

These often change slightly between crawls, so if http://example.com 
is found during one pass, and a different http://www.example.com is 
found at a later crawl, you wind up with two hits for a result. 
What's worse is that typically the summary is exactly the same (from 
the body of the page), so to a user it's painfully obvious that there 
are (near) duplicates in the index.

To solve this, I think a near duplicate detector would need to be 
used when collapsing similar URLs. If you did this only when two URLs 
appear to be the same, I think it would be OK, as that's the most 
common case. Thus it could be somewhat computationally expensive 
(e.g. a winnowing ala 
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Redirects and alias handling (LONG)

Posted by Doğacan Güney <do...@gmail.com>.
On 8/21/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> Doğacan Güney wrote:
>
> >> Hmm. The index should somehow contain _all_ urls, which point to the
> >> same document. I.e. when you search for url "http://example.com" it
> >> should ideally return exactly the same Lucene document as when you
> >> search for "http://www.example.com/index.html".
> >
> > Why would you do a search with the full name of the url? I also don't
> > understand why we need to have all urls in index (we already eliminate
> > near-duplicates with dedup).  I guess I am missing your use case
> > here...
>
> Let's say I'm searching for "test" and I want to limit the search to a
> particular url. I enter a query:
>
>         test url:example.com
>
> It should yield the same results as for the following query:
>
>         test url:www.example.com
>
> (assuming they are "aliases").


I guess we can do something like this (continuing from my example
above): index D's data under B, then add an "alias" field to the Lucene
document containing A, C and D. Then change query-url so that a "url:"
query also searches the alias field.
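
On the indexing side it would be something like the sketch below 
(plain Lucene, leaving out the IndexingFilter plumbing; the field 
names, and the use of UN_TOKENIZED instead of the analyzed url field 
Nutch actually has, are just for illustration):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // D's content is indexed once under the canonical url B; every
    // alias (A, C, D) goes into an extra, unstored "alias" field on the
    // same document, so a "url:" query can match either field.
    public class AliasFieldIndexing {
      public static Document docFor(String canonicalUrl, String[] aliases) {
        Document doc = new Document();
        doc.add(new Field("url", canonicalUrl,
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        for (String alias : aliases) {
          doc.add(new Field("alias", alias,
              Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
        return doc;
      }
    }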

>
> Another, more realistic example: I'm searching for IBM products. So I
> enter a query:
>
>         products site:ibm.com
>
> This should yield the same results as any of the following:
>
>         products site:www.ibm.com
>         products site:www-128.ibm.com
>         products site:www-304.ibm.com
>

Thanks for the explanation.

How do we know that the www.ibm.com and www-128.ibm.com hosts are
perfect mirrors of one another? All we can know is that the
http://www.ibm.com/ and http://www-128.ibm.com/ *urls* are aliases of
one another, and that the urls we have fetched *so far* seem to mirror
each other. It is possible that the next URL we fetch from one of those
sites does not exist on the other. I don't think we can ever be certain
that they are perfect mirrors of each other, so, IMHO, we shouldn't
treat those queries as the same. Google also doesn't return the same
results for "products site:www.ibm.com" and "products
site:www-128.ibm.com".

(One small unrelated note: as discussed in NUTCH-439 and NUTCH-445, we
should treat site:ibm.com as matching all hosts under the domain
ibm.com, even if http://www.ibm.com/ and http://ibm.com/ are perfect
mirrors of each other.)


> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney

Re: Redirects and alias handling (LONG)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doğacan Güney wrote:

>> Hmm. The index should somehow contain _all_ urls, which point to the
>> same document. I.e. when you search for url "http://example.com" it
>> should ideally return exactly the same Lucene document as when you
>> search for "http://www.example.com/index.html".
> 
> Why would you do a search with the full name of the url? I also don't
> understand why we need to have all urls in index (we already eliminate
> near-duplicates with dedup).  I guess I am missing your use case
> here...

Let's say I'm searching for "test" and I want to limit the search to a 
particular url. I enter a query:

	test url:example.com

It should yield the same results as for the following query:

	test url:www.example.com

(assuming they are "aliases").

Another, more realistic example: I'm searching for IBM products. So I 
enter a query:

	products site:ibm.com

This should yield the same results as any of the following:

	products site:www.ibm.com
	products site:www-128.ibm.com
	products site:www-304.ibm.com

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Redirects and alias handling (LONG)

Posted by Doğacan Güney <do...@gmail.com>.
On 8/21/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> Doğacan Güney wrote:
>
> > If the same content is available under multiple urls, I think it makes
> > sense to assume that the url with the highest score should be 'the
> > representative'  url.
>
> Not necessarily - it depends how you defined your score.
> http://www.ibm.com/ may actually have a low score, because it
> immediately redirects to http://www.ibm.com/index.html (actually, it
> redirects to http://www.ibm.com/us/index.html).
>
> Also, "the shortest url wins" rule is not always true. Let's say I own a
> domain a.biz, and I made a Wikipedia mirror there. Which of the pages is
> more representative: http://a.biz/About_Wikipedia or
> http://www.wikipedia.org/en/About_Wikipedia ?
>
>
> >> 3. Link and anchor information for aliases and redirects.
> >> ---------------------------------------------------------
> >> This issue has been briefly discussed in NUTCH-353. Inlink information
> >> should be "merged" so that all link information from all "aliases" is
> >> aggregated, so that it points to a selected canonical target URL.
> >
> > We should also merge their score. If example.com (with score 4.0) is
> > an alias for www.example.com (with score 8.0), the selected url (which
> > I think, as I said before, should be www.example.com)  should end up
> > with the score 12.0. We may not want to do this for aliases in
> > different domains but I think we should definitely do this if two urls
> > with the same content are under the same domain (like example.com).
>
> I think you are right - at least with the OPIC scoring it would work ok.
>
>
> >>
> >> Regarding Lucene indexes - we could either duplicate all data for each
> >> non-canonical URL, i.e. create as many full-blown Lucene documents as
> >> many there are aliases, or we could create special "redirect" documents
> >> that would point to a URL which contains the full data ...
> >
> > We can avoid doing both. Let's assume A redirects to B, C also
> > redirects to B and B redirects to D. After the fetch/parse/updatedb
> > cycle that processes D we would probably have enough data to choose
> > the 'canonical url' (let's assume that canonical is B). Then during
> > Indexer's reduce we can just index parse text and parse data (and
> > whatever else) of D under url B since we won't index B (or A or C) as
> > itself (it doesn't have any useful content after all).
>
> Hmm. The index should somehow contain _all_ urls, which point to the
> same document. I.e. when you search for url "http://example.com" it
> should ideally return exactly the same Lucene document as when you
> search for "http://www.example.com/index.html".

Why would you do a search with the full name of the url? I also don't
understand why we need to have all urls in the index (we already
eliminate near-duplicates with dedup). I guess I am missing your use
case here...

>
> Similarly, the inlink information for all "aliased" urls should be the
> same (but in our case it's not a Lucene issue, only the LinkDb aliasing
> issue).

I agree with you here.

>
>
> >
> >>
> >> That's it for now ... Any comments or suggestions to the above are welcome!
> >
> > Andrzej, have you written any code? I would suggest that we open a
> > JIRA and have some code (no matter how much half-baked it is) as soon
> > as we can.
>
> Not yet - I'll open the issue and put these initial thoughts there.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney

Re: Redirects and alias handling (LONG)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doğacan Güney wrote:

> If the same content is available under multiple urls, I think it makes
> sense to assume that the url with the highest score should be 'the
> representative'  url.

Not necessarily - it depends on how you define your score. 
http://www.ibm.com/ may actually have a low score, because it 
immediately redirects to http://www.ibm.com/index.html (actually, it 
redirects to http://www.ibm.com/us/index.html).

Also, "the shortest url wins" rule is not always true. Let's say I own a 
domain a.biz, and I made a Wikipedia mirror there. Which of the pages is 
more representative: http://a.biz/About_Wikipedia or 
http://www.wikipedia.org/en/About_Wikipedia ?


>> 3. Link and anchor information for aliases and redirects.
>> ---------------------------------------------------------
>> This issue has been briefly discussed in NUTCH-353. Inlink information
>> should be "merged" so that all link information from all "aliases" is
>> aggregated, so that it points to a selected canonical target URL.
> 
> We should also merge their score. If example.com (with score 4.0) is
> an alias for www.example.com (with score 8.0), the selected url (which
> I think, as I said before, should be www.example.com)  should end up
> with the score 12.0. We may not want to do this for aliases in
> different domains but I think we should definitely do this if two urls
> with the same content are under the same domain (like example.com).

I think you are right - at least with the OPIC scoring it would work ok.


>>
>> Regarding Lucene indexes - we could either duplicate all data for each
>> non-canonical URL, i.e. create as many full-blown Lucene documents as
>> many there are aliases, or we could create special "redirect" documents
>> that would point to a URL which contains the full data ...
> 
> We can avoid doing both. Let's assume A redirects to B, C also
> redirects to B and B redirects to D. After the fetch/parse/updatedb
> cycle that processes D we would probably have enough data to choose
> the 'canonical url' (let's assume that canonical is B). Then during
> Indexer's reduce we can just index parse text and parse data (and
> whatever else) of D under url B since we won't index B (or A or C) as
> itself (it doesn't have any useful content after all).

Hmm. The index should somehow contain _all_ urls that point to the 
same document. I.e. when you search for url "http://example.com" it 
should ideally return exactly the same Lucene document as when you 
search for "http://www.example.com/index.html".

Similarly, the inlink information for all "aliased" urls should be the 
same (but in our case it's not a Lucene issue, only the LinkDb aliasing 
issue).


> 
>>
>> That's it for now ... Any comments or suggestions to the above are welcome!
> 
> Andrzej, have you written any code? I would suggest that we open a
> JIRA and have some code (no matter how much half-baked it is) as soon
> as we can.

Not yet - I'll open the issue and put these initial thoughts there.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Redirects and alias handling (LONG)

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

Thanks for bringing this up, Andrzej. There are some excellent
pointers/remarks here.

On 8/15/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> Hi all,
>
> I'm going to create a JIRA issue out of this discussion, but I think
> it's more convenient to first exchange our initial ideas here ...
>
> Redirect handling is a difficult subject for all search engines, but the
> way it's currently done in Nutch could use some improvement. The same
> goes for handling aliases, i.e. the same sites that are accessible via a
> slightly different non-canonical URLs (i.e. they are not mirrors but the
> same sites), which cannot be easily handled by url normalizers.
>
> A. Problem description
> ======================
>
> 1. "Aliases" problem
> ---------------------------------------
> This is a case where the same content is available from the same site
> under several equivalent URLs. Example:
>
>     http://example.com/
>     http://example.org/
>     http://example.net/
>     http://example.com/index.html
>     http://www.example.com/
>     http://www.example.com/index.html
>
> These URLs yield the same page (there are no redirects involved here).
> For a human user it's obvious that they should be treated as one page.
> Another example would be sites that use farms of servers with
> round-robin DNS (e.g. IBM), so that there may be dozens or hundreds
> different URLs like www-120.ibm.com/software/...,
> www-306.ibm.com/software/..., etc, to which users are redirected from
> http://www.ibm.com/, and which contain exactly the same content.
>
> Currently Nutch addresses this issue only at the deduplication stage,
> selecting the shortest URL (which may or may not be the right choice),

A small addition: Nutch can also select the page with the highest score.

> i.e. in the end we get http://example.com/ as the only remaining URL in
> the searchable index. IMHO users would expect that
> http://www.example.com/ would be the remaining one ... ? Also, we get 4
> different URLs with 4 different statuses (e.g. fetch times) in CrawlDb,
> which is not good.
>
> Unfortunately, we cannot blindly assume that www.example.com and
> example.com are equivalent aliases - a landmark example of this is
> http://internic.net/ versus http://www.internic.net/, which give two
> different pages.
>
> Probably this dilemma can be resolved by doing a graph analysis of a
> close webgraph neighbourhood of duplicate pages. Google improves its
> results with manual intervention of site owners:
>
> http://www.google.com/support/webmasters/bin/answer.py?answer=44232
>
> This addresses only the www.example.com versus example.com issue,
> apparently the issue of / vs. /index.html vs. /index.htm vs.
> /default.asp is handled through some other means.
>
> Finally, a few interesting queries to run that show how Google treats
> such aliases:
>
> http://www.google.com/search?q=site:example.com&hl=en&filter=0
> http://www.google.com/search?q=site:www.example.com&hl=en&filter=0
>
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org
> (especially interesting is the result under this URL:
> http://lists.w3.org/Archives/Public/w3c-dist-auth/1999OctDec/0180.html)
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org%2Findex.html
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org%2Findex.html
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org
>

If the same content is available under multiple urls, I think it makes
sense to assume that the url with the highest score should be 'the
representative' url.

> 2. Redirected pages
> -------------------
>
> First, the standard defines pretty clearly the meaning of 301 versus 302
> redirects, on the HTTP protocol level:
>
> http://www.ietf.org/rfc/rfc2616.txt
>
> Which is reflected in the common practice of major search engines:
>
> http://www.google.com/support/webmasters/bin/answer.py?answer=40151
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html
>
> Javascript redirects are treated differently, e.g. Yahoo treats them
> differently depending on the time between redirects. One scenario not
> described there is how to treat cross-protocol redirects to the same
> domain or the same url path. Example: http://www.example.com/secure ->
> https://www.example.com/
>
> Recent versions of Nutch introduced specific status codes for pages
> redirected permanently or temporarily, and target URL-s are stored in
> CrawlDatum.metadata. However, this information is not used anywhere at
> the moment.
>
> I propose to make necessary modifications to follow the algorithm
> described on Yahoo-s pages referenced above. Note that when that page
> says "Yahoo indexes the 'source' URL" it really means that it processes
> the content of the target page, but puts it under the source URL.

+1. Yahoo's algorithm looks very solid.
>
> And this brings another interesting topic ...
>
> 3. Link and anchor information for aliases and redirects.
> ---------------------------------------------------------
> This issue has been briefly discussed in NUTCH-353. Inlink information
> should be "merged" so that all link information from all "aliases" is
> aggregated, so that it points to a selected canonical target URL.

We should also merge their scores. If example.com (with score 4.0) is
an alias for www.example.com (with score 8.0), the selected url (which,
as I said before, I think should be www.example.com) should end up with
a score of 12.0. We may not want to do this for aliases in different
domains, but I think we should definitely do it if two urls with the
same content are under the same domain (like example.com).
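
In code the rule would be something like this toy sketch (all names 
made up, and the same-domain check is intentionally naive):

    // Merges an alias' score into the canonical url's score, but only
    // when both urls are under the same domain.
    public class AliasScore {
      public static float merge(String canonicalUrl, float canonicalScore,
                                String aliasUrl, float aliasScore) {
        if (sameDomain(canonicalUrl, aliasUrl)) {
          return canonicalScore + aliasScore;   // e.g. 8.0 + 4.0 = 12.0
        }
        return canonicalScore;
      }
      private static boolean sameDomain(String a, String b) {
        return domain(a).equals(domain(b));
      }
      private static String domain(String url) {
        String host = java.net.URI.create(url).getHost();
        int dot = host.indexOf('.');
        // naive: strip one leading label (www.example.com -> example.com)
        if (dot >= 0 && host.indexOf('.', dot + 1) >= 0)
          return host.substring(dot + 1);
        return host;
      }
    }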

>
> See also above sample queries from Google.
>
>
> B. Design and implementation
> ============================
>
> In order to select the correct "canonical" URL at each stage in
> redirection handling we should keep the accumulated "redirection path",
> which includes source URLs and redirection methods (temporary/permanent,
> protocol or content-level redirect, redirect delay). This way, when we
> arrive a the final page in the redirection path, we should be able to
> select the canonical path.
>
> We should also specify which intermediate URL we accept as the current
> "canonical" URL in case we haven't yet reached the end of redirections
> (e.g. when we don't follow redirects immediately, but only record them
> to be used in the next cycle).
>
> We should introduce an "alias" status in CrawlDb and LinkDb, which
> indicates that a given URL is a non-canonical alias of another URL. In
> CrawlDb, we should copy all accumulated metadata and put it into the
> target canonical CrawlDatum. In LinkDb, we should merge all inlinks
> pointing to non-canonical URLs so that they are assigned to the
> canonical URL. In both cases we should still keep the non-canonical URLs
> in CrawlDb and LinkDb - however we could decide not to keep any of the
> metadata / inlinks there, just an "alias" flag and a pointer to the
> canonical URL where all aggregated data is stored. CrawlDb and
> LinkDbReader may or may not hide this fact from their users - I think it
> would be more efficient if users of this API would get the final
> aggregated data right away, perhaps with an indicator that it was
> obtained using a non-canonical URL ...
>
> Regarding Lucene indexes - we could either duplicate all data for each
> non-canonical URL, i.e. create as many full-blown Lucene documents as
> many there are aliases, or we could create special "redirect" documents
> that would point to a URL which contains the full data ...

We can avoid both. Let's assume A redirects to B, C also redirects to
B, and B redirects to D. After the fetch/parse/updatedb cycle that
processes D we would probably have enough data to choose the 'canonical
url' (let's assume the canonical is B). Then during Indexer's reduce we
can just index the parse text and parse data (and whatever else) of D
under url B, since we won't index B (or A or C) as itself (it doesn't
have any useful content after all).

>
>
> That's it for now ... Any comments or suggestions to the above are welcome!

Andrzej, have you written any code? I would suggest that we open a
JIRA issue and have some code (no matter how half-baked it is) as soon
as we can.

>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


-- 
Doğacan Güney