Posted to user@nutch.apache.org by Nutch User - 1 <nu...@gmail.com> on 2011/09/18 13:29:45 UTC

Effects of redirections on LinkDB and LinkRank

Hi.

I created a simple graph and crawled it with Nutch.

The graph consists of three HTML files:

A.html
B/index.html
C.html

The link structure is as follows:

A.html -> B

Apache's HTTP server redirects B to B/index.html.

B/index.html -> C.html

The LinkDB dump:

"
http://localhost:8080/B    Inlinks:
  fromUrl: http://localhost:8080/A.html anchor: B

http://localhost:8080/C.html    Inlinks:
  fromUrl: http://localhost:8080/B/ anchor: C
"

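Clearly the crawled graph isn't connected: no entry links
http://localhost:8080/B to http://localhost:8080/B/, so the redirect is
missing as an edge.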
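
A minimal sketch that checks this from the two LinkDB entries above. It is plain Python, not Nutch code; the helper name weak_components and the extra redirect edge are assumptions made only for illustration.

```python
from collections import defaultdict

def weak_components(edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

BASE = "http://localhost:8080"

# Edges exactly as they appear in the LinkDB dump (fromUrl -> target).
linkdb_edges = [
    (BASE + "/A.html", BASE + "/B"),
    (BASE + "/B/", BASE + "/C.html"),
]

# Assumption: the redirect B -> B/ modelled as an ordinary edge.
redirect_edge = (BASE + "/B", BASE + "/B/")

crawled = weak_components(linkdb_edges)
with_redirect = weak_components(linkdb_edges + [redirect_edge])

print(len(crawled), len(with_redirect))  # prints: 2 1
```

The dump alone yields two separate components, {A.html, B} and {B/, C.html}. Adding the redirect as an edge joins them into one.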

The WebGraphDB dump after LinkRank was run:

"
http://localhost:8080/B    0.670625
http://localhost:8080/C.html    0.670625
http://localhost:8080/A.html    0.36249998
http://localhost:8080/B/    0.36249998
"

In my opinion this order doesn't make sense given the link
structure A -> B -> B/index.html -> C. C should have the highest
LinkRank score. The expected order is:

C.html
B/index.html
B
A.html.
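
To illustrate the expected order, here is a rough sketch using a simplified PageRank-style iteration. It is not Nutch's LinkRank implementation, and the damping factor, iteration count and the function name rank are assumptions, so the absolute scores will not match the dump above. Only the relative order is of interest.

```python
def rank(edges, damping=0.85, iterations=20):
    """Simplified PageRank-style scoring (an assumption, not Nutch's LinkRank)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    inlinks = {node: [] for node in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        inlinks[dst].append(src)

    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Synchronous update: every new score is computed from the old scores.
        scores = {
            node: (1 - damping)
            + damping * sum(scores[s] / out_degree[s] for s in inlinks[node])
            for node in nodes
        }
    return scores

# Graph as recorded by the crawl: the redirect B -> B/ is missing.
crawled = rank([("A.html", "B"), ("B/", "C.html")])

# Graph with the redirect modelled as an edge (hypothetical fix).
expected = rank([("A.html", "B"), ("B", "B/"), ("B/", "C.html")])

order = sorted(expected, key=expected.get, reverse=True)
print(order)  # prints: ['C.html', 'B/', 'B', 'A.html']
```

On the crawled graph this toy model shows the same tie pattern as the dump: B and C.html share one score, and A.html and B/ share a lower one. With the redirect edge present, the scores are strictly ordered as listed above.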

If you consider this a bug, I don't have time to fix it; I just
decided to report it.

Best Regards,
Nutch User - 1 (nutch.user.1@gmail.com)