Posted to user@nutch.apache.org by Nutch User - 1 <nu...@gmail.com> on 2011/09/18 13:29:45 UTC
Effects of redirections on LinkDB and LinkRank
Hi.
I created a simple graph and crawled it with Nutch.
The graph consists of three HTML files:
A.html
B/index.html
C.html
The link structure is as follows:
A.html -> B
Apache's HTTP server redirects B to B/index.html.
B/index.html -> C.html
The LinkDB dump:
"
http://localhost:8080/B Inlinks:
fromUrl: http://localhost:8080/A.html anchor: B
http://localhost:8080/C.html Inlinks:
fromUrl: http://localhost:8080/B/ anchor: C
"
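Clearly the crawled graph isn't connected: the inlink to C.html is
recorded as coming from http://localhost:8080/B/, while A.html links to
http://localhost:8080/B, and nothing in the LinkDB ties B to B/.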
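To make the disconnection concrete, here is a small sketch in plain Python (not Nutch code) that groups the URLs from the LinkDB dump above into weakly connected components. The `aliases` argument is a hypothetical redirect map I made up for illustration; mapping B to B/ is an assumption based on the URLs that appear in the dump.

```python
from collections import defaultdict

# Edges copied from the LinkDB dump above (fromUrl -> target).
EDGES = [
    ("http://localhost:8080/A.html", "http://localhost:8080/B"),
    ("http://localhost:8080/B/", "http://localhost:8080/C.html"),
]


def components(edges, aliases=None):
    """Return the weakly connected components of the link graph.

    `aliases` is a hypothetical redirect map (source URL -> final URL)
    applied to both ends of every edge before grouping.
    """
    aliases = aliases or {}
    neighbours = defaultdict(set)
    for src, dst in edges:
        src, dst = aliases.get(src, src), aliases.get(dst, dst)
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk over undirected neighbours.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        seen |= group
        result.append(group)
    return result


as_crawled = components(EDGES)
# Assumption: treat B and B/ as the same page because of the redirect.
with_redirect = components(
    EDGES, {"http://localhost:8080/B": "http://localhost:8080/B/"}
)
print(len(as_crawled), len(with_redirect))
```

As crawled, the four URLs fall into two separate components; once B is folded into B/, only one component remains.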
The WebGraphDB dump after LinkRank was run:
"
http://localhost:8080/B 0.670625
http://localhost:8080/C.html 0.670625
http://localhost:8080/A.html 0.36249998
http://localhost:8080/B/ 0.36249998
"
In my opinion this order doesn't make sense given the link
structure A -> B -> B/index.html -> C: C should have the highest
LinkRank score. The order I would expect is:
C.html
B/index.html
B
A.html.
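As a rough sanity check of that expectation, here is a textbook PageRank power iteration in Python. It is not Nutch's LinkRank implementation and will not reproduce the scores in the dump; the damping factor of 0.85, the iteration count, and the choice to count the redirect as an ordinary link from B to B/ are all my assumptions.

```python
def rank(edges, damping=0.85, iterations=50):
    """Textbook PageRank; dangling nodes spread their score evenly."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [dst for src, dst in edges if src == n] for n in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Score held by pages without outlinks is shared by all pages.
        dangling = sum(score[n] for n in nodes if not out[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        nxt = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * score[src] / len(out[src])
        score = nxt
    return score


def order(scores):
    """URLs sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Edges as recorded in the LinkDB dump (short names for readability).
CRAWLED = [("A.html", "B"), ("B/", "C.html")]
# Assumption: the redirect is counted as a link from B to B/.
EXPECTED = CRAWLED + [("B", "B/")]

crawled = rank(CRAWLED)
expected = rank(EXPECTED)
print(order(expected))
```

On the edges as crawled, this sketch gives B and C.html equal scores and A.html and B/ equal lower scores, the same tie pattern as the dump. With the redirect counted as a link, the order becomes C.html, B/, B, A.html.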
If you consider this a bug, I don't have time to fix it myself. I just
decided to report it.
Best Regards,
Nutch User - 1 (nutch.user.1@gmail.com)