Posted to dev@nutch.apache.org by Nils Hoeller <ni...@arcor.de> on 2005/09/03 14:20:55 UTC
Finding the Top Ten Topics in the Site Index
Hi,
I have a first version of my sitemap visualization tool working
(very buggy and not yet very helpful, because the graph is
not filtered enough).
(It can be visited at
http://server01.pool.ifis.uni-luebeck.de:8080/cscrawler/login.html
but PLEASE USE ONLY DEPTH 1 OR 2, or 3 for small sites.)
The graph visualization is also buggy; you may need to
click on a light blue node first to have the graph
rebuilt correctly.
Anyway:
My question:
I'd like to give the user a top ten of words/topics
from the index of a crawled site.
So I'll have to get a list of words
(filtered by a stop word list) out of the index.
Can you tell me the easiest way to get a list
of words out of the index, together with a count
of how often each word occurs in the index?
If you have a look at Luke, you'll see that
feature, but unfiltered.
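
To make the request concrete, here is a minimal sketch of the filtering and ranking step, assuming the word counts have already been read from the index into a map. The class name TopTerms and the method topTerms are hypothetical names used only for illustration; they are not part of Nutch or Lucene. In the Lucene versions of that time, the counts themselves could presumably be collected by walking the TermEnum returned by IndexReader.terms() and reading term() and docFreq() for each entry; note that docFreq() counts the documents containing a word, not its total number of occurrences.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch or Lucene.
final class TopTerms {

    private TopTerms() {
    }

    /**
     * Returns up to n words with the highest counts, skipping stop words.
     * Assumes the stop words are given in lower case.
     * Ties are broken alphabetically so the result is deterministic.
     */
    static List<Map.Entry<String, Integer>> topTerms(
            Map<String, Integer> counts, Set<String> stopWords, int n) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Compare case-insensitively against the stop word list.
            if (!stopWords.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        // Highest count first, then alphabetical order.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(kept.subList(0, Math.min(n, kept.size())));
    }
}
```

Calling TopTerms.topTerms(counts, stopWords, 10) would then give the top ten. For a large index it may be cheaper to keep only the ten best entries while enumerating the terms instead of sorting the whole list.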
Thanks for your help.
Nils
PS:
I'd like to insert a picture
like "powered by nutch".
Is that OK? Or whom do I have to ask?
(Everything is just research, not commercial.)