Posted to dev@nutch.apache.org by Nils Hoeller <ni...@arcor.de> on 2005/09/03 14:20:55 UTC

Finding the Top Ten Topics in the Site Index

Hi, 

I got the first version of my Sitemap Visualization Tool working
(it is very buggy and not yet very helpful, because the graph is
not filtered enough).

(It can be visited at
http://server01.pool.ifis.uni-luebeck.de:8080/cscrawler/login.html
BUT PLEASE USE ONLY DEPTH 1 OR 2 (3 for small sites).)

The graph visualisation is also buggy; you might have to
click on a light blue node first to have the graph
rebuilt correctly.

Anyway: 

My Question:

I'd like to give the user a top ten of words/topics
out of the index of a crawled site.

So I'll have to get a list of words
(filtered by a stop word list) out of the index.

Can you tell me the easiest way to get a list
of words out of the index, together with the count
of how often each word is found in the index?

When you have a look at Luke, you'll see that
feature, but not filtered.
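
To make the filtering and ranking step concrete, here is a minimal
sketch in plain Java. It assumes the term counts have already been
read out of the index into a map; how best to do that is the open
question. In the Lucene versions of this period, IndexReader.terms()
returns a TermEnum whose term() and docFreq() could feed such a map,
but note that docFreq() is the number of documents containing the
term, not the total number of occurrences. The class name TopTerms,
the method topTerms and the sample data are hypothetical and are not
part of Nutch or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not part of Nutch or Lucene.
class TopTerms {

    // Returns up to n terms that are not stop words, highest count first.
    // Terms with equal counts are ordered alphabetically.
    static List<Map.Entry<String, Integer>> topTerms(
            Map<String, Integer> counts, Set<String> stopWords, int n) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Assumes the stop word list is stored in lower case.
            if (!stopWords.contains(e.getKey().toLowerCase())) {
                kept.add(e);
            }
        }
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, kept.size()));
        return new ArrayList<>(kept.subList(0, limit));
    }

    public static void main(String[] args) {
        // Made-up counts standing in for values read from the index.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 50);
        counts.put("nutch", 12);
        counts.put("index", 12);
        counts.put("crawler", 7);

        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "of"));

        for (Map.Entry<String, Integer> e : topTerms(counts, stopWords, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Sorting the whole list is fine for a single site; for a very large
index a bounded priority queue of size ten would avoid holding and
sorting every term.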

Thanks for your help.

Nils

PS:

I'd like to insert a picture
like "powered by Nutch".

Is that OK? Or whom do I have to ask?
(everything is just research, not commercial)