Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/25 13:36:44 UTC

[Nutch Wiki] Trivial Update of "DomainStatistics" by AlexMc

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "DomainStatistics" page has been changed by AlexMc.
http://wiki.apache.org/nutch/DomainStatistics?action=diff&rev1=1&rev2=2

--------------------------------------------------

  There is a tool that can tell you more about which domains have been fetched. You can run it with something like this:
  
  
+ == usage ==
  
-     $ bin/nutch org.apache.nutch.util.domain.DomainStatistics
-     hdfs://nn:9000/user/otis/crawl/crawldb/current
-     hdfs://nn:9000/user/otis/ds-host host 8
+ {{{
+ $ bin/nutch org.apache.nutch.util.domain.DomainStatistics inputDirs outDir host|domain|suffix [numOfReducer]
+ }}}
  
+ == example ==
+ 
+ {{{
+ $ bin/nutch org.apache.nutch.util.domain.DomainStatistics hdfs://nn:9000/user/otis/crawl/crawldb/current hdfs://nn:9000/user/otis/ds-host host 8
+ }}}
  
  You can then -cat the ds-host output from DFS and pipe it to sort -nrk1 to sort by count, highest count first.
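
The sorting step can be sketched as follows. This is only an illustration under stated assumptions: it assumes each output line is a count followed by a tab and the host name, the sample values and the file name ds-host-sample.txt are made up, and the part-* file naming in the commented cluster command is the usual Hadoop reducer output convention rather than something stated on this page.

```shell
# Hypothetical sample standing in for the DomainStatistics output
# (assumed layout: count<TAB>host; the values are invented).
printf '3\texample.org\n12\texample.com\n7\texample.net\n' > ds-host-sample.txt

# Sort numerically (-n) in reverse (-r) on the first field (-k1),
# so the highest count comes first.
sort -nrk1 ds-host-sample.txt

# On a real cluster the same sort would be fed from DFS, for example
# (path taken from the example above, part-* naming assumed):
# bin/hadoop fs -cat hdfs://nn:9000/user/otis/ds-host/part-* | sort -nrk1
```

Piping the result through head limits the listing to the busiest hosts.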