Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/30 16:26:29 UTC

[Nutch Wiki] Update of "Nutch2Roadmap" by JulienNioche

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Nutch2Roadmap" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=2&rev2=3

--------------------------------------------------

      * robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
    * Remove index / search and delegate to SOLR
      * we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.
+   * Rewrite SOLR deduplication: do everything using the webtable and avoid retrieving content from SOLR
    * Various new functionalities 
      * e.g. sitemap support, canonical tag, better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc.