Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2013/02/16 18:37:13 UTC
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

     [ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1047:
---------------------------------

    Attachment: NUTCH-1047-1.x-v5.patch

Fixed a bug in the argument-length check for the index command.
Fixed an issue where the solr param was not passed on when using the all-in-one crawl command.
Added a describe() method to IndexWriter, which is called by the IndexingJob and logs a list of all the active index writers as well as the parameters that they take.
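
To illustrate the describe() idea, here is a minimal sketch, not the code from the patch. Only the names IndexWriter and describe() come from the message above; the simplified interface, the two example writers, the IndexWriters holder and the parameter names are hypothetical and exist purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the IndexWriter extension point.
// Only describe() is named in the message; everything else is assumed.
interface IndexWriter {
    // Returns a human-readable summary of the writer and its parameters.
    String describe();
}

// Hypothetical backend; the parameter name is made up for this sketch.
class ExampleSolrWriter implements IndexWriter {
    @Override
    public String describe() {
        return "ExampleSolrWriter\n"
             + "\texample.solr.url : URL of the search server\n";
    }
}

// A second hypothetical backend, to show several writers being active.
class ExampleStoreWriter implements IndexWriter {
    @Override
    public String describe() {
        return "ExampleStoreWriter\n"
             + "\texample.store.host : host of the external storage\n";
    }
}

// Hypothetical holder for the active writers. An indexing job could
// call describe() once at startup and send the result to its log.
class IndexWriters {
    private final List<IndexWriter> writers = new ArrayList<>();

    IndexWriters add(IndexWriter writer) {
        writers.add(writer);
        return this;
    }

    // Concatenates the description of every active writer.
    String describe() {
        StringBuilder sb = new StringBuilder("Active IndexWriters :\n");
        for (IndexWriter writer : writers) {
            sb.append(writer.describe());
        }
        return sb.toString();
    }
}

class DescribeDemo {
    public static void main(String[] args) {
        IndexWriters active = new IndexWriters()
            .add(new ExampleSolrWriter())
            .add(new ExampleStoreWriter());
        // A real job would presumably use its logger rather than stdout.
        System.out.print(active.describe());
    }
}
```

Having each writer describe itself keeps the job generic: it does not need to know which backends are plugged in or which parameters they expect.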

All the issues mentioned previously should now be fixed. The crawl and solrindex commands should work in exactly the same way as before, so nothing changes from a user's point of view, but we also gain the possibility of plugging in new backends.

Please give it a try; it would be nice to commit this soon.


                
> Pluggable indexing backends
> ---------------------------
>
>                 Key: NUTCH-1047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>              Labels: indexing
>             Fix For: 1.7
>
>         Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch, NUTCH-1047-1.x-v5.patch
>
>
> One possible feature would be to add a new endpoint for indexing backends and make the indexing pluggable. At the moment we are hardwired to SOLR, which is OK, but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could pertain to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning, and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira