Posted to dev@nutch.apache.org by "kaveh minooie (JIRA)" <ji...@apache.org> on 2014/09/12 02:49:34 UTC

[jira] [Updated] (NUTCH-1480) SolrIndexer to write to multiple servers.

     [ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kaveh minooie updated NUTCH-1480:
---------------------------------
    Attachment: adding-support-for-sharding-indexer-for-solr.patch

I found this issue today while checking whether what I am about to upload would be a duplicate, and it is a good thing I did, since there are apparently quite a few issues about this. Since this is the most recent one, I will post it here.

This patch adds another plugin, indexer-solrshard, that shards the index data on the Nutch side. It is mostly geared toward Solr 3.x, as there are still a few of those around (including in our production environment), but it can have benefits even with Solr 4.x, which I will get to below.

It adds two new properties to the Nutch config file (solr.shardkey and solr.server.urls). solr.shardkey is the name of the field that is used to generate the hash code (if it is used against Solr 3.x, it should be the uniqueKey field in the schema file, otherwise deletes will not work properly), and solr.server.urls is a comma-separated list of Solr core URLs or instance URLs.
The plugin divides the hash value by the number of URLs to figure out which core it should put the document in. It also uses the rest of the Solr properties (commit size, etc.); the code is really the same.
The idea behind having solr.server.urls instead of just reusing solr.server.url was that both plugins could be used simultaneously, which can help in migrating from 3.x to 4.x as well, though I guess the same argument can be made for the other properties too.
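
As a rough illustration of the routing described above, here is a minimal Java sketch. It is not the code from the attached patch. It assumes that the core is picked by taking the remainder of the shard key's hash code divided by the number of URLs, and it uses Math.floorMod so that negative hash codes still map to a valid index. The class name, method names, and example URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from the attached patch.
class ShardRoutingSketch {

    // Split a comma-separated value (as solr.server.urls would hold)
    // into trimmed, non-empty URLs.
    static List<String> parseUrls(String commaSeparated) {
        List<String> urls = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    // Assumption: shard index = hash(key) modulo the number of URLs.
    // floorMod keeps the result in [0, shardCount) for negative hash codes.
    static int shardFor(String shardKeyValue, int shardCount) {
        return Math.floorMod(shardKeyValue.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        List<String> urls = parseUrls(
            "http://solr1:8983/solr/core0, http://solr2:8983/solr/core1");
        String docId = "http://example.com/page.html";
        System.out.println(docId + " -> " + urls.get(shardFor(docId, urls.size())));
    }
}
```

Because the same key always hashes to the same index, a later delete for that key goes to the core that holds the document, which is why the shard key needs to match the uniqueKey field on Solr 3.x.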

The code uses the String.hashCode function, which is good enough in terms of evenly distributing docs across multiple cores (in our case, with about 85 million docs over 8 cores, the difference between the number of docs in each core is less than 5%), but changing the hash function, or even making it customizable as was suggested in NUTCH-945, is trivial.
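
One way to check this kind of distribution on your own keys is sketched below. This is a hypothetical helper, not part of the patch. It assumes the same hash-modulo routing, uses synthetic URLs as stand-in keys, and reports the gap between the largest and smallest shard relative to the mean shard size, which is only one possible reading of the "difference" quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch for measuring how evenly String.hashCode spreads keys.
class ShardSkewSketch {

    // Count how many keys land in each shard (assumes hash modulo shard count).
    static int[] countPerShard(Iterable<String> keys, int shardCount) {
        int[] counts = new int[shardCount];
        for (String key : keys) {
            counts[Math.floorMod(key.hashCode(), shardCount)]++;
        }
        return counts;
    }

    // Relative spread: (largest shard - smallest shard) / mean shard size.
    static double relativeSpread(int[] counts) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        long total = 0;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
            total += c;
        }
        double mean = (double) total / counts.length;
        return (max - min) / mean;
    }

    public static void main(String[] args) {
        // Synthetic keys; real document ids may distribute differently.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add("http://example.com/page/" + i);
        }
        int[] counts = countPerShard(keys, 8);
        System.out.println(Arrays.toString(counts));
        System.out.println("relative spread: " + relativeSpread(counts));
    }
}
```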

Turning the hashing mechanism off is also trivial (again, I didn't know about this issue when I was writing this, otherwise I would have done it already): we can add another property such as solr.usehash and, by setting it to false, have the plugin just post the documents to all the servers, which could also be quite useful.
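
The proposed switch could look roughly like the following. This is only a sketch of the idea, since solr.usehash does not exist in the patch yet. The class and method names are hypothetical, and the hash-modulo routing is assumed as before.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a solr.usehash style toggle; not in the attached patch.
class TargetSelectionSketch {

    // With hashing on, a document goes to exactly one server.
    // With hashing off, it is posted to every configured server.
    static List<String> targetsFor(String shardKeyValue, List<String> urls, boolean useHash) {
        if (!useHash) {
            return urls;
        }
        int index = Math.floorMod(shardKeyValue.hashCode(), urls.size());
        return Collections.singletonList(urls.get(index));
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://solr1/core", "http://solr2/core");
        System.out.println(targetsFor("doc-1", urls, true));
        System.out.println(targetsFor("doc-1", urls, false));
    }
}
```

With hashing off, this covers the case in the issue description below: sending every document to multiple servers.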

As for using it against Solr 4.x, it can function as a load balancer. Believe me when I say that watching 40 reduce jobs try to write to a single Solr instance is rather horrifying.

The patch is against trunk, but porting it to 2.x is trivial (I actually think it can probably be applied as is, but I haven't tested it yet).

> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
>                 Key: NUTCH-1480
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1480
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: NUTCH-1480-1.6.1.patch, adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a comma-delimited list of URLs using Configuration.getString(). SolrWriter should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this issue allows you to index to multiple SolrCloud clusters at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)