Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2018/06/14 19:29:38 UTC

[Nutch Wiki] Update of "IndexWriters" by RoannelFernandez

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "IndexWriters" page has been changed by RoannelFernandez:
https://wiki.apache.org/nutch/IndexWriters

Comment:
Parameters from index-writers.xml and a few changes

New page:
= Index writers configuration =

<<TableOfContents(4)>>

== Structure of index-writers.xml ==

== Mapping section ==

== Parameters section ==

=== Solr indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
||type ||Specifies the [[https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/SolrClient.html|SolrClient]] implementation to use. The value is a string, either '''cloud''' or '''http''', which selects [[https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html|CloudSolrServer]] or [[https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html|HttpSolrServer]] respectively. ||http ||
||url ||Defines the Solr URL into which data should be indexed. This should be a fully qualified URL. Multiple URLs can be provided, using a comma as the delimiter. ||http://localhost:8983/solr/nutch ||
||commitSize ||Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory.<<BR>> '''Note''': It does not explicitly trigger a server side commit. ||250 ||
||auth || Whether to enable HTTP basic authentication for communicating with Solr. Use the [[#username|username]] and [[#password|password]] properties to configure your credentials. ||false ||
||<<Anchor(username)>> username ||The username for the Solr server. ||username ||
||<<Anchor(password)>> password ||The password for the Solr server. ||password ||
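
The effect of commitSize can be pictured as plain batching: documents are grouped into update batches of at most that many items. The following is a minimal Python sketch of that idea, not Nutch's actual (Java) implementation; the function name `batches` and the sample documents are hypothetical and used only for illustration.

```python
from itertools import islice


def batches(documents, commit_size=250):
    """Yield lists of at most commit_size documents (illustrative only)."""
    if commit_size < 1:
        raise ValueError("commit_size must be at least 1")
    iterator = iter(documents)
    while True:
        # Take the next commit_size documents; an empty list means we are done.
        batch = list(islice(iterator, commit_size))
        if not batch:
            return
        yield batch


# Hypothetical input: 600 small documents, grouped with the default size of 250.
docs = [{"id": str(i)} for i in range(600)]
sizes = [len(batch) for batch in batches(docs)]
```

A smaller commit_size means each batch holds fewer documents in memory at once, which is why the table suggests decreasing it for very large documents.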

=== Elasticsearch indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
|| host || Comma-separated list of hostnames to send documents to using TransportClient. Either host and port, or cluster, must be defined. ||  ||
|| port || The port to connect to using TransportClient. || 9300 ||
|| cluster || The cluster name to discover. Either host and port, or cluster, must be defined. ||  ||
|| index || Default index to send documents to. || nutch ||
|| max.bulk.docs || Maximum size of the bulk in number of documents. || 250 ||
|| max.bulk.size || Maximum size of the bulk in bytes. || 2500500 ||
|| exponential.backoff.millis || Initial delay for the BulkProcessor's exponential backoff policy. || 100 ||
|| exponential.backoff.retries || Number of times the BulkProcessor's exponential backoff policy should retry bulk operations. || 10 ||
|| bulk.close.timeout || Number of seconds allowed for the BulkProcessor to complete its last operation. || 600 ||
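
The two bulk limits work together: a bulk is sent as soon as either the document count or the byte size reaches its limit. The Python sketch below illustrates that rule under stated assumptions; `BulkBuffer` is a hypothetical class written for this page, not part of Nutch or the Elasticsearch client, and it measures size as the length of the serialized document.

```python
class BulkBuffer:
    """Collects serialized documents and flushes when a limit is reached.

    Hypothetical illustration of max.bulk.docs and max.bulk.size.
    """

    def __init__(self, max_bulk_docs=250, max_bulk_size=2500500):
        self.max_bulk_docs = max_bulk_docs
        self.max_bulk_size = max_bulk_size
        self.docs = []
        self.size_in_bytes = 0

    def add(self, serialized_doc: bytes):
        """Add one document; return the flushed bulk if a limit was hit, else None."""
        self.docs.append(serialized_doc)
        self.size_in_bytes += len(serialized_doc)
        too_many = len(self.docs) >= self.max_bulk_docs
        too_big = self.size_in_bytes >= self.max_bulk_size
        if too_many or too_big:
            return self.flush()
        return None

    def flush(self):
        """Return the pending documents and reset the buffer."""
        bulk, self.docs, self.size_in_bytes = self.docs, [], 0
        return bulk


# Small limits chosen only to make the behaviour visible.
buffer = BulkBuffer(max_bulk_docs=3, max_bulk_size=100)
results = [buffer.add(b"x" * 10) for _ in range(3)]
```

Here the third `add` call returns the bulk of three documents because the document-count limit is reached first; a single 100-byte document would trigger the size limit instead.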

=== Rabbit indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
|| server.uri || URI with connection parameters in the form amqp://username:password@hostname:port/virtualHost <<BR>> Where: <<Include(IndexWriters/RabbitURIParts)>> || amqp://guest:guest@localhost:5672/ ||
|| binding || Whether the relationship between an exchange and a queue is created automatically. Default "false". <<BR>> '''NOTE:''' Binding between exchanges is not supported. || false ||
|| binding.arguments || Arguments used in binding. It must have the form key1=value1,key2=value2. This value is only used when the exchange's type is headers and the value of 'rabbitmq.indexer.binding' property is true. In other cases it is ignored. ||  ||
|| exchange.name || Name for the exchange where the messages will be sent. Default "". ||  ||
|| exchange.options || Options used when the exchange is created. Only used when the value of 'rabbitmq.indexer.binding' property is true. Default "type=direct,durable=true". || type=direct,durable=true ||
|| queue.name || Name of the queue used to create the binding. Default "nutch.queue". Only used when the value of 'rabbitmq.indexer.binding' property is true. || nutch.queue ||
|| queue.options || Options used when the queue is created. Only used when the value of 'rabbitmq.indexer.binding' property is true. Default "durable=true,exclusive=false,auto-delete=false".<<BR>> It must have the form durable={durable},exclusive={exclusive},auto-delete={auto-delete},arguments={arguments}<<BR>> where: <<Include(IndexWriters/RabbitQueueOptions)>> || durable=true,exclusive=false,auto-delete=false ||
|| routingkey || The routing key used to publish messages to specific queues. It is only used when the exchange type is "topic" or "direct". Default is the value of 'rabbitmq.indexer.queue.name' property. ||  ||
|| commit.mode || "single" if a message contains only one document. In this case a header with the action (write, update or delete) will be added. "multiple" if a message contains all documents. Default "multiple". || multiple ||
|| commit.size || Number of documents to send in each message if the value of 'rabbitmq.indexer.commit.mode' property is "multiple". Default "250". || 250 ||
|| headers.static || Headers to add to each message. It must have the form key1=value1,key2=value2. ||  ||
|| headers.dynamic || Document's fields to add as headers to each message. It must have the form field1,field2. ||  ||
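
Several Rabbit parameters are small string formats: server.uri is an AMQP URI, and binding.arguments, exchange.options and headers.static use the key1=value1,key2=value2 form. The Python sketch below shows one way to read those formats with the standard library. The helper names `parse_amqp_uri` and `parse_key_values` are hypothetical, this is not how Nutch itself parses them, and the sketch does not attempt to handle a nested `arguments={arguments}` value in queue.options.

```python
from urllib.parse import urlsplit, unquote


def parse_amqp_uri(uri):
    """Split amqp://username:password@hostname:port/virtualHost into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "amqp":
        raise ValueError(f"expected an amqp:// URI, got {uri!r}")
    return {
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
        # Everything after the first "/" is taken as the virtual host name.
        "virtual_host": unquote(parts.path[1:]),
    }


def parse_key_values(text):
    """Parse 'key1=value1,key2=value2' into a dict; empty input gives {}."""
    result = {}
    for pair in text.split(","):
        if not pair.strip():
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = value.strip()
    return result


# The default values from the table above.
uri_parts = parse_amqp_uri("amqp://guest:guest@localhost:5672/")
exchange_options = parse_key_values("type=direct,durable=true")
```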

=== Elasticsearch rest indexer properties ===

||'''Parameter Name''' ||'''Description''' ||'''Default value''' ||
|| host || The hostname, or a comma-separated list of hostnames, to send documents to using Elasticsearch Jest. Both host and port must be defined. ||  ||
|| port || The port to connect to using Elasticsearch Jest. || 9200 Check this number||
|| index || Default index to send documents to. || nutch ||
|| max.bulk.docs || Maximum size of the bulk in number of documents. || 250 ||
|| max.bulk.size || Maximum size of the bulk in bytes. || 2500500 Check this number||
|| user || Username for auth credentials (only used when https is enabled) || user ||
|| password || Password for auth credentials (only used when https is enabled) || password ||
|| type || Default type to send documents to. || doc ||
|| https || "true" to enable https, "false" to disable it. If you have disabled http access (by forcing https), be sure to set this to true; otherwise you might get "connection reset by peer". || false ||
|| trustallhostnames || "true" to trust the Elasticsearch server's certificate even if its listed domain name does not match the domain it is hosted on. "false" to check that the certificate's listed domain matches the domain the server is hosted on, and to fail indexing if it does not. Only used when https is enabled. || false ||
|| languages || A list of strings denoting the supported languages (e.g. `en,de,fr,it`). If this value is empty all documents will be sent to index ${elastic.rest.index}. If not empty the Rest client will distribute documents in different indices based on their `lang` property. Indices are named with the following schema: ${elastic.rest.index}${elastic.rest.index.separator}${lang} (e.g. `nutch_de`). Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink} (e.g. `nutch_others`). ||  ||
|| separator || Default value is `_`. Used only if `elastic.rest.index.languages` is defined, to build the index name (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${lang}). || _ ||
|| sink || Default value is `others`. Used only if `elastic.rest.index.languages` is defined, to build the name of the index that stores documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink}). || others ||
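
The index naming rule described by the languages, separator and sink parameters can be written out as a short function. This is a minimal Python sketch of the rule as stated in the table, not the plugin's actual code; `resolve_index` is a hypothetical name.

```python
def resolve_index(lang, index="nutch", languages=(), separator="_", sink="others"):
    """Return the index a document with the given `lang` value would go to."""
    if not languages:
        # No languages configured: every document goes to the default index.
        return index
    if lang in languages:
        # Supported language: index + separator + language code.
        return f"{index}{separator}{lang}"
    # Unsupported language: index + separator + sink.
    return f"{index}{separator}{sink}"


supported = ["en", "de", "fr", "it"]
german = resolve_index("de", languages=supported)
japanese = resolve_index("ja", languages=supported)
unconfigured = resolve_index("de")
```

With the defaults from the table, a German document goes to `nutch_de`, a document in an unsupported language goes to `nutch_others`, and with no languages configured everything goes to `nutch`.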