Posted to commits@lucene.apache.org by tf...@apache.org on 2017/05/12 23:38:44 UTC

[17/58] [abbrv] lucene-solr:jira/solr-10233: squash merge jira/solr-10290 into master

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/95968c69/solr/solr-ref-guide/src/kerberos-authentication-plugin.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/kerberos-authentication-plugin.adoc b/solr/solr-ref-guide/src/kerberos-authentication-plugin.adoc
new file mode 100644
index 0000000..7bf3060
--- /dev/null
+++ b/solr/solr-ref-guide/src/kerberos-authentication-plugin.adoc
@@ -0,0 +1,379 @@
+= Kerberos Authentication Plugin
+:page-shortname: kerberos-authentication-plugin
+:page-permalink: kerberos-authentication-plugin.html
+
+If you are using Kerberos to secure your network environment, the Kerberos authentication plugin can be used to secure a Solr cluster.
+
+This allows Solr to use a Kerberos service principal and keytab file to authenticate with ZooKeeper and between nodes of the Solr cluster (if applicable). Users of the Admin UI and all clients (such as <<using-solrj.adoc#using-solrj,SolrJ>>) must have a valid Kerberos ticket before they can use the UI or send requests to Solr.
+
+The Kerberos authentication plugin is supported in both SolrCloud and standalone modes.
+
+[TIP]
+====
+If you are using Solr with a Hadoop cluster secured with Kerberos and intend to store your Solr indexes in HDFS, also see the section <<running-solr-on-hdfs.adoc#running-solr-on-hdfs,Running Solr on HDFS>> for additional steps to configure Solr for that purpose. The instructions on this page apply only to scenarios where Solr will be secured with Kerberos. If you only need to store your indexes in a Kerberized HDFS system, please see the other section referenced above.
+====
+
+[[KerberosAuthenticationPlugin-HowSolrWorksWithKerberos]]
+== How Solr Works With Kerberos
+
+When setting up Solr to use Kerberos, configurations are put in place for Solr to use a _service principal_, or a Kerberos username, which is registered with the Key Distribution Center (KDC) to authenticate requests. The configurations define the service principal name and the location of the keytab file that contains the credentials.
+
+[[KerberosAuthenticationPlugin-security.json]]
+=== security.json
+
+The Solr authentication model uses a file called `security.json`. A description of this file and how it is created and maintained is covered in the section <<authentication-and-authorization-plugins.adoc#authentication-and-authorization-plugins,Authentication and Authorization Plugins>>. If this file is created after an initial startup of Solr, a restart of each node of the system is required.
+
+[[KerberosAuthenticationPlugin-ServicePrincipalsandKeytabFiles]]
+=== Service Principals and Keytab Files
+
+Each Solr node must have a service principal registered with the Key Distribution Center (KDC). The Kerberos plugin uses SPNego to negotiate authentication.
+
+Using `HTTP/host1@YOUR-DOMAIN.ORG` as an example of a service principal:
+
+* `HTTP` indicates the type of requests this service principal will be used to authenticate. The `HTTP/` prefix in the service principal is required for SPNego to work with requests to Solr over HTTP.
+* `host1` is the host name of the machine hosting the Solr node.
+* `YOUR-DOMAIN.ORG` is the organization-wide Kerberos realm.
+
+Multiple Solr nodes on the same host may have the same service principal, since the host name is common to them all.
+
+Along with the service principal, each Solr node needs a keytab file which should contain the credentials of the service principal used. A keytab file contains encrypted credentials to support passwordless logins while obtaining Kerberos tickets from the KDC. For each Solr node, the keytab file should be kept in a secure location and not shared with users of the cluster.
+
+Since a Solr cluster requires internode communication, each node must also be able to make Kerberos-enabled requests to other nodes. By default, Solr uses the same service principal and keytab as a 'client principal' for internode communication. You may configure a distinct client principal explicitly, but doing so is not recommended and is not covered in the examples below.
+
+[[KerberosAuthenticationPlugin-KerberizedZooKeeper]]
+=== Kerberized ZooKeeper
+
+When setting up a Kerberized SolrCloud cluster, it is recommended to enable Kerberos security for ZooKeeper as well.
+
+In such a setup, the client principal used to authenticate requests with ZooKeeper can be shared for internode communication as well. This has the benefit of not needing to renew the ticket-granting tickets (TGTs) separately, since the ZooKeeper client used by Solr takes care of this. To achieve this, a single JAAS configuration (with the app name `Client`) can be used for the Kerberos plugin as well as for the ZooKeeper client.
+
+See the <<ZooKeeper Configuration>> section below for an example of starting ZooKeeper in Kerberos mode.
+
+[[KerberosAuthenticationPlugin-BrowserConfiguration]]
+=== Browser Configuration
+
+In order for your browser to access the Solr Admin UI after enabling Kerberos authentication, it must be able to negotiate with the Kerberos authenticator service to allow you access. Each browser supports this differently, and some (like Chrome) do not support it at all. If you see 401 errors when trying to access the Solr Admin UI after enabling Kerberos authentication, it's likely your browser has not been configured properly to know how or where to negotiate the authentication request.
+
+Detailed information on how to set up your browser is beyond the scope of this documentation; please consult your Kerberos system administrators for details on how to configure your browser.
+
+[[KerberosAuthenticationPlugin-PluginConfiguration]]
+== Plugin Configuration
+
+.Consult Your Kerberos Admins!
+[WARNING]
+====
+Before attempting to configure Solr to use Kerberos authentication, please review each step outlined below and consult with your local Kerberos administrators on each detail to be sure you know the correct values for each parameter. Small errors can cause Solr to not start or not function properly, and are notoriously difficult to diagnose.
+====
+
+Configuration of the Kerberos plugin has several parts:
+
+* Create service principals and keytab files
+* ZooKeeper configuration
+* Create or update `/security.json`
+* Define `jaas-client.conf`
+* Solr startup parameters
+
+We'll walk through each of these steps below.
+
+.Using Hostnames
+[IMPORTANT]
+====
+To use host names instead of IP addresses, use the `SOLR_HOST` configuration in `bin/solr.in.sh` or pass a `-Dhost=<hostname>` system parameter during Solr startup. This guide uses IP addresses. If you specify a hostname, replace all the IP addresses in the guide with the Solr hostname as appropriate.
+====
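A hypothetical `bin/solr.in.sh` entry for this (the hostname here is a placeholder; substitute your own):

```shell
# Use a hostname instead of an IP address in all Kerberos-related settings.
SOLR_HOST="solr1.example.com"
```

Equivalently, the same value can be passed at startup, e.g. `bin/solr start -c -Dhost=solr1.example.com`.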
+
+[[KerberosAuthenticationPlugin-GetServicePrincipalsandKeytabs]]
+=== Get Service Principals and Keytabs
+
+Before configuring Solr, make sure you have a Kerberos service principal available in the KDC for each Solr host, and for ZooKeeper (if ZooKeeper has not already been configured), and generate a keytab file as shown below.
+
+This example assumes the hostname is `192.168.0.107` and your home directory is `/home/foo/`. This example should be modified for your own environment.
+
+[source,plain]
+----
+root@kdc:/# kadmin.local
+Authenticating as principal foo/admin@EXAMPLE.COM with password.
+
+kadmin.local:  addprinc HTTP/192.168.0.107
+WARNING: no policy specified for HTTP/192.168.0.107@EXAMPLE.COM; defaulting to no policy
+Enter password for principal "HTTP/192.168.0.107@EXAMPLE.COM":
+Re-enter password for principal "HTTP/192.168.0.107@EXAMPLE.COM":
+Principal "HTTP/192.168.0.107@EXAMPLE.COM" created.
+
+kadmin.local:  ktadd -k /tmp/107.keytab HTTP/192.168.0.107
+Entry for principal HTTP/192.168.0.107 with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/tmp/107.keytab.
+Entry for principal HTTP/192.168.0.107 with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:/tmp/107.keytab.
+Entry for principal HTTP/192.168.0.107 with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/tmp/107.keytab.
+Entry for principal HTTP/192.168.0.107 with kvno 2, encryption type des-cbc-crc added to keytab WRFILE:/tmp/107.keytab.
+
+kadmin.local:  quit
+----
+
+Copy the keytab file from the KDC server’s `/tmp/107.keytab` location to the Solr host at `/keytabs/107.keytab`. Repeat this step for each Solr node.
+
+You might need to take similar steps to create a ZooKeeper service principal and keytab if one has not already been set up. In that case, the example below uses a separate service principal for ZooKeeper, so the steps above might be repeated with `zookeeper/host1` as the service principal for one of the nodes.
+
+[[KerberosAuthenticationPlugin-ZooKeeperConfiguration]]
+=== ZooKeeper Configuration
+
+If you are using a ZooKeeper that has already been configured to use Kerberos, you can skip the ZooKeeper-related steps shown here.
+
+Since ZooKeeper manages the communication between nodes in a SolrCloud cluster, it must also be able to authenticate with each node of the cluster. Configuration requires setting up a service principal for ZooKeeper, defining a JAAS configuration file and instructing ZooKeeper to use both of those items.
+
+The first step is to create a file `java.env` in ZooKeeper's `conf` directory and add the following to it, as in this example:
+
+[source,bash]
+----
+export JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/conf/jaas-client.conf"
+----
+
+The JAAS configuration file should contain the following parameters. Be sure to change the `principal` and `keyTab` path as appropriate. The file must be located in the path defined in the step above, with the filename specified.
+
+[source,plain]
+----
+Server {
+  com.sun.security.auth.module.Krb5LoginModule required
+  useKeyTab=true
+  keyTab="/keytabs/zkhost1.keytab"
+  storeKey=true
+  doNotPrompt=true
+  useTicketCache=false
+  debug=true
+  principal="zookeeper/host1@EXAMPLE.COM";
+};
+----
+
+Finally, add the following lines to the ZooKeeper configuration file `zoo.cfg`:
+
+[source,bash]
+----
+authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
+jaasLoginRenew=3600000
+----
+
+Once all of the pieces are in place, start ZooKeeper with the following parameter pointing to the JAAS configuration file:
+
+[source,bash]
+----
+bin/zkServer.sh start -Djava.security.auth.login.config=/etc/zookeeper/conf/jaas-client.conf
+----
+
+[[KerberosAuthenticationPlugin-Createsecurity.json]]
+=== Create security.json
+
+Create the `security.json` file.
+
+In SolrCloud mode, you can set up Solr to use the Kerberos plugin by uploading `security.json` to ZooKeeper as you create it, as follows:
+
+[source,bash]
+----
+server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd put /security.json '{"authentication":{"class": "org.apache.solr.security.KerberosPlugin"}}'
+----
+
+If you are using Solr in standalone mode, you need to create the `security.json` file and put it in your `$SOLR_HOME` directory.
+
+More details on how to use a `/security.json` file in Solr are available in the section <<authentication-and-authorization-plugins.adoc#authentication-and-authorization-plugins,Authentication and Authorization Plugins>>.
+
+[IMPORTANT]
+====
+If you already have a `/security.json` file in ZooKeeper, download the file, add or modify the `authentication` section, and upload it back to ZooKeeper using the <<command-line-utilities.adoc#command-line-utilities,Command Line Utilities>> available in Solr.
+====
+
+[[KerberosAuthenticationPlugin-DefineaJAASConfigurationFile]]
+=== Define a JAAS Configuration File
+
+The JAAS configuration file defines the properties to use for authentication, such as the service principal and the location of the keytab file. Other properties can also be set to ensure ticket caching and other features.
+
+The following example can be copied and modified slightly for your environment. The location of the file can be anywhere on the server, but it will be referenced when starting Solr so it must be readable on the filesystem. The JAAS file may contain multiple sections for different users, but each section must have a unique name so it can be uniquely referenced in each application.
+
+In the example below, we have created a JAAS configuration file with the name and path of `/home/foo/jaas-client.conf`. We will use this name and path when we define the Solr start parameters in the next section. Note that the client `principal` here is the same as the service principal. This will be used to authenticate internode requests and requests to ZooKeeper. Make sure to use the correct `principal` hostname and the `keyTab` file path.
+
+[source,plain]
+----
+Client {
+  com.sun.security.auth.module.Krb5LoginModule required
+  useKeyTab=true
+  keyTab="/keytabs/107.keytab"
+  storeKey=true
+  useTicketCache=true
+  debug=true
+  principal="HTTP/192.168.0.107@EXAMPLE.COM";
+};
+----
+
+The first line of this file defines the section name, which will be used with the `solr.kerberos.jaas.appname` parameter, defined below.
+
+The main properties we are concerned with are the `keyTab` and `principal` properties, but there are others which may be required for your environment. The https://docs.oracle.com/javase/8/docs/jre/api/security/jaas/spec/com/sun/security/auth/module/Krb5LoginModule.html[javadocs for the Krb5LoginModule] (the class that's being used and is called in the second line above) provide a good outline of the available properties, but for reference the ones in use in the above example are explained here:
+
+* `useKeyTab`: this boolean property defines whether we should use a keytab file (true, in this case).
+* `keyTab`: the location and name of the keytab file for the principal this section of the JAAS configuration file is for. The path should be enclosed in double-quotes.
+* `storeKey`: this boolean property allows the key to be stored in the private credentials of the user.
+* `useTicketCache`: this boolean property allows the ticket to be obtained from the ticket cache.
+* `debug`: this boolean property will output debug messages for help in troubleshooting.
+* `principal`: the name of the service principal to be used.
+
+[[KerberosAuthenticationPlugin-SolrStartupParameters]]
+=== Solr Startup Parameters
+
+While starting up Solr, the following host-specific parameters need to be passed. These parameters can be passed at the command line with the `bin/solr` start command (see <<solr-control-script-reference.adoc#solr-control-script-reference,Solr Control Script Reference>> for details on how to pass system parameters) or defined in `bin/solr.in.sh` or `bin/solr.in.cmd` as appropriate for your operating system.
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="30,10,60",options="header"]
+|===
+|Parameter Name |Required |Description
+|`solr.kerberos.name.rules` |No |Used to map Kerberos principals to short names. Default value is `DEFAULT`. Example of a name rule: `RULE:[1:$1@$0](.\*EXAMPLE.COM)s/@.*//`
+|`solr.kerberos.cookie.domain` |Yes |Used to issue cookies and should have the hostname of the Solr node.
+|`solr.kerberos.cookie.portaware` |No |When set to true, cookies are differentiated based on host and port, as opposed to standard cookies which are not port aware. This should be set if more than one Solr node is hosted on the same host. The default is false.
+|`solr.kerberos.principal` |Yes |The service principal.
+|`solr.kerberos.keytab` |Yes |Keytab file path containing service principal credentials.
+|`solr.kerberos.jaas.appname` |No |The app name (section name) within the JAAS configuration file which is required for internode communication. Default is `Client`, which is used for ZooKeeper authentication as well. If different users are used for ZooKeeper and Solr, they will need to have separate sections in the JAAS configuration file.
+|`java.security.auth.login.config` |Yes |Path to the JAAS configuration file for configuring a Solr client for internode communication.
+|===
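The `solr.kerberos.name.rules` syntax comes from Hadoop's `auth_to_local` rules. As a rough illustration only (a hypothetical sketch, not Hadoop's actual implementation), the example rule `RULE:[1:$1@$0](.\*EXAMPLE.COM)s/@.*//` formats a one-component principal as `$1@$0`, filters on the regex, and then applies the sed-style substitution:

```java
import java.util.regex.Pattern;

public class NameRuleSketch {
    // Hypothetical sketch of RULE:[1:$1@$0](.*EXAMPLE.COM)s/@.*//
    static String applyNameRule(String principal) {
        String[] parts = principal.split("@");        // $1 = name, $0 = realm
        String formatted = parts[0] + "@" + parts[1]; // the [1:$1@$0] format string
        if (Pattern.matches(".*EXAMPLE\\.COM", formatted)) { // the (.*EXAMPLE.COM) filter
            return formatted.replaceAll("@.*", "");   // the s/@.*// substitution
        }
        return principal; // non-matching principals are left unchanged here
    }

    public static void main(String[] args) {
        System.out.println(applyNameRule("jdoe@EXAMPLE.COM")); // jdoe
    }
}
```

So `jdoe@EXAMPLE.COM` maps to the short name `jdoe`, while principals from other realms are untouched.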
+
+Here is an example that could be added to `bin/solr.in.sh`. Make sure to change this example to use the right hostname and the keytab file path.
+
+[source,bash]
+----
+SOLR_AUTH_TYPE="kerberos"
+SOLR_AUTHENTICATION_OPTS="-Djava.security.auth.login.config=/home/foo/jaas-client.conf -Dsolr.kerberos.cookie.domain=192.168.0.107 -Dsolr.kerberos.cookie.portaware=true -Dsolr.kerberos.principal=HTTP/192.168.0.107@EXAMPLE.COM -Dsolr.kerberos.keytab=/keytabs/107.keytab"
+----
+
+.KDC with AES-256 encryption
+[IMPORTANT]
+====
+If your KDC uses AES-256 encryption, you need to add the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files to your JRE before a Kerberized Solr can interact with the KDC.
+
+You will know this is necessary when you see an error like this in your Solr logs: "KrbException: Encryption type AES256 CTS mode with HMAC SHA1-96 is not supported/enabled"
+
+For Java 1.8, this is available here: http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
+
+Replace the `local_policy.jar` present in `JAVA_HOME/jre/lib/security/` with the new `local_policy.jar` from the downloaded package and restart the Solr node.
+====
+
+[[KerberosAuthenticationPlugin-UsingDelegationTokens]]
+=== Using Delegation Tokens
+
+The Kerberos plugin can be configured to use delegation tokens, which allow an application to reuse the authentication of an end-user or another application.
+
+There are a few use cases for Solr where this might be helpful:
+
+* Using distributed clients (such as MapReduce) where each client may not have access to the user's credentials.
+* When load on the Kerberos server is high. Delegation tokens can reduce the load because clients do not need to contact the Kerberos server after the first request.
+* If requests or permissions need to be delegated to another user.
+
+To enable delegation tokens, several parameters must be defined. These parameters can be passed at the command line with the `bin/solr` start command (see <<solr-control-script-reference.adoc#solr-control-script-reference,Solr Control Script Reference>> for details on how to pass system parameters) or defined in `bin/solr.in.sh` or `bin/solr.in.cmd` as appropriate for your operating system.
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="30,10,60",options="header"]
+|===
+|Parameter Name |Required |Description
+|`solr.kerberos.delegation.token.enabled` |Yes, to enable tokens |False by default, set to true to enable delegation tokens.
+|`solr.kerberos.delegation.token.kind` |No |Type of delegation tokens. By default this is `solr-dt`. Likely this does not need to change. No other option is available at this time.
+|`solr.kerberos.delegation.token.validity` |No |Time, in seconds, for which delegation tokens are valid. The default is 36000 seconds.
+|`solr.kerberos.delegation.token.signer.secret.provider` |No |Where delegation token information is stored internally. The default is `zookeeper` which must be the location for delegation tokens to work across Solr servers (when running in SolrCloud mode). No other option is available at this time.
+|`solr.kerberos.delegation.token.signer.secret.provider.zookeper.path` |No |The ZooKeeper path where the secret provider information is stored. This is in the form of the path + /security/token. The path can include the chroot or the chroot can be omitted if you are not using it. This example includes the chroot: `server1:9983,server2:9983,server3:9983/solr/security/token`.
+|`solr.kerberos.delegation.token.secret.manager.znode.working.path` |No |The ZooKeeper path where token information is stored. This is in the form of the path + /security/zkdtsm. The path can include the chroot or the chroot can be omitted if you are not using it. This example includes the chroot: `server1:9983,server2:9983,server3:9983/solr/security/zkdtsm`.
+|===
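The delegation token parameters above can be combined into the startup options. For example, a hypothetical addition to `bin/solr.in.sh` (values are placeholders for illustration):

```shell
# Enable delegation tokens with the default 10-hour validity.
SOLR_AUTHENTICATION_OPTS="$SOLR_AUTHENTICATION_OPTS \
 -Dsolr.kerberos.delegation.token.enabled=true \
 -Dsolr.kerberos.delegation.token.validity=36000"
```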
+
+[[KerberosAuthenticationPlugin-StartSolr]]
+=== Start Solr
+
+Once the configuration is complete, you can start Solr with the `bin/solr` script, as in the example below, which is for users in SolrCloud mode only. This example assumes you modified `bin/solr.in.sh` or `bin/solr.in.cmd` with the proper values, but if you did not, you would pass the system parameters along with the start command. Note that you also need to customize the `-z` property as appropriate for the location of your ZooKeeper nodes.
+
+[source,bash]
+----
+bin/solr start -c -z server1:2181,server2:2181,server3:2181/solr
+----
+
+[[KerberosAuthenticationPlugin-TesttheConfiguration]]
+=== Test the Configuration
+
+. Do a `kinit` with your username. For example, `kinit \user@EXAMPLE.COM`.
+. Try to access Solr using `curl`. You should get a successful response.
++
+[source,bash]
+----
+curl --negotiate -u : "http://192.168.0.107:8983/solr/"
+----
+
+[[KerberosAuthenticationPlugin-UsingSolrJwithaKerberizedSolr]]
+== Using SolrJ with a Kerberized Solr
+
+To use Kerberos authentication in a SolrJ application, you need the following two lines before you create a SolrClient:
+
+[source,java]
+----
+System.setProperty("java.security.auth.login.config", "/home/foo/jaas-client.conf");
+HttpClientUtil.setConfigurer(new Krb5HttpClientConfigurer());
+----
+
+You need to specify a Kerberos service principal for the client and a corresponding keytab in the JAAS client configuration file above. This principal should be different from the service principal we created for Solr.
+
+Here’s an example:
+
+[source,plain]
+----
+SolrJClient {
+  com.sun.security.auth.module.Krb5LoginModule required
+  useKeyTab=true
+  keyTab="/keytabs/foo.keytab"
+  storeKey=true
+  useTicketCache=true
+  debug=true
+  principal="solrclient@EXAMPLE.COM";
+};
+----
+
+[[KerberosAuthenticationPlugin-DelegationTokenswithSolrJ]]
+=== Delegation Tokens with SolrJ
+
+Delegation tokens are also supported with SolrJ, in the following ways:
+
+* `DelegationTokenRequest` and `DelegationTokenResponse` can be used to get, cancel, and renew delegation tokens.
+* `HttpSolrClient.Builder` includes a `withDelegationToken` function for creating an HttpSolrClient that uses a delegation token to authenticate.
+
+Sample code to get a delegation token:
+
+[source,java]
+----
+private String getDelegationToken(final String renewer, final String user, HttpSolrClient solrClient) throws Exception {
+  DelegationTokenRequest.Get get = new DelegationTokenRequest.Get(renewer) {
+    @Override
+    public SolrParams getParams() {
+      ModifiableSolrParams params = new ModifiableSolrParams(super.getParams());
+      params.set("user", user);
+      return params;
+    }
+  };
+  DelegationTokenResponse.Get getResponse = get.process(solrClient);
+  return getResponse.getDelegationToken();
+}
+----
+
+To create a `HttpSolrClient` that uses delegation tokens:
+
+[source,java]
+----
+HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").withDelegationToken(token).build();
+----
+
+To create a `CloudSolrClient` that uses delegation tokens:
+
+[source,java]
+----
+CloudSolrClient client = new CloudSolrClient.Builder()
+    .withZkHost("localhost:2181")
+    .withLBHttpSolrClientBuilder(new LBHttpSolrClient.Builder()
+        .withResponseParser(new DelegationTokenResponse.JsonMapResponseParser())
+        .withHttpSolrClientBuilder(
+            new HttpSolrClient.Builder()
+                .withKerberosDelegationToken(token)))
+    .build();
+----
+
+[TIP]
+====
+Hadoop's delegation token responses are in JSON map format. A response parser for that is available in `DelegationTokenResponse`. Other response parsers may not work well with Hadoop responses.
+====

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/95968c69/solr/solr-ref-guide/src/language-analysis.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
new file mode 100644
index 0000000..ea12424
--- /dev/null
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -0,0 +1,1608 @@
+= Language Analysis
+:page-shortname: language-analysis
+:page-permalink: language-analysis.html
+
+This section contains information about tokenizers and filters related to character set conversion or for use with specific languages.
+
+For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and/or a relatively small set of punctuation characters.
+
+In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words.
+
+For information about language detection at index time, see <<detecting-languages-during-indexing.adoc#detecting-languages-during-indexing,Detecting Languages During Indexing>>.
+
+[[LanguageAnalysis-KeywordMarkerFilterFactory]]
+== KeywordMarkerFilterFactory
+
+Protects words from being modified by stemmers. A customized protected word list may be specified with the `protected` attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.
+
+A sample Solr `protwords.txt` with comments can be found in the `sample_techproducts_configs` <<config-sets.adoc#config-sets,config set>> directory. A field type that uses it might look like this:
+
+[source,xml]
+----
+<fieldtype name="myfieldtype" class="solr.TextField">
+  <analyzer>
+    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
+    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
+    <filter class="solr.PorterStemFilterFactory" />
+  </analyzer>
+</fieldtype>
+----
+
+[[LanguageAnalysis-KeywordRepeatFilterFactory]]
+== KeywordRepeatFilterFactory
+
+Emits each token twice, once with the `KEYWORD` attribute and once without.
+
+If placed before a stemmer, the result is that the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.
+
+To configure, add the `KeywordRepeatFilterFactory` early in the analysis chain. It is recommended to also include `RemoveDuplicatesTokenFilterFactory` to avoid duplicates when tokens are not stemmed.
+
+A sample fieldType configuration could look like this:
+
+[source,xml]
+----
+<fieldtype name="english_stem_preserve_original" class="solr.TextField">
+  <analyzer>
+    <tokenizer class="solr.StandardTokenizerFactory"/>
+    <filter class="solr.KeywordRepeatFilterFactory" />
+    <filter class="solr.PorterStemFilterFactory" />
+    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
+  </analyzer>
+</fieldtype>
+----
+
+IMPORTANT: When the same token is added twice, it will also score twice (double), so you may have to re-tune your ranking rules.
+
+
+[[LanguageAnalysis-StemmerOverrideFilterFactory]]
+== StemmerOverrideFilterFactory
+
+Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers.
+
+A customized mapping of words to stems, in a tab-separated file, can be specified with the `dictionary` attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer.
+
+A sample http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/stemdict.txt[stemdict.txt] with comments can be found in the Source Repository.
+
+[source,xml]
+----
+<fieldtype name="myfieldtype" class="solr.TextField">
+  <analyzer>
+    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
+    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
+    <filter class="solr.PorterStemFilterFactory" />
+  </analyzer>
+</fieldtype>
+----
+
+[[LanguageAnalysis-DictionaryCompoundWordTokenFilter]]
+== Dictionary Compound Word Token Filter
+
+This filter splits, or _decompounds_, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
+
+Compound words are most commonly found in Germanic languages.
+
+*Factory class:* `solr.DictionaryCompoundWordTokenFilterFactory`
+
+*Arguments:*
+
+`dictionary`:: (required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that begin with "#" are ignored. This path may be an absolute path, or path relative to the Solr config directory.
+
+`minWordSize`:: (integer, default 5) Any token shorter than this is not decompounded.
+
+`minSubwordSize`:: (integer, default 2) Subwords shorter than this are not emitted as tokens.
+
+`maxSubwordSize`:: (integer, default 15) Subwords longer than this are not emitted as tokens.
+
+`onlyLongestMatch`:: (true/false) If true (the default), only the longest matching subwords will generate new tokens.
+
+*Example:*
+
+Assume that `germanwords.txt` contains at least the following words: `dumm kopf donau dampf schiff`
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
+</analyzer>
+----
+
+*In:* "Donaudampfschiff dummkopf"
+
+*Tokenizer to Filter:* "Donaudampfschiff"(1), "dummkopf"(2)
+
+*Out:* "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
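The behavior above can be sketched with a tiny, hypothetical decompounder. This is a deliberate simplification of the real filter (it ignores `minWordSize`, `maxSubwordSize`, and `onlyLongestMatch`), shown only to clarify the idea of emitting dictionary subwords alongside the original token:

```java
import java.util.*;

public class DecompoundSketch {
    // Emit the token itself, plus any dictionary words found inside it
    // (greatly simplified compared to Lucene's DictionaryCompoundWordTokenFilter).
    static List<String> decompound(String token, Set<String> dict, int minSubword) {
        List<String> out = new ArrayList<>();
        out.add(token); // the input token always passes through unchanged
        String lower = token.toLowerCase(Locale.GERMAN);
        for (int i = 0; i < lower.length(); i++) {
            for (int j = i + minSubword; j <= lower.length(); j++) {
                if (dict.contains(lower.substring(i, j))) {
                    out.add(lower.substring(i, j));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("dumm", "kopf", "donau", "dampf", "schiff"));
        System.out.println(decompound("Dummkopf", dict, 2)); // [Dummkopf, dumm, kopf]
    }
}
```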
+
+[[LanguageAnalysis-UnicodeCollation]]
+== Unicode Collation
+
+Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes.
+
+Unicode Collation in Solr is fast, because all the work is done at index time.
+
+Rather than specifying an analyzer within `<fieldtype ... class="solr.TextField">`, the `solr.CollationField` and `solr.ICUCollationField` field type classes provide this functionality. `solr.ICUCollationField`, which is backed by http://site.icu-project.org[the ICU4J library], provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs `solr.CollationField`.
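To illustrate what locale-sensitive comparison changes, here is a minimal sketch using the JDK's `java.text.Collator` (this is illustration only, not Solr's indexing code):

```java
import java.text.Collator;
import java.util.Locale;

public class CollationSketch {
    public static void main(String[] args) {
        // Binary (code point) order puts "Töne" after "Tuna",
        // because 'ö' (U+00F6) is greater than 'u' (U+0075).
        System.out.println("Töne".compareTo("Tuna") > 0); // true

        // A German collator treats 'ö' as a variant of 'o',
        // so "Töne" correctly sorts before "Tuna".
        Collator german = Collator.getInstance(Locale.GERMAN);
        System.out.println(german.compare("Töne", "Tuna") < 0); // true
    }
}
```

Collation fields apply this kind of comparison at index time by storing collation keys, which is why sorting at query time stays fast.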
+
+`solr.ICUCollationField` is included in the Solr `analysis-extras` contrib - see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `SOLR_HOME/lib` in order to use it.
+
+`solr.ICUCollationField` and `solr.CollationField` fields can be created in two ways:
+
+* Based upon a system collator associated with a Locale.
+* Based upon a tailored `RuleBasedCollator` ruleset.
+
+*Arguments for `solr.ICUCollationField`, specified as attributes within the `<fieldtype>` element:*
+
+Using a System collator:
+
+`locale`:: (required) http://www.rfc-editor.org/rfc/rfc3066.txt[RFC 3066] locale ID. See http://demo.icu-project.org/icu-bin/locexp[the ICU locale explorer] for a list of supported locales.
+
+`strength`:: Valid values are `primary`, `secondary`, `tertiary`, `quaternary`, or `identical`. See http://userguide.icu-project.org/collation/concepts#TOC-Comparison-Levels[Comparison Levels in ICU Collation Concepts] for more information.
+
+`decomposition`:: Valid values are `no` or `canonical`. See http://userguide.icu-project.org/collation/concepts#TOC-Normalization[Normalization in ICU Collation Concepts] for more information.
+
+Using a Tailored ruleset:
+
+`custom`:: (required) Path to a UTF-8 text file containing rules supported by the ICU http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedCollator.html[`RuleBasedCollator`]
+
+`strength`:: Valid values are `primary`, `secondary`, `tertiary`, `quaternary`, or `identical`. See http://userguide.icu-project.org/collation/concepts#TOC-Comparison-Levels[Comparison Levels in ICU Collation Concepts] for more information.
+
+`decomposition`:: Valid values are `no` or `canonical`. See http://userguide.icu-project.org/collation/concepts#TOC-Normalization[Normalization in ICU Collation Concepts] for more information.
+
+Expert options:
+
+`alternate`:: Valid values are `shifted` or `non-ignorable`. Can be used to ignore punctuation/whitespace.
+
+`caseLevel`:: (true/false) If true, in combination with `strength="primary"`, accents are ignored but case is taken into account. The default is false. See http://userguide.icu-project.org/collation/concepts#TOC-CaseLevel[CaseLevel in ICU Collation Concepts] for more information.
+
+`caseFirst`:: Valid values are `lower` or `upper`. Useful to control which is sorted first when case is not ignored.
+
+`numeric`:: (true/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false.
+
+`variableTop`:: Single character or contraction. Controls what is variable for `alternate`.
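+
+For example, the expert options can be combined to produce punctuation-insensitive, numerically aware sorting. This is a sketch; the field type name is illustrative:
+
+[source,xml]
+----
+<fieldType name="collatedShifted" class="solr.ICUCollationField"
+           locale="en"
+           strength="primary"
+           alternate="shifted"
+           numeric="true" />
+----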
+
+[[LanguageAnalysis-SortingTextforaSpecificLanguage]]
+=== Sorting Text for a Specific Language
+
+In this example, text is sorted according to the default German rules provided by ICU4J.
+
+Locales are typically defined as a combination of language and country, but you can specify just the language if you want. For example, if you specify "de" as the language, you will get sorting that works well for the German language. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for Switzerland.
+
+[source,xml]
+----
+<!-- Define a field type for German collation -->
+<fieldType name="collatedGERMAN" class="solr.ICUCollationField"
+           locale="de"
+           strength="primary" />
+...
+<!-- Define a field to store the German collated manufacturer names. -->
+<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false" docValues="true"/>
+...
+<!-- Copy the text to this field. We could create French, English, Spanish versions too,
+     and sort differently for different users! -->
+<copyField source="manu" dest="manuGERMAN"/>
+----
+
+In the example above, we defined the strength as "primary". The strength of the collation determines how strict the sort order will be, but it also depends upon the language. For example, in English, "primary" strength ignores differences in case and accents.
+
+Another example:
+
+[source,xml]
+----
+<fieldType name="polishCaseInsensitive" class="solr.ICUCollationField"
+           locale="pl_PL"
+           strength="secondary" />
+...
+<field name="city" type="text_general" indexed="true" stored="true"/>
+...
+<field name="city_sort" type="polishCaseInsensitive" indexed="true" stored="false"/>
+...
+<copyField source="city" dest="city_sort"/>
+----
+
+The type will be used for the fields where the data contains Polish text. The "secondary" strength will ignore case differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the same base letter without diacritics.
+
+An example using the "city_sort" field to sort:
+
+[source,plain]
+----
+q=*:*&fl=city&sort=city_sort+asc
+----
+
+[[LanguageAnalysis-SortingTextforMultipleLanguages]]
+=== Sorting Text for Multiple Languages
+
+There are two approaches to supporting multiple languages: if there is a small list of languages you wish to support, consider defining collated fields for each language and using `copyField`. However, adding a large number of sort fields can increase disk and indexing costs. An alternative approach is to use the Unicode `default` collator.
+
+The Unicode `default` or `ROOT` locale has rules that are designed to work well for most languages. To use the `default` locale, simply define the locale as the empty string. This Unicode default sort is still significantly more advanced than the standard Solr sort.
+
+[source,xml]
+----
+<fieldType name="collatedROOT" class="solr.ICUCollationField"
+           locale=""
+           strength="primary" />
+----
+
+[[LanguageAnalysis-SortingTextwithCustomRules]]
+=== Sorting Text with Custom Rules
+
+You can define your own set of sorting rules. It's easiest to take existing rules that are close to what you want and customize them.
+
+In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts in German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For more information, see the http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedCollator.html[ICU RuleBasedCollator javadocs].
+
+This example shows how to create a custom rule set for `solr.ICUCollationField` and dump it to a file:
+
+[source,java]
+----
+// get the default rules for Germany
+// these are called DIN 5007-1 sorting
+RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new ULocale("de", "DE"));
+
+// define some tailorings, to make it DIN 5007-2 sorting.
+// For example, this makes ö equivalent to oe
+String DIN5007_2_tailorings =
+    "& ae , a\u0308 & AE , A\u0308"+
+    "& oe , o\u0308 & OE , O\u0308"+
+    "& ue , u\u0308 & UE , u\u0308";
+
+// concatenate the default rules to the tailorings, and dump it to a String
+RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
+String tailoredRules = tailoredCollator.getRules();
+
+// write these to a file, be sure to use UTF-8 encoding!!!
+FileOutputStream os = new FileOutputStream(new File("/solr_home/conf/customRules.dat"));
+IOUtils.write(tailoredRules, os, "UTF-8");
+----
+
+This rule set can now be used for custom collation in Solr:
+
+[source,xml]
+----
+<fieldType name="collatedCUSTOM" class="solr.ICUCollationField"
+           custom="customRules.dat"
+           strength="primary" />
+----
+
+[[LanguageAnalysis-JDKCollation]]
+=== JDK Collation
+
+As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use ICU4J for some reason, you can use `solr.CollationField`.
+
+The principles of JDK Collation are the same as those of ICU Collation; you just specify `language`, `country` and `variant` arguments instead of the combined `locale` argument.
+
+*Arguments for `solr.CollationField`, specified as attributes within the `<fieldtype>` element:*
+
+Using a System collator (see http://www.oracle.com/technetwork/java/javase/java8locales-2095355.html[Oracle's list of locales supported in Java 8]):
+
+`language`:: (required) http://www.loc.gov/standards/iso639-2/php/code_list.php[ISO-639] language code
+
+`country`:: http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements.htm[ISO-3166] country code
+
+`variant`:: Vendor or browser-specific code
+
+`strength`:: Valid values are `primary`, `secondary`, `tertiary` or `identical`. See http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html[Oracle Java 8 Collator javadocs] for more information.
+
+`decomposition`:: Valid values are `no`, `canonical`, or `full`. See http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html[Oracle Java 8 Collator javadocs] for more information.
+
+Using a Tailored ruleset:
+
+`custom`:: (required) Path to a UTF-8 text file containing rules supported by the http://docs.oracle.com/javase/8/docs/api/java/text/RuleBasedCollator.html[`JDK RuleBasedCollator`]
+
+`strength`:: Valid values are `primary`, `secondary`, `tertiary` or `identical`. See http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html[Oracle Java 8 Collator javadocs] for more information.
+
+`decomposition`:: Valid values are `no`, `canonical`, or `full`. See http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html[Oracle Java 8 Collator javadocs] for more information.
+
+.A `solr.CollationField` example:
+[source,xml]
+----
+<fieldType name="collatedGERMAN" class="solr.CollationField"
+           language="de"
+           country="DE"
+           strength="primary" /> <!-- ignore Umlauts and letter case when sorting -->
+...
+<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false" docValues="true" />
+...
+<copyField source="manu" dest="manuGERMAN"/>
+----
+
+== ASCII & Decimal Folding Filters
+
+[[LanguageAnalysis-AsciiFolding]]
+=== ASCII Folding
+
+This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Only those characters with reasonable ASCII alternatives are converted.
+
+This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.
+
+*Factory class:* `solr.ASCIIFoldingFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ASCIIFoldingFilterFactory"/>
+</analyzer>
+----
+
+*In:* "Björn Ångström"
+
+*Tokenizer to Filter:* "Björn", "Ångström"
+
+*Out:* "Bjorn", "Angstrom"
+
+[[LanguageAnalysis-DecimalDigitFolding]]
+=== Decimal Digit Folding
+
+This filter converts any character in the Unicode "Decimal Number" general category (`Nd`) into its equivalent Basic Latin digit (0-9).
+
+This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.
+
+*Factory class:* `solr.DecimalDigitFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.DecimalDigitFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Language-SpecificFactories]]
+== Language-Specific Factories
+
+These factories are each designed to work with specific languages. The languages covered here are:
+
+* <<Arabic>>
+* <<Brazilian Portuguese>>
+* <<Bulgarian>>
+* <<Catalan>>
+* <<Chinese>>
+* <<Simplified Chinese>>
+* <<CJK>>
+* <<LanguageAnalysis-Czech,Czech>>
+* <<LanguageAnalysis-Danish,Danish>>
+* <<Dutch>>
+* <<Finnish>>
+* <<French>>
+* <<Galician>>
+* <<German>>
+* <<Greek>>
+* <<LanguageAnalysis-Hebrew_Lao_Myanmar_Khmer,Hebrew, Lao, Myanmar, Khmer>>
+* <<Hindi>>
+* <<Indonesian>>
+* <<Italian>>
+* <<Irish>>
+* <<Japanese>>
+* <<Latvian>>
+* <<Norwegian>>
+* <<Persian>>
+* <<Polish>>
+* <<Portuguese>>
+* <<Romanian>>
+* <<Russian>>
+* <<Scandinavian>>
+* <<Serbian>>
+* <<Spanish>>
+* <<Swedish>>
+* <<Thai>>
+* <<Turkish>>
+* <<Ukrainian>>
+
+[[LanguageAnalysis-Arabic]]
+=== Arabic
+
+Solr provides support for the http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf[Light-10] (PDF) stemming algorithm, and Lucene includes an example stopword list.
+
+This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.
+
+*Factory classes:* `solr.ArabicStemFilterFactory`, `solr.ArabicNormalizationFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ArabicNormalizationFilterFactory"/>
+  <filter class="solr.ArabicStemFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-BrazilianPortuguese]]
+=== Brazilian Portuguese
+
+This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class `org.apache.lucene.analysis.br.BrazilianStemmer`. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.
+
+*Factory class:* `solr.BrazilianStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.BrazilianStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "praia praias"
+
+*Tokenizer to Filter:* "praia", "praias"
+
+*Out:* "pra", "pra"
+
+[[LanguageAnalysis-Bulgarian]]
+=== Bulgarian
+
+Solr includes a light stemmer for Bulgarian, following http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf[this algorithm] (PDF), and Lucene includes an example stopword list.
+
+*Factory class:* `solr.BulgarianStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.BulgarianStemFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Catalan]]
+=== Catalan
+
+Solr can stem Catalan using the Snowball Porter Stemmer with an argument of `language="Catalan"`. Solr includes a set of contractions for Catalan, which can be stripped using `solr.ElisionFilterFactory`.
+
+*Factory class:* `solr.SnowballPorterFilterFactory`
+
+*Arguments:*
+
+`language`:: (required) stemmer language, "Catalan" in this case
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.ElisionFilterFactory"
+          articles="lang/contractions_ca.txt"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
+</analyzer>
+----
+
+*In:* "llengües llengua"
+
+*Tokenizer to Filter:* "llengües"(1) "llengua"(2),
+
+*Out:* "llengu"(1), "llengu"(2)
+
+[[LanguageAnalysis-Chinese]]
+=== Chinese
+
+[[LanguageAnalysis-ChineseTokenizer]]
+==== Chinese Tokenizer
+
+The Chinese Tokenizer is deprecated as of Solr 3.4. Use the <<tokenizers.adoc#Tokenizers-StandardTokenizer,`solr.StandardTokenizerFactory`>> instead.
+
+*Factory class:* `solr.ChineseTokenizerFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.ChineseTokenizerFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-ChineseFilterFactory]]
+==== Chinese Filter Factory
+
+The Chinese Filter Factory is deprecated as of Solr 3.4. Use the <<filter-descriptions.adoc#FilterDescriptions-StopFilter,`solr.StopFilterFactory`>> instead.
+
+*Factory class:* `solr.ChineseFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ChineseFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-SimplifiedChinese]]
+=== Simplified Chinese
+
+For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the `solr.HMMChineseTokenizerFactory` in the `analysis-extras` contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this filter, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
+
+*Factory class:* `solr.HMMChineseTokenizerFactory`
+
+*Arguments:* None
+
+*Examples:*
+
+To use the default setup with fallback to English Porter stemmer for English words, use:
+
+`<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>`
+
+Or to configure your own analysis setup, use the `solr.HMMChineseTokenizerFactory` along with your custom filter setup.
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
+  <filter class="solr.StopFilterFactory"
+          words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
+  <filter class="solr.PorterStemFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-CJK]]
+=== CJK
+
+This tokenizer breaks Chinese, Japanese and Korean text into tokens, since these languages are not whitespace-delimited. The tokens generated by this tokenizer are "doubles": overlapping pairs of CJK characters found in the field text.
+
+*Factory class:* `solr.CJKTokenizerFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.CJKTokenizerFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Czech]]
+=== Czech
+
+Solr includes a light stemmer for Czech, following https://dl.acm.org/citation.cfm?id=1598600[this algorithm], and Lucene includes an example stopword list.
+
+*Factory class:* `solr.CzechStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.CzechStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "prezidenští, prezidenta, prezidentského"
+
+*Tokenizer to Filter:* "prezidenští", "prezidenta", "prezidentského"
+
+*Out:* "preziden", "preziden", "preziden"
+
+[[LanguageAnalysis-Danish]]
+=== Danish
+
+Solr can stem Danish using the Snowball Porter Stemmer with an argument of `language="Danish"`.
+
+Also relevant are the <<LanguageAnalysis-Scandinavian,Scandinavian normalization filters>>.
+
+*Factory class:* `solr.SnowballPorterFilterFactory`
+
+*Arguments:*
+
+`language`:: (required) stemmer language, "Danish" in this case
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Danish" />
+</analyzer>
+----
+
+*In:* "undersøg undersøgelse"
+
+*Tokenizer to Filter:* "undersøg"(1) "undersøgelse"(2),
+
+*Out:* "undersøg"(1), "undersøg"(2)
+
+
+[[LanguageAnalysis-Dutch]]
+=== Dutch
+
+Solr can stem Dutch using the Snowball Porter Stemmer with an argument of `language="Dutch"`.
+
+*Factory class:* `solr.SnowballPorterFilterFactory`
+
+*Arguments:*
+
+`language`:: (required) stemmer language, "Dutch" in this case
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
+</analyzer>
+----
+
+*In:* "kanaal kanalen"
+
+*Tokenizer to Filter:* "kanaal", "kanalen"
+
+*Out:* "kanal", "kanal"
+
+[[LanguageAnalysis-Finnish]]
+=== Finnish
+
+Solr includes support for stemming Finnish, and Lucene includes an example stopword list.
+
+*Factory class:* `solr.FinnishLightStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.FinnishLightStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "kala kalat"
+
+*Tokenizer to Filter:* "kala", "kalat"
+
+*Out:* "kala", "kala"
+
+
+[[LanguageAnalysis-French]]
+=== French
+
+[[LanguageAnalysis-ElisionFilter]]
+==== Elision Filter
+
+Removes article elisions from a token stream. This filter can be useful for languages such as French, Catalan, Italian, and Irish.
+
+*Factory class:* `solr.ElisionFilterFactory`
+
+*Arguments:*
+
+`articles`:: The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words such as "le", which are commonly abbreviated, such as in _l'avion_ (the plane). This file should include the abbreviated form, which precedes the apostrophe. In this case, simply "_l_". If no `articles` attribute is specified, a default set of French articles is used.
+
+`ignoreCase`:: (boolean) If true, the filter ignores the case of words when comparing them to the common word file. Defaults to `false`
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ElisionFilterFactory"
+          ignoreCase="true"
+          articles="lang/contractions_fr.txt"/>
+</analyzer>
+----
+
+*In:* "L'histoire d'art"
+
+*Tokenizer to Filter:* "L'histoire", "d'art"
+
+*Out:* "histoire", "art"
+
+[[LanguageAnalysis-FrenchLightStemFilter]]
+==== French Light Stem Filter
+
+Solr includes three stemmers for French: one in the `solr.SnowballPorterFilterFactory`, a lighter stemmer called `solr.FrenchLightStemFilterFactory`, and an even less aggressive stemmer called `solr.FrenchMinimalStemFilterFactory`. Lucene includes an example stopword list.
+
+*Factory classes:* `solr.FrenchLightStemFilterFactory`, `solr.FrenchMinimalStemFilterFactory`
+
+*Arguments:* None
+
+*Examples:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.ElisionFilterFactory"
+          articles="lang/contractions_fr.txt"/>
+  <filter class="solr.FrenchLightStemFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.ElisionFilterFactory"
+          articles="lang/contractions_fr.txt"/>
+  <filter class="solr.FrenchMinimalStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "le chat, les chats"
+
+*Tokenizer to Filter:* "le", "chat", "les", "chats"
+
+*Out:* "le", "chat", "le", "chat"
+
+
+[[LanguageAnalysis-Galician]]
+=== Galician
+
+Solr includes a stemmer for Galician following http://bvg.udc.es/recursos_lingua/stemming.jsp[this algorithm], and Lucene includes an example stopword list.
+
+*Factory class:* `solr.GalicianStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.GalicianStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "felizmente Luzes"
+
+*Tokenizer to Filter:* "felizmente", "luzes"
+
+*Out:* "feliz", "luz"
+
+
+[[LanguageAnalysis-German]]
+=== German
+
+Solr includes four stemmers for German: one in the `solr.SnowballPorterFilterFactory language="German"`, a stemmer called `solr.GermanStemFilterFactory`, a lighter stemmer called `solr.GermanLightStemFilterFactory`, and an even less aggressive stemmer called `solr.GermanMinimalStemFilterFactory`. Lucene includes an example stopword list.
+
+*Factory classes:* `solr.SnowballPorterFilterFactory`, `solr.GermanStemFilterFactory`, `solr.GermanLightStemFilterFactory`, `solr.GermanMinimalStemFilterFactory`
+
+*Arguments:* None
+
+*Examples:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory "/>
+  <filter class="solr.GermanStemFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.GermanLightStemFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory "/>
+  <filter class="solr.GermanMinimalStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "haus häuser"
+
+*Tokenizer to Filter:* "haus", "häuser"
+
+*Out:* "haus", "haus"
+
+
+[[LanguageAnalysis-Greek]]
+=== Greek
+
+This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.
+
+*Factory class:* `solr.GreekLowerCaseFilterFactory`
+
+*Arguments:* None
+
+[IMPORTANT]
+====
+Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in these encodings, please use Java's character set conversion facilities (InputStreamReader, etc.) during I/O, so that Lucene can analyze this text as Unicode instead.
+====
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.GreekLowerCaseFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Hindi]]
+=== Hindi
+
+Solr includes support for stemming Hindi following http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf[this algorithm] (PDF), support for common spelling differences through the `solr.HindiNormalizationFilterFactory`, support for encoding differences through the `solr.IndicNormalizationFilterFactory` following http://ldc.upenn.edu/myl/IndianScriptsUnicode.html[this algorithm], and Lucene includes an example stopword list.
+
+*Factory classes:* `solr.IndicNormalizationFilterFactory`, `solr.HindiNormalizationFilterFactory`, `solr.HindiStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.IndicNormalizationFilterFactory"/>
+  <filter class="solr.HindiNormalizationFilterFactory"/>
+  <filter class="solr.HindiStemFilterFactory"/>
+</analyzer>
+----
+
+
+[[LanguageAnalysis-Indonesian]]
+=== Indonesian
+
+Solr includes support for stemming Indonesian (Bahasa Indonesia) following http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf[this algorithm] (PDF), and Lucene includes an example stopword list.
+
+*Factory class:* `solr.IndonesianStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
+</analyzer>
+----
+
+*In:* "sebagai sebagainya"
+
+*Tokenizer to Filter:* "sebagai", "sebagainya"
+
+*Out:* "bagai", "bagai"
+
+[[LanguageAnalysis-Italian]]
+=== Italian
+
+Solr includes two stemmers for Italian: one in the `solr.SnowballPorterFilterFactory language="Italian"`, and a lighter stemmer called `solr.ItalianLightStemFilterFactory`. Lucene includes an example stopword list.
+
+*Factory classes:* `solr.SnowballPorterFilterFactory`, `solr.ItalianLightStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.ElisionFilterFactory"
+          articles="lang/contractions_it.txt"/>
+  <filter class="solr.ItalianLightStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "propaga propagare propagamento"
+
+*Tokenizer to Filter:* "propaga", "propagare", "propagamento"
+
+*Out:* "propag", "propag", "propag"
+
+[[LanguageAnalysis-Irish]]
+=== Irish
+
+Solr can stem Irish using the Snowball Porter Stemmer with an argument of `language="Irish"`. Solr includes `solr.IrishLowerCaseFilterFactory`, which can handle Irish-specific constructs. Solr also includes a set of contractions for Irish which can be stripped using `solr.ElisionFilterFactory`.
+
+*Factory class:* `solr.SnowballPorterFilterFactory`
+
+*Arguments:*
+
+`language`:: (required) stemmer language, "Irish" in this case
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ElisionFilterFactory"
+          articles="lang/contractions_ga.txt"/>
+  <filter class="solr.IrishLowerCaseFilterFactory"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Irish" />
+</analyzer>
+----
+
+*In:* "siopadóireacht síceapatacha b'fhearr m'athair"
+
+*Tokenizer to Filter:* "siopadóireacht", "síceapatacha", "b'fhearr", "m'athair"
+
+*Out:* "siopadóir", "síceapaite", "fearr", "athair"
+
+[[LanguageAnalysis-Japanese]]
+=== Japanese
+
+Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which includes several analysis components - more details on each below:
+
+* `JapaneseIterationMarkCharFilter` normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
+* `JapaneseTokenizer` tokenizes Japanese using morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
+* `JapaneseBaseFormFilter` replaces original terms with their base forms (a.k.a. lemmas).
+* `JapanesePartOfSpeechStopFilter` removes terms that have one of the configured parts-of-speech.
+* `JapaneseKatakanaStemFilter` normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
+
+Also useful for Japanese analysis, from lucene-analyzers-common:
+
+* `CJKWidthFilter` folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
+
+[[LanguageAnalysis-JapaneseIterationMarkCharFilter]]
+==== Japanese Iteration Mark CharFilter
+
+Normalizes horizontal Japanese iteration marks (odoriji) to their expanded form. Vertical iteration marks are not supported.
+
+*Factory class:* `JapaneseIterationMarkCharFilterFactory`
+
+*Arguments:*
+
+`normalizeKanji`:: set to `false` to not normalize kanji iteration marks (default is `true`)
+
+`normalizeKana`:: set to `false` to not normalize kana iteration marks (default is `true`)
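+
+For example, to expand iteration marks before morphological analysis (a minimal sketch):
+
+[source,xml]
+----
+<analyzer>
+  <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" normalizeKanji="true" normalizeKana="true"/>
+  <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
+</analyzer>
+----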
+
+[[LanguageAnalysis-JapaneseTokenizer]]
+==== Japanese Tokenizer
+
+Tokenizer for Japanese that uses morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
+
+`JapaneseTokenizer` has a `search` mode (the default) that does segmentation useful for search: a heuristic is used to segment compound terms into their constituent parts while also keeping the original compound terms as synonyms.
+
+*Factory class:* `solr.JapaneseTokenizerFactory`
+
+*Arguments:*
+
+`mode`:: Use `search` mode to get a noun-decompounding effect useful for search. `search` mode improves segmentation for search at the expense of part-of-speech accuracy. Valid values for `mode` are:
++
+* `normal`: default segmentation
+* `search`: segmentation useful for search (extra compound splitting)
+* `extended`: search mode plus unigramming of unknown words (experimental)
++
+For some applications it might be good to use `search` mode for indexing and `normal` mode for queries to increase precision and prevent parts of compounds from being matched and highlighted.
+
+`userDictionary`:: filename for a user dictionary, which allows overriding the statistical model with your own entries for segmentation, part-of-speech tags and readings without a need to specify weights. See `lang/userdict_ja.txt` for a sample user dictionary file.
+
+`userDictionaryEncoding`:: user dictionary encoding (default is UTF-8)
+
+`discardPunctuation`:: set to `false` to keep punctuation, `true` to discard (the default)
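+
+The index/query split suggested above can be sketched as follows; the field type name is illustrative:
+
+[source,xml]
+----
+<fieldType name="text_ja_precise" class="solr.TextField" positionIncrementGap="100">
+  <analyzer type="index">
+    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
+  </analyzer>
+  <analyzer type="query">
+    <tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>
+  </analyzer>
+</fieldType>
+----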
+
+[[LanguageAnalysis-JapaneseBaseFormFilter]]
+==== Japanese Base Form Filter
+
+Replaces original terms' text with the corresponding base form (lemma). (`JapaneseTokenizer` annotates each term with its base form.)
+
+*Factory class:* `JapaneseBaseFormFilterFactory`
+
+(no arguments)
+
+[[LanguageAnalysis-JapanesePartOfSpeechStopFilter]]
+==== Japanese Part Of Speech Stop Filter
+
+Removes terms with one of the configured parts-of-speech. `JapaneseTokenizer` annotates terms with parts-of-speech.
+
+*Factory class:* `JapanesePartOfSpeechStopFilterFactory`
+
+*Arguments:*
+
+`tags`:: filename for a list of parts-of-speech for which to remove terms; see `conf/lang/stoptags_ja.txt` in the `sample_techproducts_config` <<config-sets.adoc#config-sets,config set>> for an example.
+
+`enablePositionIncrements`:: if `luceneMatchVersion` is `4.3` or earlier and `enablePositionIncrements="false"`, no position holes will be left by this filter when it removes tokens. *This argument is invalid if `luceneMatchVersion` is `5.0` or later.*
+
+[[LanguageAnalysis-JapaneseKatakanaStemFilter]]
+==== Japanese Katakana Stem Filter
+
+Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
+
+`CJKWidthFilterFactory` should be specified prior to this filter to normalize half-width katakana to full-width.
+
+*Factory class:* `JapaneseKatakanaStemFilterFactory`
+
+*Arguments:*
+
+`minimumLength`:: terms below this length will not be stemmed. Default is 4, value must be 2 or more.
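+
+A minimal analyzer pairing the two filters in the recommended order (a sketch):
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.JapaneseTokenizerFactory"/>
+  <filter class="solr.CJKWidthFilterFactory"/>
+  <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
+</analyzer>
+----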
+
+[[LanguageAnalysis-CJKWidthFilter]]
+==== CJK Width Filter
+
+Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
+
+*Factory class:* `CJKWidthFilterFactory`
+
+(no arguments)
+
+Example:
+
+[source,xml]
+----
+<fieldType name="text_ja" positionIncrementGap="100" autoGeneratePhraseQueries="false">
+  <analyzer>
+    <!-- Uncomment if you need to handle iteration marks: -->
+    <!-- <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" /> -->
+    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
+    <filter class="solr.JapaneseBaseFormFilterFactory"/>
+    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
+    <filter class="solr.CJKWidthFilterFactory"/>
+    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
+    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
+    <filter class="solr.LowerCaseFilterFactory"/>
+  </analyzer>
+</fieldType>
+----
+
+[[LanguageAnalysis-Hebrew_Lao_Myanmar_Khmer]]
+=== Hebrew, Lao, Myanmar, Khmer
+
+In addition to the UAX#29 word break rules, Lucene provides support for Hebrew's use of the double and single quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables, with the `solr.ICUTokenizerFactory` in the `analysis-extras` contrib module. To use this tokenizer, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
+
+See <<tokenizers.adoc#Tokenizers-ICUTokenizer,the ICUTokenizer>> for more information.
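+
+For example, a field type sketch using this tokenizer (the `text_icu` name is arbitrary, and the `analysis-extras` jars are assumed to already be in `solr_home/lib`):
+
+[source,xml]
+----
+<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.ICUTokenizerFactory"/>
+  </analyzer>
+</fieldType>
+----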
+
+[[LanguageAnalysis-Latvian]]
+=== Latvian
+
+Solr includes support for stemming Latvian, and Lucene includes an example stopword list.
+
+*Factory class:* `solr.LatvianStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.StandardTokenizerFactory"/>
+    <filter class="solr.LowerCaseFilterFactory"/>
+    <filter class="solr.LatvianStemFilterFactory"/>
+  </analyzer>
+</fieldType>
+----
+
+*In:* "tirgiem tirgus"
+
+*Tokenizer to Filter:* "tirgiem", "tirgus"
+
+*Out:* "tirg", "tirg"
+
+[[LanguageAnalysis-Norwegian]]
+=== Norwegian
+
+Solr includes two classes for stemming Norwegian, `NorwegianLightStemFilterFactory` and `NorwegianMinimalStemFilterFactory`. Lucene includes an example stopword list.
+
+Another option is to use the Snowball Porter Stemmer with an argument of `language="Norwegian"`.
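+
+For example, a sketch of that Snowball alternative:
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
+</analyzer>
+----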
+
+Also relevant are the <<LanguageAnalysis-Scandinavian,Scandinavian normalization filters>>.
+
+[[LanguageAnalysis-NorwegianLightStemmer]]
+==== Norwegian Light Stemmer
+
+The `NorwegianLightStemFilterFactory` requires a "two-pass" sort for the -dom and -het endings. This means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules apply so it will be further stemmed to "krist". The effect of this is that "kristen," "kristendom," "kristendommen," and "kristendommens" will all be stemmed to "krist."
+
+The second pass is to pick up -dom and -het endings. Consider this example:
+
+[width="100%",options="header",]
+|===
+2+^|*One pass* 2+^|*Two passes*
+|*Before* |*After* |*Before* |*After*
+|forlegen |forleg |forlegen |forleg
+|forlegenhet |forlegen |forlegenhet |forleg
+|forlegenheten |forlegen |forlegenheten |forleg
+|forlegenhetens |forlegen |forlegenhetens |forleg
+|firkantet |firkant |firkantet |firkant
+|firkantethet |firkantet |firkantethet |firkant
+|firkantetheten |firkantet |firkantetheten |firkant
+|===
+
+*Factory class:* `solr.NorwegianLightStemFilterFactory`
+
+*Arguments:*
+
+`variant`:: Choose the Norwegian language variant to use. Valid values are:
++
+* `nb:` Bokmål (default)
+* `nn:` Nynorsk
+* `no:` both
+
+*Example:*
+
+[source,xml]
+----
+<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.StandardTokenizerFactory"/>
+    <filter class="solr.LowerCaseFilterFactory"/>
+    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_no.txt" format="snowball"/>
+    <filter class="solr.NorwegianLightStemFilterFactory"/>
+  </analyzer>
+</fieldType>
+----
+
+*In:* "Forelskelsen"
+
+*Tokenizer to Filter:* "forelskelsen"
+
+*Out:* "forelske"
+
+[[LanguageAnalysis-NorwegianMinimalStemmer]]
+==== Norwegian Minimal Stemmer
+
+The `NorwegianMinimalStemFilterFactory` stems plural forms of Norwegian nouns only.
+
+*Factory class:* `solr.NorwegianMinimalStemFilterFactory`
+
+*Arguments:*
+
+`variant`:: Choose the Norwegian language variant to use. Valid values are:
++
+* `nb:` Bokmål (default)
+* `nn:` Nynorsk
+* `no:` both
+
+*Example:*
+
+[source,xml]
+----
+<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.StandardTokenizerFactory"/>
+    <filter class="solr.LowerCaseFilterFactory"/>
+    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_no.txt" format="snowball"/>
+    <filter class="solr.NorwegianMinimalStemFilterFactory"/>
+  </analyzer>
+</fieldType>
+----
+
+*In:* "Bilens"
+
+*Tokenizer to Filter:* "bilens"
+
+*Out:* "bil"
+
+[[LanguageAnalysis-Persian]]
+=== Persian
+
+[[LanguageAnalysis-PersianFilterFactories]]
+==== Persian Filter Factories
+
+Solr includes support for normalizing Persian, and Lucene includes an example stopword list.
+
+*Factory class:* `solr.PersianNormalizationFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ArabicNormalizationFilterFactory"/>
+  <filter class="solr.PersianNormalizationFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Polish]]
+=== Polish
+
+Solr provides support for Polish stemming with the `solr.StempelPolishStemFilterFactory`, and lemmatization with the `solr.MorfologikFilterFactory`, in the `contrib/analysis-extras` module. The `solr.StempelPolishStemFilterFactory` component includes an algorithmic stemmer with tables for Polish. To use either of these filters, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
+
+*Factory class:* `solr.StempelPolishStemFilterFactory` and `solr.MorfologikFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.StempelPolishStemFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.MorfologikFilterFactory" dictionary="morfologik/stemming/polish/polish.dict"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+</analyzer>
+----
+
+*In:* "studenta studenci"
+
+*Tokenizer to Filter:* "studenta", "studenci"
+
+*Out:* "student", "student"
+
+More information about the Stempel stemmer is available in {lucene-javadocs}/analyzers-stempel/index.html[the Lucene javadocs].
+
+Note the lower case filter is applied _after_ the Morfologik stemmer; this is because the Polish dictionary contains proper names, so term case may be important to resolve ambiguities (or even to look up the correct lemma at all).
+
+The Morfologik dictionary parameter value is a constant specifying which dictionary to choose. The dictionary resource must be named `path/to/_language_.dict` and have an associated `.info` metadata file. See http://morfologik.blogspot.com/[the Morfologik project] for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.
+
+[[LanguageAnalysis-Portuguese]]
+=== Portuguese
+
+Solr includes four stemmers for Portuguese: one in the `solr.SnowballPorterFilterFactory`, an alternative stemmer called `solr.PortugueseStemFilterFactory`, a lighter stemmer called `solr.PortugueseLightStemFilterFactory`, and an even less aggressive stemmer called `solr.PortugueseMinimalStemFilterFactory`. Lucene includes an example stopword list.
+
+*Factory classes:* `solr.PortugueseStemFilterFactory`, `solr.PortugueseLightStemFilterFactory`, `solr.PortugueseMinimalStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.PortugueseStemFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.PortugueseLightStemFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.PortugueseMinimalStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "praia praias"
+
+*Tokenizer to Filter:* "praia", "praias"
+
+*Out:* "pra", "pra"
+
+
+[[LanguageAnalysis-Romanian]]
+=== Romanian
+
+Solr can stem Romanian using the Snowball Porter Stemmer with an argument of `language="Romanian"`.
+
+*Factory class:* `solr.SnowballPorterFilterFactory`
+
+*Arguments:*
+
+`language`:: (required) stemmer language, "Romanian" in this case
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
+</analyzer>
+----
+
+
+[[LanguageAnalysis-Russian]]
+=== Russian
+
+[[LanguageAnalysis-RussianStemFilter]]
+==== Russian Stem Filter
+
+Solr includes two stemmers for Russian: one in the `solr.SnowballPorterFilterFactory language="Russian"`, and a lighter stemmer called `solr.RussianLightStemFilterFactory`. Lucene includes an example stopword list.
+
+*Factory class:* `solr.RussianLightStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.RussianLightStemFilterFactory"/>
+</analyzer>
+----
+
+
+[[LanguageAnalysis-Scandinavian]]
+=== Scandinavian
+
+Scandinavian is a language group spanning three very similar languages: <<LanguageAnalysis-Norwegian,Norwegian>>, <<LanguageAnalysis-Swedish,Swedish>> and <<LanguageAnalysis-Danish,Danish>>.
+
+Swedish å, ä, ö are in fact the same letters as Norwegian and Danish å, æ, ø, and are thus interchangeable between these languages. However, they are folded differently when people type them on a keyboard that lacks these characters.
+
+In that situation, almost all Swedes type a, a, o instead of å, ä, ö, while Norwegians and Danes usually type aa, ae and oe instead of å, æ and ø. Some, however, use a, a, o, oo, ao, and sometimes permutations of all of the above.
+
+There are two filters that help normalize between the Scandinavian languages: `solr.ScandinavianNormalizationFilterFactory`, which tries to preserve the special characters (æäöå), and `solr.ScandinavianFoldingFilterFactory`, which folds them to broader forms (for example, ø/ö->o).
+
+See also each language section for other relevant filters.
+
+[[LanguageAnalysis-ScandinavianNormalizationFilter]]
+==== Scandinavian Normalization Filter
+
+This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and their folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
+
+It is a semantically less destructive solution than `ScandinavianFoldingFilter`, and is most useful when a person with a Norwegian or Danish keyboard queries a Swedish index, or vice versa. This filter does *not* perform the common Swedish folds of å and ä to a, or ö to o.
+
+*Factory class:* `solr.ScandinavianNormalizationFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.ScandinavianNormalizationFilterFactory"/>
+</analyzer>
+----
+
+*In:* "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"
+
+*Tokenizer to Filter:* "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
+
+*Out:* "blåbærsyltetøj", "blåbærsyltetøj", "blåbærsyltetøj", "blabarsyltetoj"
+
+[[LanguageAnalysis-ScandinavianFoldingFilter]]
+==== Scandinavian Folding Filter
+
+This filter folds the Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of the double vowels aa, ae, ao, oe and oo, leaving just the first one.
+
+It is a semantically more destructive solution than `ScandinavianNormalizationFilter`, but can additionally help match raksmorgas with räksmörgås.
+
+*Factory class:* `solr.ScandinavianFoldingFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.ScandinavianFoldingFilterFactory"/>
+</analyzer>
+----
+
+*In:* "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"
+
+*Tokenizer to Filter:* "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
+
+*Out:* "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj"
+
+[[LanguageAnalysis-Serbian]]
+=== Serbian
+
+[[LanguageAnalysis-SerbianNormalizationFilter]]
+==== Serbian Normalization Filter
+
+Solr includes a filter that normalizes Serbian Cyrillic and Latin characters. Note that this filter only works with lowercased input.
+
+See the Solr wiki for tips & advice on using this filter: https://wiki.apache.org/solr/SerbianLanguageSupport
+
+*Factory class:* `solr.SerbianNormalizationFilterFactory`
+
+*Arguments:*
+
+`haircut`:: Select the extent of normalization. Valid values are:
++
+* `bald`: (Default behavior) Cyrillic characters are first converted to Latin; then, Latin characters have their diacritics removed, with the exception of https://en.wikipedia.org/wiki/D_with_stroke[LATIN SMALL LETTER D WITH STROKE] (U+0111), which is converted to "`dj`"
+* `regular`: Only Cyrillic to Latin normalization is applied, preserving the Latin diacritics
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SerbianNormalizationFilterFactory" haircut="bald"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Spanish]]
+=== Spanish
+
+Solr includes two stemmers for Spanish: one in the `solr.SnowballPorterFilterFactory language="Spanish"`, and a lighter stemmer called `solr.SpanishLightStemFilterFactory`. Lucene includes an example stopword list.
+
+*Factory class:* `solr.SpanishLightStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SpanishLightStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "torear toreara torearlo"
+
+*Tokenizer to Filter:* "torear", "toreara", "torearlo"
+
+*Out:* "tor", "tor", "tor"
+
+
+[[LanguageAnalysis-Swedish]]
+=== Swedish
+
+[[LanguageAnalysis-SwedishStemFilter]]
+==== Swedish Stem Filter
+
+Solr includes two stemmers for Swedish: one in the `solr.SnowballPorterFilterFactory language="Swedish"`, and a lighter stemmer called `solr.SwedishLightStemFilterFactory`. Lucene includes an example stopword list.
+
+Also relevant are the <<LanguageAnalysis-Scandinavian,Scandinavian normalization filters>>.
+
+*Factory class:* `solr.SwedishLightStemFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+  <filter class="solr.SwedishLightStemFilterFactory"/>
+</analyzer>
+----
+
+*In:* "kloke klokhet klokheten"
+
+*Tokenizer to Filter:* "kloke", "klokhet", "klokheten"
+
+*Out:* "klok", "klok", "klok"
+
+
+[[LanguageAnalysis-Thai]]
+=== Thai
+
+This tokenizer converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai does not use whitespace to delimit words.
+
+*Factory class:* `solr.ThaiTokenizerFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer type="index">
+  <tokenizer class="solr.ThaiTokenizerFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+</analyzer>
+----
+
+[[LanguageAnalysis-Turkish]]
+=== Turkish
+
+Solr includes support for stemming Turkish with the `solr.SnowballPorterFilterFactory`; support for case-insensitive search with the `solr.TurkishLowerCaseFilterFactory`; support for stripping apostrophes and following suffixes with `solr.ApostropheFilterFactory` (see http://www.ipcsit.com/vol57/015-ICNI2012-M021.pdf[Role of Apostrophes in Turkish Information Retrieval]); and support for a form of stemming that truncates tokens at a configurable maximum length through the `solr.TruncateTokenFilterFactory` (see http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf[Information Retrieval on Turkish Texts]). Lucene also includes an example stopword list.
+
+*Factory class:* `solr.TurkishLowerCaseFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ApostropheFilterFactory"/>
+  <filter class="solr.TurkishLowerCaseFilterFactory"/>
+  <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
+</analyzer>
+----
+
+*Another example, illustrating diacritics-insensitive search:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.ApostropheFilterFactory"/>
+  <filter class="solr.TurkishLowerCaseFilterFactory"/>
+  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
+  <filter class="solr.KeywordRepeatFilterFactory"/>
+  <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
+  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+</analyzer>
+----
+
+
+[[LanguageAnalysis-Ukrainian]]
+=== Ukrainian
+
+Solr provides support for Ukrainian lemmatization with the `solr.MorfologikFilterFactory`, in the `contrib/analysis-extras` module. To use this filter, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
+
+Lucene also includes an example Ukrainian stopword list, in the `lucene-analyzers-morfologik` jar.
+
+*Factory class:* `solr.MorfologikFilterFactory`
+
+*Arguments:*
+
+`dictionary`:: (required) lemmatizer dictionary - the `lucene-analyzers-morfologik` jar contains a Ukrainian dictionary at `org/apache/lucene/analysis/uk/ukrainian.dict`.
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/uk/stopwords.txt"/>
+  <filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+</analyzer>
+----
+
+Note the lower case filter is applied _after_ the Morfologik stemmer; this is because the Ukrainian dictionary contains proper names, so term case may be important to resolve ambiguities (or even to look up the correct lemma at all).
+
+The Morfologik `dictionary` param value is a constant specifying which dictionary to choose. The dictionary resource must be named `path/to/_language_.dict` and have an associated `.info` metadata file. See http://morfologik.blogspot.com/[the Morfologik project] for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.