Posted to solr-user@lucene.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2018/03/12 16:51:59 UTC

Defining a phonetic analyzer and searcher via the schema API

All,

I'd like to add a new synthesized field that uses a phonetic analyzer
such as Beider-Morse. I'm using Solr 7.2.

When I request the current schema via the schema API, I get a list of
existing fields, dynamic fields, and analyzers, none of which appear
to be what I'm looking for.

Conceptually, I think I'd like to do something like this:

add-field: { name: phoneticname, type: phonetic, multiValued: true }

... but how do I define what type of data "phonetic" should be?

I can see the example XML definition in this document:
https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#FilterDescriptions-Beider-MorseFilter

But I'm not sure how to add an analyzer to the schema using the schema
API: https://lucene.apache.org/solr/guide/7_2/schema-api.html

Under "Add a new field type", it says that new analyzers can be
defined, but I'm not entirely sure how to do that ... the API docs
refer to the field type definitions page[1] which just shows what XML
you'd have to put into your schema XML -- which you aren't supposed to
edit directly.

When looking at the JSON version of my schema, I can see for example this:

    "fieldTypes":[{
        "name":"ancestor_path",
        "class":"solr.TextField",
        "indexAnalyzer":{
          "tokenizer":{
            "class":"solr.KeywordTokenizerFactory"}},
        "queryAnalyzer":{
          "tokenizer":{
            "class":"solr.PathHierarchyTokenizerFactory",
            "delimiter":"/"}}},

So should I create a new field type like this?

"add-field-type" : {
  "name" : "phonetic",
  "class" : "solr.TextField",

  "analyzer" : {
    "tokenizer": { "class" : "solr.StandardTokenizerFactory" },

    "filters" : [{
      "class": "solr.BeiderMorseFilterFactory",
      "nameType": "GENERIC",
      "ruleType": "APPROX",
      "concat": "true",
      "languageSet": "auto"
    }]
  }
}
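
For readers who script their schema changes, here is a minimal Python sketch of the same add-field-type command. It only builds the request; nothing is sent. The base URL, the core name "mycore", and the helper names are assumptions made for illustration, not something taken from the thread.

```python
import json


def phonetic_field_type(name="phonetic"):
    """Build an add-field-type command mirroring the JSON shown above."""
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{
                    "class": "solr.BeiderMorseFilterFactory",
                    "nameType": "GENERIC",
                    "ruleType": "APPROX",
                    "concat": "true",
                    "languageSet": "auto",
                }],
            },
        }
    }


def schema_request(base_url, collection, command):
    """Return (url, headers, body) for a Schema API POST; sends nothing."""
    url = f"{base_url.rstrip('/')}/{collection}/schema"
    headers = {"Content-type": "application/json"}
    return url, headers, json.dumps(command).encode("utf-8")


if __name__ == "__main__":
    # "mycore" and the localhost URL are hypothetical placeholders.
    url, headers, body = schema_request(
        "http://localhost:8983/solr", "mycore", phonetic_field_type())
    print(url)
    print(body.decode("utf-8"))
```

To actually apply it, the tuple can be handed to urllib.request.Request(url, data=body, headers=headers) and opened with urllib.request.urlopen; try that against a scratch core first.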

Then, use copy-field as "usual":

  "add-field":{
     "name":"phonetic",
     "type":"phonetic",
     "multiValued":true,
     "stored":false },

  "add-copy-field":{
     "source":"first_name",
     "dest":"phonetic" },

  "add-copy-field":{
     "source":"last_name",
     "dest":"phonetic" },
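
A sketch of the same field and copy-field commands built in Python follows. A Python dict cannot hold two "add-copy-field" keys, so the repeated command is expressed as a list; as far as I know the Schema API accepts an array for a repeated command, but treat that as an assumption to confirm against the 7.2 documentation. The function name and defaults are hypothetical.

```python
import json


def phonetic_field_commands(dest="phonetic",
                            sources=("first_name", "last_name")):
    """Build add-field plus one add-copy-field entry per source field."""
    return {
        "add-field": {
            "name": dest,
            "type": "phonetic",
            "multiValued": True,
            "stored": False,
        },
        # One list entry per source instead of repeating the JSON key.
        "add-copy-field": [{"source": s, "dest": dest} for s in sources],
    }


body = json.dumps(phonetic_field_commands(), indent=2)
print(body)
```

Serializing through json.dumps also guarantees that every key is quoted and that booleans come out as true/false.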

This seems to work but I wanted to know if I was doing it the right way.

Thanks,
-chris

[1]
https://lucene.apache.org/solr/guide/7_2/field-type-definitions-and-properties.html#field-type-definitions-and-properties

Re: Defining a phonetic analyzer and searcher via the schema API

Posted by Erick Erickson <er...@gmail.com>.

Chris:

LGTM, except maybe ;).....

You'll want to look closely at your admin UI/Analysis page for the
field (or fieldType) once it's defined. Uncheck the "verbose" box the
first time you look; it'll be less confusing. That'll show you
_exactly_ what the results are and whether they match your
expectations. "Right" is such an existential question after all...

When you're using that page, think outside the box. For instance, I
can't say offhand whether the phonetic filter you chose gives
different results when words are capitalized or not. What about when
they have numbers? Put some punctuation in. Try an e-mail address.
Etc. etc. etc.

For instance, if you swap out StandardTokenizer for
WhitespaceTokenizer, you'll now have punctuation in the mix. Most
people don't notice if they have WordDelimiterGraphFilterFactory in
the analysis chain too....
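
The same kind of probing can be scripted. The sketch below only builds request URLs for Solr's field analysis handler, one per awkward input (mixed case, digits, punctuation, an e-mail address). It assumes the /analysis/field handler is enabled on your core; the core name "mycore", the sample values, and the helper name are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical inputs covering case, digits, punctuation and an e-mail address.
SAMPLES = ["Schultz", "schultz", "SCHULTZ", "Schultz2", "O'Brien",
           "chris@example.com"]


def analysis_url(base_url, collection, field_type, value):
    """Build a field-analysis URL for one sample value; sends nothing."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{collection}/analysis/field?{params}"


urls = [analysis_url("http://localhost:8983/solr", "mycore", "phonetic", s)
        for s in SAMPLES]
for u in urls:
    print(u)
```

Fetching each URL and comparing the final tokens side by side shows quickly whether capitalization or punctuation changes the phonetic output.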

bq: Actually, I have the script that builds the schema in VCS, so it's
roughly the same.

We're on the same page here. I don't particularly care how the schema
gets saved, as long as I can back up to the last known good schema and
start over....

I'll mention in passing that there's no problem whatsoever with using
the "classic" schema. The managed stuff is cool, and enables spiffy
front-ends etc. Personally I'm comfortable enough with hand-editing
the schemas that I find it faster so I usually use it.

BTW, bin/solr has a set of commands that allow you to
upload/download configs; try "bin/solr zk -help".....

Walter:

"I don't usually test my code, but when I do it's in production".

These young whipper-snappers don't appreciate how _very_ many ways
things can go wrong ;)

My tongue-in-cheek way to distinguish novice from "veteran" programmers:

Novice: The code compiles and she's surprised when it doesn't work the
first time.

Veteran: The code ran perfectly the first time. She immediately goes
over it with a fine-tooth comb to see whether it's still running
canned test cases.

Best,
Erick



Re: Defining a phonetic analyzer and searcher via the schema API

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Erick,

On 3/12/18 1:00 PM, Erick Erickson wrote:
> bq: which you aren't supposed to edit directly.
> 
> Well, kind of. Here's why it's "discouraged": 
> https://lucene.apache.org/solr/guide/6_6/schema-api.html.
> 
> But as long as you don't mix-and-match hand-editing with using the 
> schema API you can hand edit it freely. You're then in charge of 
> pushing it to ZK and reloading your collections that use it
> yourself however.

No Zookeeper (yet), but I suspect I'll end up there. I'm mostly
toying-around with it right now, but it won't be long before I'll want
to go live with it and having a single Solr instance isn't going to
help me sleep well at night. I'm sure I'll end up with two instances
to begin with, which requires ZK, right?

> As a side note, even if I _never_ hand-edited it I'd make it a 
> practice to regularly pull it from ZK and put it in some VCS system
> ;)

Actually, I have the script that builds the schema in VCS, so it's
roughly the same.

As for the schema modifications... did I get those right?

Thanks,
-chris


Re: Defining a phonetic analyzer and searcher via the schema API

Posted by Walter Underwood <wu...@wunderwood.org>.

People can discourage that, but we only use hand-edited schema and solrconfig files. Those are checked into version control. I wrote some Python to load them into Zookeeper and reload the cluster.

This allows us to use the same configs in dev, test, and prod. We can actually test things before putting them in prod.
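
This is not the script mentioned above, only a hypothetical sketch of what such a deploy helper might assemble: the bin/solr command line that uploads a config directory to ZooKeeper, and the Collections API URL that reloads a collection. The -z/-n/-d flags are given as I understand them for Solr 7 and should be confirmed with "bin/solr zk -help"; all hosts, names, and paths are placeholders.

```python
from urllib.parse import urlencode


def upconfig_command(zk_host, config_name, config_dir, solr_bin="bin/solr"):
    """Argument list for uploading a config directory to ZooKeeper."""
    return [solr_bin, "zk", "upconfig",
            "-z", zk_host, "-n", config_name, "-d", config_dir]


def reload_url(base_url, collection):
    """Collections API URL that reloads one collection."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


if __name__ == "__main__":
    # Placeholder values; point them at dev, test or prod as appropriate.
    print(" ".join(upconfig_command("localhost:2181", "people", "./conf")))
    print(reload_url("http://localhost:8983/solr", "people"))
```

The argument list can be passed to subprocess.run(..., check=True) and the URL fetched afterwards, so the same version-controlled files reach every environment the same way.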

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Defining a phonetic analyzer and searcher via the schema API

Posted by Erick Erickson <er...@gmail.com>.

bq: which you aren't supposed to edit directly.

Well, kind of. Here's why it's "discouraged":
https://lucene.apache.org/solr/guide/6_6/schema-api.html.

But as long as you don't mix-and-match hand-editing with using the
schema API, you can hand-edit it freely. You're then responsible for
pushing it to ZK and for reloading the collections that use it
yourself, however.

As a side note, even if I _never_ hand-edited it I'd make it a
practice to regularly pull it from ZK and put it in some VCS system ;)

Best,
Erick
