Posted to solr-user@lucene.apache.org by Derek Poh <dp...@globalsources.com> on 2016/10/12 07:14:14 UTC

Re: Split words with period in between ("Co.Ltd") into separate tokens

I tried adding the Word Delimiter Filter to the field, but it either does 
not process the term "Co.Ltd" or drops it entirely.

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" 
generateNumberParts="0" catenateWords="0" catenateNumbers="0" 
catenateAll="0" splitOnCaseChange="0"/>

On 10/12/2016 8:54 AM, Derek Poh wrote:
> Hi
>
> How can I split words with a period in between into separate tokens?
> E.g. "Co.Ltd" => "Co" "Ltd".
>
> I am using StandardTokenizerFactory, and it does not split on such 
> periods: periods (dots) that are not followed by whitespace are kept as 
> part of the token, including Internet domain names.
>
> This is the field definition,
>
> <fieldType name="text_general" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>         <filter class="solr.SynonymFilterFactory" 
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
> </fieldType>
>
> Solr version is 10.4.10.
>
> Derek
>
> ----------------------
> CONFIDENTIALITY NOTICE
> This e-mail (including any attachments) may contain confidential 
> and/or privileged information. If you are not the intended recipient 
> or have received this e-mail in error, please inform the sender 
> immediately and delete this e-mail (including any attachments) from 
> your computer, and you must not use, disclose to anyone else or copy 
> this e-mail (including any attachments), whether in whole or in part.
> This e-mail and any reply to it may be monitored for security, legal, 
> regulatory compliance and/or other appropriate reasons.

Re: Split words with period in between ("Co.Ltd") into separate tokens

Posted by Derek Poh <dp...@globalsources.com>.

Thank you for pointing out the flags.
I set generateWordParts=1 and the term is split up.
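
A minimal sketch of the change described above, assuming the remaining attributes stay exactly as in the earlier definition and only generateWordParts is flipped. With every generate/catenate flag at 0 (and preserveOriginal not enabled), the filter has no sub-tokens to emit, which is likely why the term disappeared before.

```xml
<!-- Sketch only: attributes other than generateWordParts are copied from the
     earlier definition in this thread. With generateWordParts="1" the filter
     emits the word parts on either side of the delimiter, so "Co.Ltd" is
     expected to become the two tokens "Co" and "Ltd". -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="0" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0"/>
```

The Analysis screen in the Solr admin UI is a convenient way to check what each stage of the chain produces for "Co.Ltd".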

On 10/12/2016 3:26 PM, Modassar Ather wrote:
> Hi,
>
> All the flags in your WordDelimiterFilterFactory definition are set to 0.
> Try generateWordParts=1 and splitOnCaseChange=1 and see whether the term
> is split as you require.
> You can also try enabling the other available flags.
>
> Best,
> Modassar
>

Re: Split words with period in between ("Co.Ltd") into separate tokens

Posted by Modassar Ather <mo...@gmail.com>.

Hi,

All the flags in your WordDelimiterFilterFactory definition are set to 0.
Try generateWordParts=1 and splitOnCaseChange=1 and see whether the term
is split as you require.
You can also try enabling the other available flags.
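
As a rough sketch of how this could look in the field type from the original question: the placement of the filter is an assumption (the thread does not say where it was added). Here it sits directly after the tokenizer and before LowerCaseFilterFactory, so that splitOnCaseChange still sees the original casing, and it is added to both the index and query analyzers so that both sides produce the same tokens.

```xml
<!-- Sketch only: filter position and its use in both analyzers are assumptions.
     Flag values follow the suggestion above; all other attributes are
     copied from the field definition quoted in this thread. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Splits tokens such as "Co.Ltd" into their word parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Same settings as the index analyzer so query terms split the same way -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the index-time analysis changes, existing documents would need to be reindexed before the split tokens become searchable.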

Best,
Modassar
