Posted to solr-user@lucene.apache.org by Scott Chu <sc...@udngroup.com> on 2015/10/22 10:53:09 UTC

Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Hi solr-user,

I always use CJKTokenizer on a fair amount of Chinese news articles. Say that in Chinese, character C1 has the same meaning as character C2 (e.g. 台 = 臺). Is it possible to add only this line to synonym.txt:

C1,C2 (in a real example: 台, 臺)

so that, by applying CJKTokenizer and SynonymFilter, I only have to query "C1Cm..." (where Cm is an arbitrary Chinese character) and Solr will return documents that match either "C1Cm" or "C2Cm"?

Scott Chu，scott.chu@udngroup.com
2015/10/22

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Posted by Erick Erickson <er...@gmail.com>.

Scott:

The Apache spam filters are quite aggressive and sometimes reject e-mails
that are formatted any way other than "plain text" so that may have been
what happened to your e-mails.

Best,
Erick

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Posted by Emir Arnautovic <em...@sematext.com>.

Hi Scott,
The replacement happens only in the indexed terms, not in the stored field, so
you are fine. The problem you mention arises when the replacement is done in
the raw text. Here it would be part of the analysis chain (both index and
query), so it has no effect on presentation (unless you are using the index
to reconstruct your text, which I assume you don't).

Thanks,
Emir
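
To illustrate the point about stored versus indexed content, here is a minimal schema sketch. The field name "content" and the type name "text_cjk" are hypothetical, and "text_cjk" is assumed to be a field type whose analyzer contains the character replacement.

```xml
<!-- Hypothetical field; "text_cjk" is assumed to be defined elsewhere in schema.xml
     with the char filter in its analyzer. -->
<!-- indexed="true": the analyzed (normalized) terms go into the index and are used for matching. -->
<!-- stored="true": the original text is kept verbatim and is what search results return. -->
<field name="content" type="text_cjk" indexed="true" stored="true"/>
```

With a definition like this, a document containing 台 should still come back with 台 in its stored value, even if the indexed terms were normalized to 臺.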

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Posted by Scott Chu <sc...@udngroup.com>.

Hi Emir,

Strangely, I replied to your email from home several times yesterday, but the replies never showed up on the solr-user mailing list. I don't know why, so I am replying again from the office. I hope this one shows up.

Thanks for your explanation. I'll look at PatternReplaceCharFilter as a workaround. (As far as I know, character filters process the input stream before the tokenizer, so in a sense the indexed data no longer contains the original C1 once I do the replacement.) What I deal with are published news articles, and I don't know how the authors would feel about seeing C1 in their articles become C2, since some terms containing C1 are proper nouns or technical terms. I'll talk to them to see if this is OK. Thanks anyway.

Scott Chu，scott.chu@udngroup.com
2015/10/23 

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Posted by Emir Arnautovic <em...@sematext.com>.

Hi Scott,
Using PatternReplaceCharFilter is not the same as replacing the raw data
(replacing the raw data is not a proper solution, as it does not solve the
issue of searching with the "other" character). This is part of token
standardization, no different from lowercasing. It is also the standard
approach for Latin characters:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

A quick search for "MappingCharFilterFactory chinese" shows that it is used,
so you should check whether it is suitable for your case.

Thanks,
Emir
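
As a rough sketch of how this could look for the 台/臺 case: the field type name "text_cjk_variant" and the mapping file name "mapping-variants.txt" are made up for illustration, and solr.CJKTokenizerFactory is assumed to be available in the Solr version in use (newer releases pair solr.StandardTokenizerFactory with solr.CJKBigramFilterFactory instead).

```xml
<!-- schema.xml: hypothetical field type; all names are illustrative only -->
<fieldType name="text_cjk_variant" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> element applies to both indexing and querying -->
  <analyzer>
    <!-- Runs before the tokenizer, so bigrams are built from normalized text.
         mapping-variants.txt (hypothetical file in the conf directory) holds one rule per line:
           "台" => "臺"
    -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-variants.txt"/>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because the same char filter runs at index time and at query time, a query containing either 台 or 臺 should produce the same bigrams as the indexed text.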

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Posted by Scott Chu <sc...@udngroup.com>.

Hi solr-user,

Yes, I thought about replacing C1 with C2 in the underlying raw data. However, it's a huge data set (over 10M news articles), so I gave up on that strategy earlier. My current temporary solution is to go back to a 1-gram tokenizer (i.e. StandardTokenizer) so that I only need to set one rule. But it is kind of ugly, especially when applying highlighting: e.g. searching for "C1C2", Solr returns a highlight snippet such as "...<em>C1</em><em>C2</em>...".

Scott Chu，scott.chu@udngroup.com
2015/10/22 
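
A sketch of what this 1-gram workaround might look like in schema.xml, assuming a field type named "text_cjk_unigram" (hypothetical) and a synonym.txt containing the single line 台,臺:

```xml
<!-- Hypothetical field type; the name is illustrative only -->
<fieldType name="text_cjk_unigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits each Han character as its own token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- synonym.txt is assumed to contain one line: 台,臺 -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonym.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Since every character is a separate token here, each one is matched and highlighted on its own, which is consistent with the adjacent <em> tags in the snippet above.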

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Posted by Emir Arnautovic <em...@sematext.com>.

Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on tokens,
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
not, then you can try configuring PatternReplaceCharFilter to replace C1
with C2 during both indexing and searching, and get a match that way.

Thanks,
Emir
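
For reference, a minimal sketch of the PatternReplaceCharFilter option with the 台/臺 pair from the question. CJKTokenizer emits overlapping two-character tokens for runs of CJK text, so a one-character entry in synonym.txt would generally have nothing to match; the char filter avoids this by normalizing the text before tokenization. The solr.CJKTokenizerFactory class is assumed to be available in the Solr version in use.

```xml
<analyzer>
  <!-- Replace every 台 with 臺 in the character stream before tokenizing.
       With a single <analyzer> element, this applies at both index and query time. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="台" replacement="臺"/>
  <!-- Bigrams are then built from the normalized text -->
  <tokenizer class="solr.CJKTokenizerFactory"/>
</analyzer>
```

The Analysis screen in the Solr admin UI can be used to check which tokens a given input actually produces with this chain.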
