Posted to java-user@lucene.apache.org by Shivashankar Maddanimath <sh...@yahoo.in> on 2015/01/28 08:00:45 UTC
Can we configure analyzers to not exclude specific characters
Hi,
I am using Lucene's standard analyzer and UAX29URLEmailTokenizer. These drop some characters, such as "+", during tokenization (so I can't search for "C++"). Is there any way to configure the analyzers to keep specific characters while tokenizing?
Regards,
Shiv
-----Original Message-----
From: "Luis A Lastras" <la...@us.ibm.com>
Sent: 25-01-2015 08:05 AM
To: "java-user@lucene.apache.org" <ja...@lucene.apache.org>
Subject: Absolute term position in scoring
Is it possible to incorporate into Lucene's scoring function the position of a matching term (say, as measured from the top of the document)? The scenario is: if the documents in a collection tend to talk about the most important things at the beginning, then we would like to give preference to documents that mention a term close to the top.
Thanks,
Luis
Luis A Lastras, Ph.D.
Research Staff Member & Manager, Concept Analytics, IBM Watson
Member of the IBM Academy of Technology
IBM Master Inventor
email: lastrasl@us.ibm.com | Tel: 914-945-3613 | Cell: 914-382-1879
address: 1101 Kitchawan Rd, Office 28-132, Yorktown Heights, NY, 10598
Re: Can we configure analyzers to not exclude specific characters
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
To do that you need to create multiple filters, one for each pattern.
-Mike
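The "multiple filters" idea can be sketched with the standard library alone. This is a minimal illustration, not the Lucene API: patternReplace is a hypothetical stand-in for a single pattern-replace char filter, and the second rule (for "C#") is an assumed example. In Lucene terms, each patternReplace call would correspond to a separate PatternReplaceCharFilterFactory built from its own argument map, with the Reader returned by one create() call passed to the next create() call before the tokenizer is constructed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

class ChainedCharFilters {
    // Drain a Reader into a String.
    static String readAll(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in for ONE char filter: wraps a Reader and
    // returns a new Reader over the rewritten text.
    static Reader patternReplace(Reader in, String pattern, String replacement) {
        String rewritten = Pattern.compile(pattern).matcher(readAll(in)).replaceAll(replacement);
        return new StringReader(rewritten);
    }

    // One filter per rule; each filter wraps the Reader produced by the previous one.
    static String normalize(String text) {
        Reader r = new StringReader(text);
        r = patternReplace(r, "([cC])\\+\\+", "CPlusPlus"); // rule from this thread
        r = patternReplace(r, "([cC])#", "CSharp");         // assumed second rule
        return readAll(r);
    }

    public static void main(String[] args) {
        System.out.println(normalize("I use C++ and c#")); // I use CPlusPlus and CSharp
    }
}
```

The Reader coming out of the last filter is what would be handed to the tokenizer.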
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Can we configure analyzers to not exclude specific characters
Posted by Shivashankar Maddanimath <sh...@yahoo.in>.
Thanks Michael,
I am using the Lucene library directly, so below is how I applied your suggestion. It works, but if I add multiple patterns and replacements it stops working: only the last entry is picked up. Is there any way to add multiple patterns and replacements to PatternReplaceCharFilterFactory?

TokenStream ts;
// Arguments for one char filter: a single pattern/replacement pair.
Map<String, String> ruleExplained = new HashMap<>();
ruleExplained.put("pattern", "([cC])\\+\\+");
ruleExplained.put("replacement", "CPlusPlus");
PatternReplaceCharFilterFactory myRules = new PatternReplaceCharFilterFactory(ruleExplained);
// Wrap the input file (TestFile, defined elsewhere) so the replacement runs before tokenization.
Reader myreader = myRules.create(new BufferedReader(new InputStreamReader(new FileInputStream(TestFile), StandardCharsets.UTF_8)));
ts = new UAX29URLEmailTokenizer(Version.LUCENE_48, myreader);

Regards,
Shiv
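A likely reason that only the last entry is picked up, assuming the extra rules were added to the same map under the same "pattern" and "replacement" keys: a Map keeps one value per key, so each later put() replaces the earlier value before the factory ever sees it. The sketch below shows this with the standard library only; the second rule is an assumed example.

```java
import java.util.HashMap;
import java.util.Map;

class LastEntryWins {
    // Build an argument map the way the message above does, then add a second rule.
    static Map<String, String> buildArgs() {
        Map<String, String> args = new HashMap<>();
        args.put("pattern", "([cC])\\+\\+");
        args.put("replacement", "CPlusPlus");
        // Same keys again: these calls overwrite the values stored above.
        args.put("pattern", "([cC])#");
        args.put("replacement", "CSharp");
        return args;
    }

    public static void main(String[] args) {
        Map<String, String> m = buildArgs();
        System.out.println(m.size());             // 2
        System.out.println(m.get("pattern"));     // ([cC])#
        System.out.println(m.get("replacement")); // CSharp
    }
}
```

The map ends up describing a single rule, which is why one map (and one factory) per pattern is needed.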
Re: Can we configure analyzers to not exclude specific characters
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
It's a bit of a hack, but we do this:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([A-Za-z])\+\+" replacement="$1plusplus" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([A-Za-z])\#" replacement="$1sharp" />
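To see what these two patterns do to the text before it reaches the tokenizer, here is a small sketch using java.util.regex with the same expressions (the class and method names are hypothetical, and the sample strings are made up). The group ([A-Za-z]) captures the letter before "++" or "#", and $1 puts it back in front of the replacement word.

```java
import java.util.regex.Pattern;

class PlusPlusPattern {
    private static final Pattern PLUS_PLUS = Pattern.compile("([A-Za-z])\\+\\+");
    private static final Pattern SHARP = Pattern.compile("([A-Za-z])#");

    // Apply both rewrites in the order the charFilter elements are listed.
    static String rewrite(String text) {
        String s = PLUS_PLUS.matcher(text).replaceAll("$1plusplus");
        return SHARP.matcher(s).replaceAll("$1sharp");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"C++", "c++ and C#", "i = 1 + 2", "x++"}) {
            System.out.println(s + " -> " + rewrite(s));
        }
    }
}
```

Any letter followed by "++" or "#" is rewritten, not only "C" (so "x++" becomes "xplusplus"), while a lone "+" is left for the tokenizer to drop. Queries need to pass through the same char filters so that a search for "C++" is rewritten the same way as the indexed text.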