Posted to solr-user@lucene.apache.org by Shalin Shekhar Mangar <sh...@gmail.com> on 2009/10/05 11:46:54 UTC

Re: Question about PatternReplace filter and automatic Synonym generation

On Fri, Oct 2, 2009 at 11:31 PM, Prasanna Ranganathan <
pranganathan@netflix.com> wrote:

>
>  Does the PatternReplaceFilter have an option to keep the original token in
> addition to the modified token? From what I have looked at, it does not seem
> to, but I want to confirm.
>
>
No, it does not.


> Alternatively, is there a filter available which takes in a pattern and
> produces additional forms of the token depending on the pattern? The use
> case I am looking at here is using such a filter to automate synonym
> generation. In our application, quite a few of the synonym file entries
> match a specific pattern and having such a filter would make it easier I
> believe. Please do correct me in case I am missing some unwanted side-effect
> with this approach.
>
>
I do not understand this. TokenFilters are used for things like stemming,
replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
additional tokens (synonyms) from a file for each token.
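To make the chain idea concrete, here is a rough, untested sketch of how a
tokenizer and a filter are wired together in plain Lucene (FilterChainDemo is
just an illustrative name, and the exact packages and constructors move around
between Lucene versions, so treat the class references as approximate):

import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative sketch only: shows a Tokenizer wrapped by one TokenFilter.
public class FilterChainDemo {
  public static void main(String[] args) throws Exception {
    // The tokenizer produces the raw token stream from the input text...
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("Bring It! Home"));

    // ...and each TokenFilter wraps the stream below it, rewriting tokens one at a time.
    TokenStream chain = new PatternReplaceFilter(tokenizer, Pattern.compile("!$"), "", true);

    CharTermAttribute term = chain.addAttribute(CharTermAttribute.class);
    chain.reset();
    while (chain.incrementToken()) {
      System.out.println(term.toString());   // prints: Bring, It, Home (trailing '!' stripped)
    }
    chain.end();
    chain.close();
  }
}

The <tokenizer/> and <filter/> elements in a Solr fieldType are just
declarative shorthand for building this kind of chain.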

What exactly are you trying to do with synonyms? I guess you could do
stemming etc with synonyms but why do you want to do that?


> Continuing on that line, what is the performance hit of having additional
> index-time filters as opposed to using a synonym file with more entries? How
> does the overhead of a bigger synonym file compare with that of the
> additional filters?
>
>
Note that a change in the synonym file requires re-indexing the affected
documents. Also, the synonym map is kept in memory.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Question about PatternReplace filter and automatic Synonym generation

Posted by Chris Hostetter <ho...@fucit.org>.
:  There is a solr.PatternTokenizerFactory class which likely fits the bill in
: this case. The related question I have is this - is it possible to have
: multiple Tokenizers in your analysis chain?

No ... Tokenizers consume CharReaders and produce a TokenStream ... what's
needed here is a TokenFilter that consumes a TokenStream and produces a
TokenStream.
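For illustration, the skeleton of a TokenFilter is roughly the following. This
is an untested sketch against the attribute-based API (in Lucene 2.9 you would
read the term via TermAttribute rather than CharTermAttribute), and
PassThroughFilter is just a made-up name for the example:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Sketch of a pass-through filter: consumes the wrapped TokenStream and re-emits each token. */
public final class PassThroughFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PassThroughFilter(TokenStream input) {
    super(input);                     // the TokenStream being consumed
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;                   // upstream stream is exhausted
    }
    // a real filter would inspect or modify termAtt here before returning
    return true;
  }
}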





-Hoss


Re: Question about PatternReplace filter and automatic Synonym generation

Posted by Prasanna Ranganathan <pr...@netflix.com>.

On 10/6/09 3:32 PM, "Chris Hostetter" <ho...@fucit.org> wrote:

> 
> :  I'll try to explain with an example. Given the term 'it!' in the title, it
> : should match both 'it' and 'it!' in the query as an exact match. Currently,
> : this is done by using a synonym entry (and an index-time SynonymFilter) as
> : follows:
> : 
> :  it! => it, it!
> : 
> :  Now, the above holds true for all cases where you have a title token of the
> : form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
> : manually for each case, which is not easy to manage and does not scale.
> : 
> :  I am hoping to do the same by using an index-time filter that takes in a
> : pattern like the PatternReplace filter and adds the newly created token
> : instead of replacing the original one. Does this make sense? Am I missing
> : something that would break this approach?
> 
> something like this would be fairly easy to implement in Lucene, but
> somewhat confusing to try and configure in Solr.  I was going to suggest
> that you use something like...
>  <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="(^.*?)(\!?$)" replacement="$1 $2" replace="all" />
> 
> ...and then have a subsequent filter that splits the tokens on the
> whitespace (or any other special character you could use in the
> replacement) ... but apparently we don't have any built-in filters that
> will just split tokens on a character/pattern for you.  That would also be
> fairly easy to write if someone wants to submit a patch.

 There is a solr.PatternTokenizerFactory class which likely fits the bill in
this case. The related question I have is this - is it possible to have
multiple Tokenizers in your analysis chain?

Prasanna.


Re: Question about PatternReplace filter and automatic Synonym generation

Posted by Chris Hostetter <ho...@fucit.org>.
:  I'll try to explain with an example. Given the term 'it!' in the title, it
: should match both 'it' and 'it!' in the query as an exact match. Currently,
: this is done by using a synonym entry (and an index-time SynonymFilter) as
: follows:
: 
:  it! => it, it!
: 
:  Now, the above holds true for all cases where you have a title token of the
: form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
: manually for each case, which is not easy to manage and does not scale.
: 
:  I am hoping to do the same by using an index-time filter that takes in a
: pattern like the PatternReplace filter and adds the newly created token
: instead of replacing the original one. Does this make sense? Am I missing
: something that would break this approach?

something like this would be fairly easy to implement in Lucene, but 
somewhat confusing to try and configure in Solr.  I was going to suggest 
that you use something like...
 <filter class="solr.PatternReplaceFilterFactory"
                pattern="(^.*?)(\!?$)" replacement="$1 $2" replace="all" />

...and then have a subsequent filter that splits the tokens on the
whitespace (or any other special character you could use in the
replacement) ... but apparently we don't have any built-in filters that
will just split tokens on a character/pattern for you.  That would also be
fairly easy to write if someone wants to submit a patch.
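If someone does want to take a crack at it, here is a rough, untested sketch of
such a filter. "PatternSynonymFilter" is a hypothetical name (no such class
ships with Lucene or Solr), and it is written against the attribute-based API:
for every token that matches the pattern, emit the original token unchanged and
then inject the replaced form as an extra token at the same position, the way
SynonymFilter stacks synonyms.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/** Hypothetical sketch: keeps the original token and injects a pattern-derived variant. */
public final class PatternSynonymFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final Pattern pattern;
  private final String replacement;

  private String pendingVariant;            // variant waiting to be emitted
  private AttributeSource.State savedState; // attribute state of the original token

  public PatternSynonymFilter(TokenStream input, Pattern pattern, String replacement) {
    super(input);
    this.pattern = pattern;
    this.replacement = replacement;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingVariant != null) {
      restoreState(savedState);               // copy offsets, type, etc. from the original
      termAtt.setEmpty().append(pendingVariant);
      posIncAtt.setPositionIncrement(0);      // stack on the same position as the original
      pendingVariant = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;                           // upstream stream exhausted
    }
    Matcher m = pattern.matcher(termAtt);     // CharTermAttribute is a CharSequence
    if (m.matches()) {
      String variant = m.replaceAll(replacement);
      if (!variant.isEmpty() && !variant.equals(termAtt.toString())) {
        pendingVariant = variant;             // emit right after the original token
        savedState = captureState();
      }
    }
    return true;                              // original token goes out unchanged
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingVariant = null;
    savedState = null;
  }
}

A factory around it would expose the pattern and replacement, so that something
like pattern="(.*)!" replacement="$1" would index 'it!' as both 'it!' and 'it'
without touching the synonyms file.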


-Hoss


Re: Question about PatternReplace filter and automatic Synonym generation

Posted by Prasanna Ranganathan <pr...@netflix.com>.
On 10/5/09 8:59 PM, "Christian Zambrano" <cz...@gmail.com> wrote:

> 
> Wouldn't it be better to use built-in token filters at both index and
> query that will convert 'it!' to just 'it'? I believe the
> WordDelimiterFilterFactory will do that for you.
> 

 We do have a field that uses the WordDelimiterFilter, but it also uses a
stemmer and a stop-word filter. That field is used for a stemmed match with a
nominal boost. However, the field I am talking about is for an exact match
(only lowercase and synonym filters) with a higher boost than the field with
the WordDelimiterFilter.

Prasanna.


Re: Question about PatternReplace filter and automatic Synonym generation

Posted by Christian Zambrano <cz...@gmail.com>.
Prasanna,

Wouldn't it be better to use built-in token filters at both index and  
query that will convert 'it!' to just 'it'? I believe the  
WordDelimiterFilterFactory will do that for you.

Christian

On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan <pranganathan@netflix.com> wrote:

>
>
>
> On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" <sh...@gmail.com> wrote:
>
>>> Alternatively, is there a filter available which takes in a pattern and
>>> produces additional forms of the token depending on the pattern? The use
>>> case I am looking at here is using such a filter to automate synonym
>>> generation. In our application, quite a few of the synonym file entries
>>> match a specific pattern and having such a filter would make it easier I
>>> believe. Please do correct me in case I am missing some unwanted
>>> side-effect with this approach.
>>>
>>>
>> I do not understand this. TokenFilters are used for things like stemming,
>> replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
>> additional tokens (synonyms) from a file for each token.
>>
>> What exactly are you trying to do with synonyms? I guess you could do
>> stemming etc with synonyms but why do you want to do that?
>
> I'll try to explain with an example. Given the term 'it!' in the title, it
> should match both 'it' and 'it!' in the query as an exact match. Currently,
> this is done by using a synonym entry (and an index-time SynonymFilter) as
> follows:
>
> it! => it, it!
>
> Now, the above holds true for all cases where you have a title token of the
> form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
> manually for each case, which is not easy to manage and does not scale.
>
> I am hoping to do the same by using an index-time filter that takes in a
> pattern like the PatternReplace filter and adds the newly created token
> instead of replacing the original one. Does this make sense? Am I missing
> something that would break this approach?
>
>>
>> Note that a change in the synonym file requires re-indexing the affected
>> documents. Also, the synonym map is kept in memory.
>
> What is the overhead incurred in having an additional filter applied during
> indexing? Is it strictly CPU only?
>
> Thanks a lot for your valuable input.
>
> Regards,
>
> Prasanna.
>

Re: Question about PatternReplace filter and automatic Synonym generation

Posted by Prasanna Ranganathan <pr...@netflix.com>.


On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" <sh...@gmail.com> wrote:

>> Alternatively, is there a filter available which takes in a pattern and
>> produces additional forms of the token depending on the pattern? The use
>> case I am looking at here is using such a filter to automate synonym
>> generation. In our application, quite a few of the synonym file entries
>> match a specific pattern and having such a filter would make it easier I
>> believe. Please do correct me in case I am missing some unwanted side-effect
>> with this approach.
>> 
>> 
> I do not understand this. TokenFilters are used for things like stemming,
> replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
> additional tokens (synonyms) from a file for each token.
> 
> What exactly are you trying to do with synonyms? I guess you could do
> stemming etc with synonyms but why do you want to do that?
 
 I'll try to explain with an example. Given the term 'it!' in the title, it
should match both 'it' and 'it!' in the query as an exact match. Currently,
this is done by using a synonym entry (and an index-time SynonymFilter) as
follows:

 it! => it, it!

 Now, the above holds true for all cases where you have a title token of the
form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
manually for each case, which is not easy to manage and does not scale.

 I am hoping to do the same by using an index-time filter that takes in a
pattern like the PatternReplace filter and adds the newly created token
instead of replacing the original one. Does this make sense? Am I missing
something that would break this approach?

> 
> Note that a change in the synonym file requires re-indexing the affected
> documents. Also, the synonym map is kept in memory.

 What is the overhead incurred in having an additional filter applied during
indexing? Is it strictly CPU only?

 Thanks a lot for your valuable input.

Regards,

Prasanna.