Posted to solr-user@lucene.apache.org by Dotan Cohen <do...@gmail.com> on 2013/05/24 09:03:54 UTC

Why would one not use RemoveDuplicatesTokenFilterFactory?

I am looking through the schema of a Solr installation that I
inherited last year. The original dev, who is unavailable for comment,
has two types of text fields: one with
RemoveDuplicatesTokenFilterFactory and one without. These fields are
intended for full-text search.

Why would someone _not_ use RemoveDuplicatesTokenFilterFactory on a
field intended for full-text search? What are the drawbacks to using
it? This application is very, very write-heavy (hundreds of writes per
minute), if that matters. It was running on websolr.com at the time;
I have since moved it to Amazon Web Services.

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

Posted by Dotan Cohen <do...@gmail.com>.

On Sun, May 26, 2013 at 8:16 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> The only comment I was trying to make here is the relationship between the
> RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory.
>
> No, stemmed terms are not considered the same text as the original word. By
> definition, they are a new value for the term text.
>
>

I see, for some reason I did not concentrate on this key quote of yours:
"...to remove the tokens that did not produce a stem ..."

Now it makes perfect sense.

Thank you, Jack!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

Posted by Jack Krupansky <ja...@basetechnology.com>.

The only comment I was trying to make here is the relationship between the 
RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory.

No, stemmed terms are not considered the same text as the original word. By 
definition, they are a new value for the term text.

-- Jack Krupansky


Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

Posted by Dotan Cohen <do...@gmail.com>.

On Fri, May 24, 2013 at 4:04 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> The primary purpose of this filter is in conjunction with the
> KeywordRepeatFilterFactory and a stemmer, to remove the tokens that did not
> produce a stem from the original token, so the keyword duplicate is no
> longer needed. The goal is to index both the stemmed and unstemmed terms at
> the same position.
>
> Whether your app is using the filter for that purpose remains to be seen.
>
> Removing duplicates from the raw input token stream would impact the term
> frequency.
>
> -- Jack Krupansky
>

Thank you Jack. I thought that the filter only removed tokens with
both identical position and identical text:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory

Are stemmed terms considered the same text as the original word, such
that they will show as a dupe for the
RemoveDuplicatesTokenFilterFactory? That seems odd.
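
To make the distinction in this exchange concrete, here is a minimal Python sketch. It is not Solr code: tokens are modelled as hypothetical (position, text) pairs, and both function names are invented for illustration. It contrasts a dedupe keyed on position and text, which is how the wiki page describes the filter, with a hypothetical text-only dedupe of the raw token stream, which is the case where term frequency would change.

```python
from collections import Counter
from typing import List, Tuple

# (position, text): a simplified stand-in for an analyzed token.
Token = Tuple[int, str]


def dedupe_same_position(tokens: List[Token]) -> List[Token]:
    """Drop a token only if the same text was already seen at the same position."""
    seen = set()
    kept = []
    for position, text in tokens:
        if (position, text) not in seen:
            seen.add((position, text))
            kept.append((position, text))
    return kept


def dedupe_text_only(tokens: List[Token]) -> List[Token]:
    """Hypothetical alternative: drop any repeated text, regardless of position."""
    seen = set()
    kept = []
    for position, text in tokens:
        if text not in seen:
            seen.add(text)
            kept.append((position, text))
    return kept


def term_frequency(tokens: List[Token]) -> Counter:
    """Count how many times each term text occurs in the token list."""
    return Counter(text for _, text in tokens)


if __name__ == "__main__":
    # "solr" occurs twice, at positions 1 and 3.
    tokens = [(1, "solr"), (2, "is"), (3, "solr")]
    print(term_frequency(dedupe_same_position(tokens)))
    print(term_frequency(dedupe_text_only(tokens)))
```

In this sketch the position-aware version leaves both occurrences of "solr" in place, so the count for that term is unchanged, while the text-only version collapses them into one.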

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

Posted by Jack Krupansky <ja...@basetechnology.com>.

The primary purpose of this filter is in conjunction with the 
KeywordRepeatFilterFactory and a stemmer, to remove the tokens that did not 
produce a stem from the original token, so the keyword duplicate is no 
longer needed. The goal is to index both the stemmed and unstemmed terms at 
the same position.

Whether your app is using the filter for that purpose remains to be seen.

Removing duplicates from the raw input token stream would impact the term 
frequency.

-- Jack Krupansky
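
The chain described above can be sketched in Python as a rough simulation. This is not Solr or Lucene code: the (position, text, is_keyword) token model, the function names, and the toy suffix-stripping stemmer are all assumptions made for illustration. The sketch repeats each word at the same position, stems only the unprotected copy, and then drops a copy whose text and position match one already seen, which happens exactly when the stemmer left the word unchanged.

```python
from typing import List, Tuple

# (position, text, is_keyword): a simplified stand-in for a Lucene token.
Token = Tuple[int, str, bool]


def keyword_repeat(words: List[str]) -> List[Token]:
    """Emit each word twice at the same position: a protected copy and a stemmable copy."""
    tokens: List[Token] = []
    for position, word in enumerate(words, start=1):
        tokens.append((position, word, True))   # keyword copy, skipped by the stemmer
        tokens.append((position, word, False))  # copy the stemmer may rewrite
    return tokens


def toy_stem(tokens: List[Token]) -> List[Token]:
    """Toy stemmer (an assumption, not a real algorithm): strip a trailing 'ing' or 's'."""
    stemmed: List[Token] = []
    for position, text, is_keyword in tokens:
        if not is_keyword:
            if text.endswith("ing") and len(text) > 5:
                text = text[:-3]
            elif text.endswith("s") and len(text) > 3:
                text = text[:-1]
        stemmed.append((position, text, is_keyword))
    return stemmed


def remove_duplicates(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Keep only the first token for each (position, text) pair."""
    seen = set()
    kept: List[Tuple[int, str]] = []
    for position, text, _ in tokens:
        if (position, text) not in seen:
            seen.add((position, text))
            kept.append((position, text))
    return kept


def analyze(words: List[str]) -> List[Tuple[int, str]]:
    """Run the three simulated stages in order."""
    return remove_duplicates(toy_stem(keyword_repeat(words)))


if __name__ == "__main__":
    for position, text in analyze(["dogs", "jumping", "fox"]):
        print(position, text)
```

With this toy stemmer, "dogs" and "jumping" each keep two entries at their position (original and stem), while "fox", which the stemmer does not change, keeps only one.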
