Posted to solr-user@lucene.apache.org by JCodina <jo...@barcelonamedia.org> on 2009/07/01 10:44:45 UTC

Re: facets and stopwords

Sorry, I was too cryptic.

If you follow this link

http://projecte01.development.barcelonamedia.org/fonetic/
you will see a "Top Words" list (in Spanish and stemmed). In that list there
is the word "si", which appears in 20649 documents.
If you click on this word, the system will perform the query
      (x) content:si, which returns no results at all.
The same goes for "la": it appears in 17881 documents, but the query
"content:la" returns no results at all.

The facet list is generated by the query
http://projecte01.development.barcelonamedia.org/solr/select/?&rows=0&start=0&q=*:*&facet=true&facet.limit=-1&facet.field=content&facet.field=entities_misc&wt=json&json.wrf=jsonp1246437157825&jsoncallback=jsonp1246437157825&_=1246437158023
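
For reference, the same facet request can be assembled without the JSONP
callback arguments. This is only a sketch in Python, assuming nothing beyond
the parameters visible in the URL above; the helper name pair_facets is
hypothetical, and the sample list simply reuses the two counts quoted earlier.
By default Solr's JSON writer returns each facet field as a flat list of
alternating terms and counts, which the helper pairs up:

```python
from urllib.parse import urlencode

# Base URL taken from the query above.
SELECT_URL = "http://projecte01.development.barcelonamedia.org/solr/select/"

# Same parameters as above, minus the JSONP callback arguments.
params = [
    ("q", "*:*"),
    ("rows", "0"),          # no documents, only facet counts
    ("facet", "true"),
    ("facet.limit", "-1"),  # -1 asks for every term of the field
    ("facet.field", "content"),
    ("facet.field", "entities_misc"),
    ("wt", "json"),
]
facet_url = SELECT_URL + "?" + urlencode(params)


def pair_facets(flat):
    """Turn [term, count, term, count, ...] into [(term, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Illustrative input built from the two counts quoted in this message.
sample = ["si", 20649, "la", 17881]
print(facet_url)
print(pair_facets(sample))
```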

But the question is: why are these two words (among others) there if they are
stop words?

To see what's going on in the index, I have tested with the analyzer
http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp

If I select the field content and enter the text

"las cosas que si no pasan la proxima vez si que no veràs"

I get the following tokens at the end of the analyzer:

las cosa pasan proxima vez sí verà

where que, si, no and la are removed because they are treated as stop words.

But in the schema browser
http://projecte01.development.barcelonamedia.org/solr/admin/schema.jsp
for the field content, "que" is the 3rd term, "no" is the 4th, and "si" and
"la" are among the top 40 terms.

The analyzer for the content field can be seen on this page and has the
following components:


Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

   1. org.apache.solr.analysis.StopFilterFactory
      args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
   2. org.apache.solr.analysis.WordDelimiterFilterFactory
      args:{catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1 catenateAll: 0 generateNumberParts: 1 generateWordParts: 1 }
   3. org.apache.solr.analysis.LowerCaseFilterFactory
      args:{}
   4. org.apache.solr.analysis.SnowballPorterFilterFactory
      args:{languange: Spanish }
   5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
      args:{}

The field is indexed, tokenized and stored, and term vectors are stored.

So, why are the stopwords in the index?





-- 
View this message in context: http://www.nabble.com/facets-and-stopwords-tp23952823p24286283.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: facets and stopwords

Posted by JCodina <jo...@barcelonamedia.org>.


hossman wrote:
> 
> 
> But are you sure that example would actually cause a problem?
> I suspect that if you index that exact sentence as is, you wouldn't see the 
> facet count for "si" or "que" increase at all.
> 
> If you do a query for "{!raw field=content}que" you bypass the query 
> parser (which respects your stopwords file) and see all docs that 
> contain the raw term "que" in the content field.
> 
> If you look at some of the docs that match, and paste their content field 
> into the analysis tool, I think you'll see that the problem comes from 
> using the whitespace tokenizer, and is masked by using the WDF 
> after the stop filter ... things like "Que?" are getting ignored by the 
> stop filter, but ultimately winding up in your index as "que"
> 
> 
> -Hoss
> 
> 

Yes, you are right: "que?", "que,", "que..." and so on. I need to change the
analyzer. These tokens are not caught by the stop filter because I use the
whitespace tokenizer. I will use the StandardTokenizer instead.
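
The effect of the tokenizer choice can be seen in a small Python sketch. It
is not Solr code: the stop-word set is a four-word stand-in for stopwords.txt,
and word_tokens is only a crude regex approximation of a tokenizer that drops
punctuation, not StandardTokenizer's actual rules. All function names are
made up for the illustration.

```python
import re

# Stand-in for stopwords.txt (assumption: only the four words discussed here).
STOPWORDS = {"que", "si", "no", "la"}


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays glued to the word."""
    return text.split()


def word_tokens(text):
    """Crude approximation of a tokenizer that discards punctuation."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens):
    """Case-insensitive stop filter: the whole token must equal a stop word."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


text = "Que? que, que... no pasan"
print(remove_stopwords(whitespace_tokens(text)))  # ['Que?', 'que,', 'que...', 'pasan']
print(remove_stopwords(word_tokens(text)))        # ['pasan']
```

With whitespace splitting, only the bare "no" is removed; once punctuation is
dropped before the stop filter runs, every variant of "que" is removed too.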

Thanks Hoss

-- 
View this message in context: http://www.nabble.com/facets-and-stopwords-tp23952823p24390157.html


Re: facets and stopwords

Posted by Chris Hostetter <ho...@fucit.org>.
: http://projecte01.development.barcelonamedia.org/fonetic/
: you will see a "Top Words" list (in Spanish and stemmed). In that list there
: is the word "si", which appears in 20649 documents.
: If you click on this word, the system will perform the query 
:       (x) content:si, which returns no results at all.
: The same goes for "la": it appears in 17881 documents, but the query
: "content:la" returns no results at all.
	...
: To see what's going on in the index, I have tested with the analyzer
: http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
	...
: "las cosas que si no pasan la proxima vez si que no veràs"

But are you sure that example would actually cause a problem?
I suspect that if you index that exact sentence as is, you wouldn't see the 
facet count for "si" or "que" increase at all.

If you do a query for "{!raw field=content}que" you bypass the query 
parser (which respects your stopwords file) and see all docs that 
contain the raw term "que" in the content field.
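
The raw query goes into the q parameter like any other query, as long as it
is URL-encoded. A small Python sketch, assuming the select URL from the
original message; the function name raw_term_url is hypothetical and nothing
is sent over the network here:

```python
from urllib.parse import urlencode, parse_qs, urlparse

# Select URL taken from the original message.
SELECT_URL = "http://projecte01.development.barcelonamedia.org/solr/select/"


def raw_term_url(field, term, rows=10):
    """Build a select URL that looks up one exact indexed term.

    The {!raw} prefix skips query-time analysis, so the stop filter
    never gets a chance to drop the term.
    """
    q = "{!raw field=%s}%s" % (field, term)
    return SELECT_URL + "?" + urlencode({"q": q, "rows": rows, "wt": "json"})


url = raw_term_url("content", "que")
print(url)
# Decoding the query string gives back the original raw query.
print(parse_qs(urlparse(url).query)["q"][0])
```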

If you look at some of the docs that match, and paste their content field 
into the analysis tool, I think you'll see that the problem comes from 
using the whitespace tokenizer, and is masked by using the 
WordDelimiterFilter (WDF) after the stop filter: tokens like "Que?" are 
ignored by the stop filter because they do not match "que" exactly, but 
once the WDF strips the punctuation and the lowercase filter runs, they 
wind up in your index as "que".
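
To make the ordering problem concrete, here is a rough Python simulation of
the first three filters in the chain, in the configured order. It is a
sketch under stated assumptions, not Solr's implementation: the stop-word set
is a stand-in for stopwords.txt, word_delimiter only mimics the WDF's
splitting on punctuation, and the stemming and duplicate-removal steps are
left out.

```python
import re

# Stand-in for stopwords.txt (assumption: only the four words discussed here).
STOPWORDS = {"que", "si", "no", "la"}


def whitespace_tokenize(text):
    return text.split()


def stop_filter(tokens):
    # ignoreCase is true, but the whole token must equal a stop word.
    return [t for t in tokens if t.lower() not in STOPWORDS]


def word_delimiter(tokens):
    # Very rough stand-in for the WDF: keep only runs of word characters.
    out = []
    for t in tokens:
        out.extend(re.findall(r"\w+", t))
    return out


def lowercase(tokens):
    return [t.lower() for t in tokens]


def analyze(text):
    """Apply the filters in the configured order: stop, WDF, lowercase."""
    tokens = whitespace_tokenize(text)
    tokens = stop_filter(tokens)
    tokens = word_delimiter(tokens)
    return lowercase(tokens)


print(analyze("que"))   # []
print(analyze("Que?"))  # ['que']
```

The bare word is dropped, while the punctuated form passes the stop filter
and only becomes "que" afterwards, which is how the term reaches the index
even though a query for content:que is emptied by the same stop filter.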


-Hoss