Posted to solr-user@lucene.apache.org by JCodina <jo...@barcelonamedia.org> on 2009/07/01 10:44:45 UTC
Re: facets and stopwords
Sorry, I was too cryptic.
If you follow this link
http://projecte01.development.barcelonamedia.org/fonetic/
you will see a "Top Words" list (in Spanish and stemmed). In that list
is the word "si", which is in 20649 documents.
If you click on this word, the system will perform the query
(x) content:si, which returns no results at all.
The same goes for "la": it is in 17881 documents, but the query "content:la"
returns no results at all.
The facets list is generated by the query
http://projecte01.development.barcelonamedia.org/solr/select/?&rows=0&start=0&q=*:*&facet=true&facet.limit=-1&facet.field=content&facet.field=entities_misc&wt=json&json.wrf=jsonp1246437157825&jsoncallback=jsonp1246437157825&_=1246437158023
but the question is: why are these two words (among others) there if they are
stop words?
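As a reading aid, the facet request can be split into its parameters. The sketch below only parses the URL from this post with Python's standard library and does not contact the server. The comments give the usual Solr meaning of each parameter. Facet counts for facet.field are computed from the terms stored in the index, so they reflect index-time analysis only.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the post above.
FACET_URL = (
    "http://projecte01.development.barcelonamedia.org/solr/select/"
    "?&rows=0&start=0&q=*:*&facet=true&facet.limit=-1"
    "&facet.field=content&facet.field=entities_misc&wt=json"
    "&json.wrf=jsonp1246437157825&jsoncallback=jsonp1246437157825"
    "&_=1246437158023"
)

# parse_qs returns a dict mapping each parameter name to a list of values.
params = parse_qs(urlsplit(FACET_URL).query)

# q=*:*           match every document
# rows=0          return no documents, only the facet counts
# facet=true      enable faceting
# facet.limit=-1  do not limit the number of facet values returned
# facet.field     fields whose indexed terms are counted (given twice here)
# wt=json         JSON response; json.wrf wraps it in a callback (JSONP)
for name, values in params.items():
    print(name, values)
```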
To see what's going on in the index, I tested with the analyzer
http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
If I select the field content and enter the text
"las cosas que si no pasan la proxima vez si que no veràs"
I get the following tokens at the end of the analyzer
las cosa pasan proxima vez sí verà
where que, si, no and la are removed because they are treated as stop words.
But in the schema browser
http://projecte01.development.barcelonamedia.org/solr/admin/schema.jsp
for the field content, "que" is the 3rd word, "no" the 4th, and "si" and "la" are
among the top 40 terms.
The analyzer for the content field can be seen on this page and has the following
components:
Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:
1. org.apache.solr.analysis.StopFilterFactory
args:{enablePositionIncrements: true words: stopwords.txt ignoreCase: true }
2. org.apache.solr.analysis.WordDelimiterFilterFactory
args:{catenateWords: 1 catenateNumbers: 1 splitOnCaseChange: 1 catenateAll:
0 generateNumberParts: 1 generateWordParts: 1 }
3. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
4. org.apache.solr.analysis.SnowballPorterFilterFactory args:{languange:
Spanish }
5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{}
The field is indexed, tokenized and stored, and term vectors are stored.
So why are the stopwords in the index?
--
View this message in context: http://www.nabble.com/facets-and-stopwords-tp23952823p24286283.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: facets and stopwords
Posted by JCodina <jo...@barcelonamedia.org>.
hossman wrote:
>
>
> but are you sure that example would actually cause a problem?
> i suspect if you index that exact sentence as is you wouldn't see the
> facet count for "si" or "que" increase at all.
>
> If you do a query for "{!raw field=content}que" you bypass the query
> parser (which is respecting your stopwords file) and see all docs that
> contain the raw term "que" in the content field.
>
> if you look at some of the docs that match, and paste their content field
> into the analysis tool, i think you'll see that the problem comes from
> using the whitespace tokenizer, and is masked by using the WDF
> after the stop filter ... things like "Que?" are getting ignored by the
> stopfilter, but ultimately winding up in your index as "que"
>
>
> -Hoss
>
>
Yes, you are right: "que?", "que,", "que..." I need to change the analyzer. They are
not caught by the stop filter because I use the whitespace
tokenizer. I will use the StandardTokenizer.
Thanks Hoss
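A minimal sketch of why a punctuation-aware tokenizer helps, assuming a tiny stand-in stopword list and a regex that only roughly approximates what StandardTokenizer does (it is not the real Lucene implementation). Because punctuation is dropped before the stop filter runs, "Que?" reaches the filter as "Que" and is removed.

```python
import re

# Assumption: a tiny stand-in for the real stopwords.txt.
STOPWORDS = {"que", "si", "no", "la"}

def word_tokenize(text):
    """Rough approximation of a punctuation-aware tokenizer:
    keep only runs of letters and digits."""
    return re.findall(r"[^\W_]+", text)

def stop_filter(tokens):
    """Drop stopwords, comparing case-insensitively (ignoreCase: true)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    """Tokenize, remove stopwords, then lowercase."""
    return [t.lower() for t in stop_filter(word_tokenize(text))]

print(analyze("Que? si, la casa"))  # ['casa']
```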
--
View this message in context: http://www.nabble.com/facets-and-stopwords-tp23952823p24390157.html
Re: facets and stopwords
Posted by Chris Hostetter <ho...@fucit.org>.
: http://projecte01.development.barcelonamedia.org/fonetic/
: you will see a "Top Words" list (in Spanish and stemmed) in the list there
: is the word "si" which is in 20649 documents.
: If you click at this word, the system will perform the query
: (x) content:si, with no answers at all
: The same for "la" it is in 17881 documents, but the query "content:la" will
: give no answers at all
...
: To see what's going on on the index I have tested with the analyzer
: http://projecte01.development.barcelonamedia.org/solr/admin/analysis.jsp
...
: "las cosas que si no pasan la proxima vez si que no veràs"
but are you sure that example would actually cause a problem?
i suspect if you index that exact sentence as is you wouldn't see the
facet count for "si" or "que" increase at all.
If you do a query for "{!raw field=content}que" you bypass the query
parser (which is respecting your stopwords file) and see all docs that
contain the raw term "que" in the content field.
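For illustration, here is one way to build such a request with correct URL encoding. This is only a sketch: the select URL is the one from the original post, while the helper name raw_term_query_url and the rows and fl values are assumptions made for the example. The code builds the URL and does not send it.

```python
from urllib.parse import urlencode

# Select handler URL taken from the original post.
SOLR_SELECT = "http://projecte01.development.barcelonamedia.org/solr/select/"

def raw_term_query_url(field, term, rows=10):
    """Hypothetical helper: build a select URL that matches the exact
    indexed term, so no query-time analysis (and no stop filter) runs."""
    params = {
        "q": "{!raw field=%s}%s" % (field, term),
        "rows": rows,   # assumption: a few documents are enough to inspect
        "fl": field,    # assumption: return only the field under study
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = raw_term_query_url("content", "que")
print(url)
```

The content field of the returned documents can then be pasted into the analysis page to see which input token ended up as "que".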
if you look at some of the docs that match, and paste their content field
into the analysis tool, i think you'll see that the problem comes from
using the whitespace tokenizer, and is masked by using the WDF
after the stop filter ... things like "Que?" are getting ignored by the
stopfilter, but ultimately winding up in your index as "que"
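The effect can be reproduced with a small simulation of the filter order. This is a sketch under loud assumptions: the stopword set is a tiny stand-in for stopwords.txt, word_delimiter is a crude imitation of WordDelimiterFilterFactory that only strips punctuation, and the stemming and duplicate-removal steps are left out.

```python
import re

# Assumption: a tiny stand-in for the real stopwords.txt.
STOPWORDS = {"que", "si", "no", "la"}

def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()

def stop_filter(tokens):
    """Drop tokens that match a stopword exactly (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def word_delimiter(tokens):
    """Crude imitation of the WDF: keep only runs of letters and digits."""
    out = []
    for t in tokens:
        out.extend(re.findall(r"[^\W_]+", t))
    return out

def analyze(text):
    """Same order as the field type: tokenizer, stop filter, WDF, lowercase."""
    tokens = word_delimiter(stop_filter(whitespace_tokenize(text)))
    return [t.lower() for t in tokens]

# Bare stopwords are removed as expected.
print(analyze("que si no la"))      # []
# "Que?" and "si," do not match the stop list, pass the stop filter,
# and are only cleaned up afterwards, so they are indexed as stopwords.
print(analyze("Que? si, la casa"))  # ['que', 'si', 'casa']
```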
-Hoss