Posted to solr-user@lucene.apache.org by "Kundig, Andreas" <an...@wipo.int> on 2009/09/25 11:34:14 UTC
problem with HTMLStripStandardTokenizerFactory
Hello
I can't get HTMLStripStandardTokenizerFactory to remove the content of the style tag, as the documentation says it should.
A search for 'mso' returns a document in which the search term appears only inside the style tag (it is a Word document saved as HTML). Here is the highlight returned by Solr (incidentally, the wrong word is highlighted):
"vetica; \n\tpanose-1:2 11 5 4 2 2 2 2 2 4;&<em>#13</em>;\n\tmso-font-charset:0;&<em>#13</em>;\n\tmso-generic-font-family:swiss;&<em>#13</em>"
I am using Solr 1.3. Here is how I configured the tokenizer in schema.xml:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Am I doing something wrong?
Thank you,
Andréas Kündig
World Intellectual Property Organization Disclaimer:
This electronic message may contain privileged, confidential and
copyright protected information. If you have received this e-mail
by mistake, please immediately notify the sender and delete this
e-mail and all its attachments. Please ensure all e-mail attachments
are scanned for viruses prior to opening or using.
Re: problem with HTMLStripStandardTokenizerFactory
Posted by Yonik Seeley <yo...@lucidimagination.com>.
Can you give a small test file that demonstrates the problem?
-Yonik
http://www.lucidimagination.com
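A test file of the kind requested above could be as small as the sketch below. It is a hypothetical reproduction, not the poster's actual document: the HTML content and the names `TEST_HTML` and `visible_text` are assumptions made for illustration. The Python code only shows which text one would expect to survive once the content of the style tag is removed; it relies on Python's standard `html.parser` module and does not reproduce Solr's own HTML stripping.

```python
from html.parser import HTMLParser

# Hypothetical test document: "mso" appears only inside the <style> element.
TEST_HTML = """<html>
<head>
<style>
p { mso-font-charset:0; mso-generic-font-family:swiss; }
</style>
</head>
<body><p>Hello world</p></body>
</html>"""


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <style> or <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <style> or <script>.
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text outside <style>/<script>, with whitespace collapsed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(visible_text(TEST_HTML))
```

If a field of type text_en is indexed with a document like this one and a search for 'mso' still matches it, that would reproduce the reported behaviour in a form that is easy to share on the list.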