Posted to solr-user@lucene.apache.org by Dmitry Kan <so...@gmail.com> on 2014/03/15 18:58:09 UTC

[solr 4.7.0] analysis page: issue with HTMLStripCharFilterFactory

Hello,

The following field type is not analyzed properly on the Solr 4.7.0
analysis page:

    <fieldType name="text_en_splitting" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

Example text:
fox jumps

Screenshot:
http://pbrd.co/1lEVEIa

This works fine in solr 4.6.1.
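
To check whether the char filter alone triggers the problem, a reduced field type along the following lines might help. This is only a sketch: the name text_html_min is hypothetical and not part of the schema above, and it keeps just the char filter and the tokenizer from the index analyzer.

```xml
<!-- Hypothetical reduced type for isolating the problem; the name
     "text_html_min" is an assumption, not taken from the original schema. -->
<fieldType name="text_html_min" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Same char filter and tokenizer as the index analyzer above -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

If the analysis page misbehaves with this type as well, the token filters further down the chain are unlikely to be involved.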

-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan

Re: [solr 4.7.0] analysis page: issue with HTMLStripCharFilterFactory

Posted by Dmitry Kan <so...@gmail.com>.

Stefan,

no worries. The patch has fixed the issue.


On Sun, Mar 16, 2014 at 10:39 PM, Stefan Matheis
<ma...@gmail.com> wrote:

> Oh .. i'm sorry .. late to the party - didn't see the response from Doug
> .. so feel free to ignore that mail (:
>
>
> On Sunday, March 16, 2014 at 9:38 PM, Stefan Matheis wrote:
>
> > Hey Dmitry
> >
> > We had a similar issue reported and already fixed:
> > https://issues.apache.org/jira/browse/SOLR-5800
> > i'd suspect that this patch fixes your issue too? would like to hear
> > back from you, if that's the case :)
> >
> > -Stefan


-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan

Re: [solr 4.7.0] analysis page: issue with HTMLStripCharFilterFactory

Posted by Stefan Matheis <ma...@gmail.com>.

Oh .. i'm sorry .. late to the party - didn't see the response from Doug .. so feel free to ignore that mail (: 


On Sunday, March 16, 2014 at 9:38 PM, Stefan Matheis wrote:

> Hey Dmitry 
> 
> We had a similar issue reported and already fixed: https://issues.apache.org/jira/browse/SOLR-5800
> i'd suspect that this patch fixes your issue too? would like to hear back from you, if that's the case :)
> 
> -Stefan 

Re: [solr 4.7.0] analysis page: issue with HTMLStripCharFilterFactory

Posted by Stefan Matheis <ma...@gmail.com>.

Hey Dmitry 

We had a similar issue reported and already fixed: https://issues.apache.org/jira/browse/SOLR-5800
i'd suspect that this patch fixes your issue too? would like to hear back from you, if that's the case :)

-Stefan 

